Deep Neural Network: Cost Functions and Output Units
Jiaming Lin (jmlin@arbor.ee.ntu.edu.tw)
DATALab@III, NetDBLab@NTU
January 9, 2017
Outline
1. Introduction
2. Output Units and Cost Functions (Binary, Multinoulli)
3. Deterministic and Generic Model
4. Conclusions and Discussions
Introduction
In neural network learning...
The selection of the output unit depends on the learning problem.
– Classification: sigmoid, softmax, or linear.
– Linear regression: linear.
Determine and analyse the cost function.
– Is the cost function analytic?†
– Can the learning progress well (first-order derivative)?
Deterministic and generic models.
– Data are more complicated in many cases.
Note: †For simplicity, we say a function is analytic to mean it is infinitely differentiable on its domain.
Output Units and Cost Functions
Binary
index | x1 | … | xn | target
1     | 0  | … | 1  | Class A
2     | 1  | … | 0  | Class B
3     | 1  | … | 1  | Class A
…     | …  | … | …  | …
m     | 0  | … | 0  | Class B
The output unit is ŷ = S(z), where S is the sigmoid function and z is the input of the output layer,

z = w⊤h + b  (1)

where w is the weight, h is the output of the hidden layer, and b is the bias.
Cost Function
The cost function can be derived in many ways; we discuss two of the most common:
– Mean Square Error
– Cross Entropy
Mean Square Error
Let y^(i) denote the data label and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by

C_mse = (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))²  (2)

where m is the data size, and z^(i), y^(i), ŷ^(i) are real numbers.
Cross Entropy
Adapting the symbols above, the cost function defined by cross entropy is

C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ]

where m is the data size, and z^(i), y^(i), ŷ^(i) are real numbers. Under this sign convention C_ce is a log-likelihood (≤ 0), so learning maximizes it; equivalently, one minimizes −C_ce.
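As a quick numerical sketch (hypothetical toy labels and scores, NumPy), the two costs above can be computed as follows; the cross entropy is kept in the deck's log-likelihood sign convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy batch: labels y and output-layer inputs z.
y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.5, 0.3, -3.0])
y_hat = sigmoid(z)                      # predictions S(z)

c_mse = np.mean((y - y_hat) ** 2)       # eq. (2)
# Cross entropy as written above (a log-likelihood, so <= 0).
c_ce = np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(c_mse, c_ce)
```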
Comparison between MSE and Cross Entropy
Problem: which one is better? We compare them on two criteria:
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
Analyticity:
C_mse = (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))²,
C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ].

Computationally, the value of ŷ^(i) = S(z^(i)) can overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1},
– C_ce is undefined when ŷ^(i) is 0 or 1;
– C_mse is polynomial in ŷ^(i) and thus analytic everywhere.
Learning ability: compare the gradients

∂C_mse/∂w = [S(z) − y][1 − S(z)]S(z)h,  (3)
∂C_ce/∂w = [y − S(z)]h,  (4)

respectively, where S is the sigmoid and z = w⊤h + b.
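A small check of the two gradients (hypothetical values; NumPy; h is taken as 1 and the 1/m factor is dropped) at a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y, h=1.0):
    # eq. (3): [S(z) - y][1 - S(z)]S(z)h
    s = sigmoid(z)
    return (s - y) * (1 - s) * s * h

def grad_ce(z, y, h=1.0):
    # eq. (4): [y - S(z)]h
    return (y - sigmoid(z)) * h

# Confidently wrong prediction: y = 1 but z very negative.
print(grad_mse(-20.0, 1.0))  # vanishes: learning stalls
print(grad_ce(-20.0, 1.0))   # close to 1: a useful step
```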
MSE gradient [S(z) − y][1 − S(z)]S(z)h:
– If y = 1 and ŷ → 1, the step → 0.
– If y = 1 and ŷ → 0, the step → 0.
– If y = 0 and ŷ → 1, the step → 0.
– If y = 0 and ŷ → 0, the step → 0.
Cross-entropy gradient [y − S(z)]h:
– If y = 1 and ŷ → 1, the step → 0.
– If y = 1 and ŷ → 0, the step → 1.
– If y = 0 and ŷ → 1, the step → −1.
– If y = 0 and ŷ → 0, the step → 0.
In the case of mean square error, learning gets stuck when z is very positive or very negative.
The Unstable Issue in Cross Entropy
We have mentioned the instability issue of cross entropy. Precisely,
– ŷ = S(z) underflows to 0 when z is very negative,
– ŷ = S(z) overflows to 1 when z is very positive.
Therefore, given a fixed y ∈ {0, 1}, the function

C = y ln ŷ + (1 − y) ln(1 − ŷ)

can be undefined when z is very positive or very negative.
Alternatively, regarding z as the variable of the cross entropy,

C = y ln S(z) + (1 − y) ln(1 − S(z))  (5)
  = −ζ(−z) + z(y − 1),  (6)

where ζ is the softplus function and z is a real number.
We may obtain the analyticity of C directly: the softplus and the linear term z(y − 1) are both analytic, so C is analytic in z, with dC/dz = y − S(z) defined everywhere.
In the case of the right answer:
– y = 1 and ŷ = S(z) → 1 ⇒ z → ∞, C → 0,
– y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.
In the case of the wrong answer:
– y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, dC/dz → 1,
– y = 0 and ŷ = S(z) → 1 ⇒ z → ∞, dC/dz → −1.
(From (6), dC/dz = y − S(z), so the wrong-answer gradients approach ±1 rather than vanishing.)
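The softplus form (6) is how a numerically stable "cross entropy with logits" can be implemented. A minimal sketch (NumPy; the stable softplus identity max(x, 0) + log1p(e^{−|x|}) is a standard trick, not from the slides):

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus zeta(x) = ln(1 + e^x).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def cross_entropy_naive(z, y):
    # eq. (5), computed through S(z): unstable for large |z|.
    s = 1.0 / (1.0 + np.exp(-z))
    return y * np.log(s) + (1 - y) * np.log(1 - s)

def cross_entropy_stable(z, y):
    # eq. (6): C = -zeta(-z) + z(y - 1), stable for all z.
    return -softplus(-z) + z * (y - 1)

z, y = 40.0, 0.0
print(cross_entropy_naive(z, y))   # -inf: S(z) rounds to 1, log(0)
print(cross_entropy_stable(z, y))  # ~ -40: finite and correct
```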
Multinoulli: Output Unit and Cost Function
Generalize the binary case to multiple classes:
– Linear output units, with #(output units) = #(classes).
– Cost function evaluated by cross entropy.
Cost Function in Multinoulli Problems
Suppose the size of the dataset is m and there are K classes; then we can obtain the cost function from cross entropy:

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} ln [ exp(z_k^(i)) / Σ_{j=1}^{K} exp(z_j^(i)) ]  (7)

where z_k^(i) = w_k⊤ h^(i) + b_k and h^(i) is the output of the hidden layer corresponding to example x^(i).
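Equation (7) can be sketched directly (hypothetical toy batch; NumPy; the max-shift inside softmax is a standard stability trick):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical batch: m = 3 examples, K = 4 classes.
Z = np.array([[2.0, 0.1, -1.0, 0.5],
              [0.2, 3.0, 0.0, -2.0],
              [1.0, 1.0, 1.0, 1.0]])
labels = np.array([0, 1, 3])       # y^(i) = k, zero-indexed

P = softmax(Z)
# eq. (7): only the true-class term survives the indicator.
cost = -np.sum(np.log(P[np.arange(len(labels)), labels]))
print(cost)
```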
A Lemma for Simplifying the Cost Function
We again check the same two criteria:
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
To establish these properties, we first need a lemma.
Lemma 1
For the output z = w⊤h + b with z = [z_1, …, z_K], we have

ln Σ_{j=1}^{K} exp(z_j) ≈ max_j {z_j},  (8)

with the approximation becoming exact as the gap between the largest z_j and the others grows.
Proof.
Without loss of generality, we may assume z_1 > … > z_K; the remaining work is to show that, for every ε > 0 (once the gaps z_1 − z_j are large enough),

ln [ e^{z_1} ( 1 + Σ_{j=2}^{K} e^{z_j − z_1} ) ] = z_1 + ln ( 1 + Σ_{j=2}^{K} e^{z_j − z_1} ) ≤ z_1 + ε.

Intuitively, ln Σ_{j=1}^{K} exp(z_j) can be well approximated by max_j {z_j}.
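The same max-shift used in the proof is the basis of the stable "log-sum-exp" computation; a sketch (hypothetical scores; NumPy):

```python
import numpy as np

def logsumexp(z):
    # Stable ln sum_j exp(z_j): factor out the max, as in the proof.
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([10.0, 2.0, 1.0, -3.0])  # hypothetical outputs
print(logsumexp(z))    # only slightly above the max
print(z.max())

# A naive sum of exp() would overflow here; the shifted form does not.
big = np.array([1000.0, 999.0])
print(logsumexp(big))  # finite
```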
Analyticity
We may rewrite the cost function as

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} [ z_k^(i) − ln Σ_{j=1}^{K} exp(z_j^(i)) ].

Each summand is a subtraction of analytic functions and thus analytic, and the term 1{y^(i) = k} is actually a constant. The total cost is a sum of analytic functions and thus analytic.
Learning Ability
Property 2
By the rule of sums in derivatives, we may simplify (7) one example at a time; up to the overall sign, the cost contributed by example x^(i) to the total cost C is

C^(i) = Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ].

1. If the model gives the right answer, the error is close to 0.
2. If the model gives the wrong answer, the learning can progress well.
Proof (The Right Answer).
Suppose the true label is class n. By the assumption, we know z_n is maximal. Then

−ε ≤ Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ] = z_n − ln Σ_{j=1}^{K} exp(z_j) < z_n − max_j {z_j} = 0.

This shows that −ε ≤ C^(i) < 0 for arbitrarily small ε.
Proof (The Wrong Answer).
Suppose the true label is class n. By assumption, the prediction z_n given by the model is not maximal. Using the fact that

z_n ≠ max_j {z_j} ⇒ softmax(z_n) ≪ 1,

there exists a sufficiently large δ > 0 such that | softmax(z_n) − 1 | > δ.
Proof (The Wrong Answer, cont.)
Then

∂C^(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^{K} e^{z_j} ] = 1 − softmax(z_n) > δ.

This shows the gradient is sufficiently large and also predictable (bounded by 1), therefore the learning can progress well.
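A quick numerical check of this gradient bound (hypothetical logits; NumPy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_wrt_true_logit(z, n):
    # dC^(i)/dz_n = 1 - softmax(z)_n for the true class n.
    return 1.0 - softmax(z)[n]

z = np.array([5.0, -1.0, 0.5])
print(grad_wrt_true_logit(z, 0))  # right answer: gradient near 0
print(grad_wrt_true_logit(z, 1))  # wrong answer: gradient near 1
```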
Deterministic and Generic Model
Learning Processes Overview
Deterministic vs. Generic:
– Step 1: model function (linear, sigmoid) vs. probability distribution (Gaussian, Bernoulli).
– Step 2: design error evaluation (MSE, cross entropy) vs. maximum likelihood estimation.
– Step 3: learn one statistic (mean, median) vs. learn the full distribution.
To describe some complicated data, it is easier to build the model with the generic method.
Generic Modeling for Binary Classification
Step 1: Use the Bernoulli distribution as the likelihood function:

p(y | x) = p^y (1 − p)^{1−y} = S(z)^y (1 − S(z))^{1−y}.

Step 2: Maximize the log-likelihood (equivalently, minimize its negative):

ln p(y | x^(i)) = y ln S(z) + (1 − y) ln(1 − S(z)).

Step 3: We can learn the full distribution:

p(y | x′) = S(z′)^y (1 − S(z′))^{1−y},

where z′ = w⊤x′ + b and S is the sigmoid.
Generic Modeling for Linear Regression: Step 1
Given a training feature x, use a Gaussian distribution as the likelihood function:

p(y | x) = (1/√(2πσ²)) exp( −(µ − y)² / (2σ²) ),

where, denoting the output of the hidden layer by h_x, the weights w = [w_1, w_2] and the biases b = [b_1, b_2], we set

µ = w_1⊤ h_x + b_1,
σ = w_2⊤ h_x + b_2.

Intuitively, µ and σ are two linear output units; they are functions of x.
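A minimal sketch of Step 1 (all values hypothetical, toy sizes; NumPy). Note that, as on the slide, σ is a raw linear unit, so nothing here constrains it to be positive:

```python
import numpy as np

# Hypothetical hidden-layer output h_x and head parameters.
h_x = np.array([0.5, -1.0, 2.0])
w1, b1 = np.array([0.3, 0.1, -0.2]), 0.1   # linear head for mu
w2, b2 = np.array([0.0, 0.2, 0.4]), 1.0    # linear head for sigma

mu = w1 @ h_x + b1       # mu(x)
sigma = w2 @ h_x + b2    # sigma(x), a raw linear unit as on the slide

def gaussian_likelihood(y, mu, sigma):
    # p(y | x) for the Gaussian likelihood above.
    return np.exp(-(mu - y) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(mu, sigma, gaussian_likelihood(0.0, mu, sigma))
```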
Generic Modeling for Linear Regression: Step 2
Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,

(µ, σ) = argmin_{(µ,σ)} ( − Σ_x ln p(y | x) ).

However, for each summand,

C_x = ln p(y | x) = −(1/2) [ ln(2πσ²) + (µ − y)²/σ² ],
∂C_x/∂σ = −σ⁻¹ + (µ − y)² σ⁻³,

so the gradients and errors become unstable when σ is close to 0.
Generic Modeling for Linear Regression: Step 2 (cont.)
To prevent the gradients and errors from becoming unstable, we may substitute v for the term 1/(2σ²); then for each summand,

C_x = (1/2) ln(v/π) − v(µ − y)²,
∂C_x/∂µ = −2v(µ − y),
∂C_x/∂v = 1/(2v) − (µ − y)².

Note that this substitution is valid only when the variance is not too large: as σ² grows, v → 0 and the 1/(2v) term itself becomes unstable.
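A numerical comparison of the two parametrizations near σ = 0 (hypothetical values; plain Python):

```python
mu, y = 0.5, 0.4
sigma = 1e-4                   # near-degenerate variance
v = 1.0 / (2.0 * sigma ** 2)   # substituted parameter

# Gradient in the sigma parametrization: -1/sigma + (mu - y)^2 / sigma^3
g_sigma = -1.0 / sigma + (mu - y) ** 2 / sigma ** 3
# Gradient in the v parametrization: 1/(2v) - (mu - y)^2
g_v = 1.0 / (2.0 * v) - (mu - y) ** 2

print(g_sigma)  # enormous: unstable as sigma -> 0
print(g_v)      # order 1e-2: stable
```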
Generic Modeling for Linear Regression: Step 2 (fixed variance)
If the variance σ is fixed and chosen by the user, then by comparing the negative log-likelihood with MSE we can see that minimizing the NLL is equivalent to minimizing the MSE:

C_mse = (1/m) Σ_{i=1}^{m} ‖y^(i) − ŷ^(i)‖²,
C_nll = − Σ_{i=1}^{m} C_{x^(i)} = (1/2) [ m ln(2πσ²) + Σ_{i=1}^{m} ‖µ_{x^(i)} − y^(i)‖² / σ² ].

With σ fixed, C_nll is an increasing affine function of Σ_i ‖µ_{x^(i)} − y^(i)‖², so the two objectives share the same minimizer.
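A quick check of this equivalence (hypothetical data; NumPy): with σ fixed, C_nll equals a·C_mse + b for constants a > 0 and b:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)               # hypothetical targets
mu = y + 0.3 * rng.normal(size=100)    # hypothetical predictions mu_x
sigma = 2.0                            # fixed, user-chosen
m = len(y)

c_mse = np.mean((y - mu) ** 2)
c_nll = 0.5 * (m * np.log(2 * np.pi * sigma ** 2)
               + np.sum((mu - y) ** 2) / sigma ** 2)

# With sigma fixed, C_nll = a * C_mse + b, so both share a minimizer.
a = m / (2 * sigma ** 2)
b = 0.5 * m * np.log(2 * np.pi * sigma ** 2)
print(np.isclose(c_nll, a * c_mse + b))  # True
```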
Generic Modeling for Linear Regression: Step 3
– Generic: learns the full distribution (here, µ and σ).
– Deterministic: learns a single statistic (here, µ).
Experiment (ref): generate random data based on the formula

y = x + 7.0 sin(0.75x) + ε,

where ε is Gaussian noise with µ = 0, σ = 1.
FNN config: #(hidden layers) = 1, width = 20, tanh hidden units.
[Figures: the generic and the deterministic fits.]
More Complicated Cases
– Complicated data distributions.
– In some cases, it is almost impossible to describe the data via deterministic methods.
– Generic methods may perform better in complicated cases.
Mixture Density Network
Generate random data based on the formula

x = y + 7.0 sin(0.75y) + ε,

where ε is Gaussian noise with µ = 0, σ = 1.
First, try using MSE to define the cost function, with one hidden layer of width 20 and tanh hidden units.
The reason this fails is that minimizing MSE is equivalent to minimizing the negative log-likelihood of a single Gaussian, which cannot capture a multimodal conditional distribution.
The mixture density network: the Gaussian mixture with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; µ^(i)(x), Σ^(i)(x)).  (9)

Network configuration:
1. The number of components n needs to be fine-tuned (trial and error).
2. 3 × n output units.
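Equation (9) can be sketched for a single input x (all output values hypothetical; NumPy; exponentiating the scale outputs to keep them positive is a common choice, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gaussian_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def mdn_density(y, raw):
    # raw: the 3n linear outputs for one input x ->
    # n mixture logits, n means, n log-scales.
    n = len(raw) // 3
    weights = softmax(raw[:n])       # p(c = i | x), sums to 1
    mus = raw[n:2 * n]               # mu^(i)(x)
    sigmas = np.exp(raw[2 * n:])     # keep scales positive
    return np.sum(weights * gaussian_pdf(y, mus, sigmas))   # eq. (9)

raw = np.array([0.2, -0.1, 0.4,    # mixture logits (hypothetical)
                -2.0, 0.0, 2.0,    # means
                0.0, 0.1, -0.2])   # log-scales
print(mdn_density(0.5, raw))
```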
Experiment (ref):
– #(components) = 24,
– two hidden layers with width = 24 and tanh activation,
– #(output units) = 3 × 24, all linear.
Conclusions and Discussions
– In classification problems, cross entropy is a more natural way to evaluate errors than the other methods.
– An improved cross-entropy formulation avoids numerical instability.
  – The MNIST example from TensorFlow.
– To determine whether a cost function is good:
  – Is the cost function analytic?
  – Can the learning progress well?
– Deterministic vs. generic:
  – Deterministic learns a single statistic, while generic learns the full distribution.
  – When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
  – Generic methods are easier to apply to complicated cases.
Thank you.