Deep Neural Network: Cost Functions and Output Units
Jiaming Lin (jmlin@arbor.ee.ntu.edu.tw)
DATALab@III, NetDBLab@NTU
January 9, 2017
Outline
1. Introduction
2. Output Units and Cost Functions (Binary, Multinoulli)
3. Deterministic and Generic Model
4. Conclusions and Discussions
Introduction
In neural network learning...
The selection of the output unit depends on the learning problem.
– Classification: sigmoid, softmax, or linear.
– Linear regression: linear.
Determine and analyse the cost function.
– Is the cost function analytic?†
– Can the learning progress well (first-order derivative)?
Deterministic and generic models.
– Data are more complicated in many cases.
Note: †For simplicity, we say a function is analytic to mean it is infinitely differentiable on its domain.
Output Units and Cost Functions
Binary
index | x1 | … | xn | target
1     | 0  | … | 1  | Class A
2     | 1  | … | 0  | Class B
3     | 1  | … | 1  | Class A
…     | …  | … | …  | …
m     | 0  | … | 0  | Class B
The output unit is ŷ = S(z), where S is the sigmoid function and z is the input of the output layer,

z = w⊤h + b  (1)

where w is the weight, h is the output of the hidden layer, and b is the bias.
Cost Function
The cost function can be derived in many ways; we discuss two of the most common:
– Mean Square Error
– Cross Entropy
Mean Square Error
Let y^(i) denote the data label and ŷ^(i) = S(z^(i)) the prediction. We may define the cost function C_mse by

C_mse = (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))²  (2)

where m is the data size, and z^(i), y^(i), ŷ^(i) are real numbers.
Cross Entropy
Adapting the symbols above, the cost function defined by cross entropy is

C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ]

where m is the data size, and z^(i), y^(i), ŷ^(i) are real numbers. Under this sign convention C_ce is a log-likelihood (≤ 0), so learning maximizes it; equivalently, one minimizes −C_ce.
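As a quick numerical sketch (hypothetical toy labels and scores, NumPy), the two costs above can be computed as follows; the cross entropy is kept in the deck's log-likelihood sign convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy batch: labels y and output-layer inputs z.
y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.5, 0.3, -3.0])
y_hat = sigmoid(z)                      # predictions S(z)

c_mse = np.mean((y - y_hat) ** 2)       # eq. (2)
# Cross entropy as written above (a log-likelihood, so <= 0).
c_ce = np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(c_mse, c_ce)
```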
Comparison between MSE and Cross Entropy
Problem: which one is better? We compare them on two criteria:
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
Analyticity:
C_mse = (1/m) Σ_{i=1}^{m} (y^(i) − ŷ^(i))²,
C_ce = (1/m) Σ_{i=1}^{m} [ y^(i) ln ŷ^(i) + (1 − y^(i)) ln(1 − ŷ^(i)) ].

Computationally, the value of ŷ^(i) = S(z^(i)) can overflow to 1 or underflow to 0 when z^(i) is very positive or very negative. Therefore, given a fixed y^(i) ∈ {0, 1},
– C_ce is undefined when ŷ^(i) is 0 or 1;
– C_mse is polynomial in ŷ^(i) and thus analytic everywhere.
Learning ability: compare the gradients

∂C_mse/∂w = [S(z) − y][1 − S(z)]S(z)h,  (3)
∂C_ce/∂w = [y − S(z)]h,  (4)

respectively, where S is the sigmoid and z = w⊤h + b.
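A small check of the two gradients (hypothetical values; NumPy; h is taken as 1 and the 1/m factor is dropped) at a confidently wrong prediction:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y, h=1.0):
    # eq. (3): [S(z) - y][1 - S(z)]S(z)h
    s = sigmoid(z)
    return (s - y) * (1 - s) * s * h

def grad_ce(z, y, h=1.0):
    # eq. (4): [y - S(z)]h
    return (y - sigmoid(z)) * h

# Confidently wrong prediction: y = 1 but z very negative.
print(grad_mse(-20.0, 1.0))  # vanishes: learning stalls
print(grad_ce(-20.0, 1.0))   # close to 1: a useful step
```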
MSE gradient [S(z) − y][1 − S(z)]S(z)h:
– If y = 1 and ŷ → 1, the step → 0.
– If y = 1 and ŷ → 0, the step → 0.
– If y = 0 and ŷ → 1, the step → 0.
– If y = 0 and ŷ → 0, the step → 0.
Cross-entropy gradient [y − S(z)]h:
– If y = 1 and ŷ → 1, the step → 0.
– If y = 1 and ŷ → 0, the step → 1.
– If y = 0 and ŷ → 1, the step → −1.
– If y = 0 and ŷ → 0, the step → 0.
In the case of mean square error, learning gets stuck when z is very positive or very negative.
The Unstable Issue in Cross Entropy
We have mentioned the instability issue of cross entropy. Precisely,
– ŷ = S(z) underflows to 0 when z is very negative,
– ŷ = S(z) overflows to 1 when z is very positive.
Therefore, given a fixed y ∈ {0, 1}, the function

C = y ln ŷ + (1 − y) ln(1 − ŷ)

can be undefined when z is very positive or very negative.
Alternatively, regarding z as the variable of the cross entropy,

C = y ln S(z) + (1 − y) ln(1 − S(z))  (5)
  = −ζ(−z) + z(y − 1),  (6)

where ζ is the softplus function and z is a real number.
We may obtain the analyticity of C directly: the softplus and the linear term z(y − 1) are both analytic, so C is analytic in z, with dC/dz = y − S(z) defined everywhere.
In the case of the right answer:
– y = 1 and ŷ = S(z) → 1 ⇒ z → ∞, C → 0,
– y = 0 and ŷ = S(z) → 0 ⇒ z → −∞, C → 0.
In the case of the wrong answer:
– y = 1 and ŷ = S(z) → 0 ⇒ z → −∞, dC/dz → 1,
– y = 0 and ŷ = S(z) → 1 ⇒ z → ∞, dC/dz → −1.
(From (6), dC/dz = y − S(z), so the wrong-answer gradients approach ±1 rather than vanishing.)
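The softplus form (6) is how a numerically stable "cross entropy with logits" can be implemented. A minimal sketch (NumPy; the stable softplus identity max(x, 0) + log1p(e^{−|x|}) is a standard trick, not from the slides):

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus zeta(x) = ln(1 + e^x).
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def cross_entropy_naive(z, y):
    # eq. (5), computed through S(z): unstable for large |z|.
    s = 1.0 / (1.0 + np.exp(-z))
    return y * np.log(s) + (1 - y) * np.log(1 - s)

def cross_entropy_stable(z, y):
    # eq. (6): C = -zeta(-z) + z(y - 1), stable for all z.
    return -softplus(-z) + z * (y - 1)

z, y = 40.0, 0.0
print(cross_entropy_naive(z, y))   # -inf: S(z) rounds to 1, log(0)
print(cross_entropy_stable(z, y))  # ~ -40: finite and correct
```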
Multinoulli: Output Unit and Cost Function
Generalize the binary case to multiple classes:
– Linear output units, with #(output units) = #(classes).
– Cost function evaluated by cross entropy.
Cost Function in Multinoulli Problems
Suppose the size of the dataset is m and there are K classes; then we can obtain the cost function from cross entropy:

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} ln [ exp(z_k^(i)) / Σ_{j=1}^{K} exp(z_j^(i)) ]  (7)

where z_k^(i) = w_k⊤ h^(i) + b_k and h^(i) is the output of the hidden layer corresponding to example x^(i).
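Equation (7) can be sketched directly (hypothetical toy batch; NumPy; the max-shift inside softmax is a standard stability trick):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for stability
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical batch: m = 3 examples, K = 4 classes.
Z = np.array([[2.0, 0.1, -1.0, 0.5],
              [0.2, 3.0, 0.0, -2.0],
              [1.0, 1.0, 1.0, 1.0]])
labels = np.array([0, 1, 3])       # y^(i) = k, zero-indexed

P = softmax(Z)
# eq. (7): only the true-class term survives the indicator.
cost = -np.sum(np.log(P[np.arange(len(labels)), labels]))
print(cost)
```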
A Lemma for Simplifying the Cost Function
We again check the same two criteria:
– Analyticity (infinitely differentiable)
– Learning ability (first-order derivatives)
To establish these properties, we first need a lemma.
Lemma 1
For the output z = w⊤h + b with z = [z_1, …, z_K], we have

ln Σ_{j=1}^{K} exp(z_j) ≈ max_j {z_j},  (8)

with the approximation becoming exact as the gap between the largest z_j and the others grows.
Proof.
Without loss of generality, we may assume z_1 > … > z_K; the remaining work is to show that, for every ε > 0 (once the gaps z_1 − z_j are large enough),

ln [ e^{z_1} ( 1 + Σ_{j=2}^{K} e^{z_j − z_1} ) ] = z_1 + ln ( 1 + Σ_{j=2}^{K} e^{z_j − z_1} ) ≤ z_1 + ε.

Intuitively, ln Σ_{j=1}^{K} exp(z_j) can be well approximated by max_j {z_j}.
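The same max-shift used in the proof is the basis of the stable "log-sum-exp" computation; a sketch (hypothetical scores; NumPy):

```python
import numpy as np

def logsumexp(z):
    # Stable ln sum_j exp(z_j): factor out the max, as in the proof.
    m = z.max()
    return m + np.log(np.sum(np.exp(z - m)))

z = np.array([10.0, 2.0, 1.0, -3.0])  # hypothetical outputs
print(logsumexp(z))    # only slightly above the max
print(z.max())

# A naive sum of exp() would overflow here; the shifted form does not.
big = np.array([1000.0, 999.0])
print(logsumexp(big))  # finite
```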
Analyticity
We may rewrite the cost function as

C(w) = − Σ_{i=1}^{m} Σ_{k=1}^{K} 1{y^(i) = k} [ z_k^(i) − ln Σ_{j=1}^{K} exp(z_j^(i)) ].

Each summand is a subtraction of analytic functions and thus analytic, and the term 1{y^(i) = k} is actually a constant. The total cost is a sum of analytic functions and thus analytic.
Learning Ability
Property 2
By the rule of sums in derivatives, we may simplify (7) one example at a time; up to the overall sign, the cost contributed by example x^(i) to the total cost C is

C^(i) = Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ].

1. If the model gives the right answer, the error is close to 0.
2. If the model gives the wrong answer, the learning can progress well.
Proof (The Right Answer).
Suppose the true label is class n. By the assumption, we know z_n is maximal. Then

−ε ≤ Σ_{k=1}^{K} 1{y = k} [ z_k − ln Σ_{j=1}^{K} exp(z_j) ] = z_n − ln Σ_{j=1}^{K} exp(z_j) < z_n − max_j {z_j} = 0.

This shows that −ε ≤ C^(i) < 0 for arbitrarily small ε.
Proof (The Wrong Answer).
Suppose the true label is class n. By assumption, the prediction z_n given by the model is not maximal. Using the fact that

z_n ≠ max_j {z_j} ⇒ softmax(z_n) ≪ 1,

there exists a sufficiently large δ > 0 such that | softmax(z_n) − 1 | > δ.
Proof (The Wrong Answer, cont.)
Then

∂C^(i)/∂z_n = ∂/∂z_n [ z_n − ln Σ_{j=1}^{K} e^{z_j} ] = 1 − softmax(z_n) > δ.

This shows the gradient is sufficiently large and also predictable (bounded by 1), therefore the learning can progress well.
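A quick numerical check of this gradient bound (hypothetical logits; NumPy):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_wrt_true_logit(z, n):
    # dC^(i)/dz_n = 1 - softmax(z)_n for the true class n.
    return 1.0 - softmax(z)[n]

z = np.array([5.0, -1.0, 0.5])
print(grad_wrt_true_logit(z, 0))  # right answer: gradient near 0
print(grad_wrt_true_logit(z, 1))  # wrong answer: gradient near 1
```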
Deterministic and Generic Model
Learning Processes Overview
Deterministic vs. Generic:
– Step 1: model function (linear, sigmoid) vs. probability distribution (Gaussian, Bernoulli).
– Step 2: design error evaluation (MSE, cross entropy) vs. maximum likelihood estimation.
– Step 3: learn one statistic (mean, median) vs. learn the full distribution.
To describe some complicated data, it is easier to build the model with the generic method.
Generic Modeling for Binary Classification
Step 1: Use the Bernoulli distribution as the likelihood function:

p(y | x) = p^y (1 − p)^{1−y} = S(z)^y (1 − S(z))^{1−y}.

Step 2: Maximize the log-likelihood (equivalently, minimize its negative):

ln p(y | x^(i)) = y ln S(z) + (1 − y) ln(1 − S(z)).

Step 3: We can learn the full distribution:

p(y | x′) = S(z′)^y (1 − S(z′))^{1−y},

where z′ = w⊤x′ + b and S is the sigmoid.
Generic Modeling for Linear Regression: Step 1
Given a training feature x, use a Gaussian distribution as the likelihood function:

p(y | x) = (1/√(2πσ²)) exp( −(µ − y)² / (2σ²) ),

where, denoting the output of the hidden layer by h_x, the weights w = [w_1, w_2] and the biases b = [b_1, b_2], we set

µ = w_1⊤ h_x + b_1,
σ = w_2⊤ h_x + b_2.

Intuitively, µ and σ are two linear output units; they are functions of x.
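A minimal sketch of Step 1 (all values hypothetical, toy sizes; NumPy). Note that, as on the slide, σ is a raw linear unit, so nothing here constrains it to be positive:

```python
import numpy as np

# Hypothetical hidden-layer output h_x and head parameters.
h_x = np.array([0.5, -1.0, 2.0])
w1, b1 = np.array([0.3, 0.1, -0.2]), 0.1   # linear head for mu
w2, b2 = np.array([0.0, 0.2, 0.4]), 1.0    # linear head for sigma

mu = w1 @ h_x + b1       # mu(x)
sigma = w2 @ h_x + b2    # sigma(x), a raw linear unit as on the slide

def gaussian_likelihood(y, mu, sigma):
    # p(y | x) for the Gaussian likelihood above.
    return np.exp(-(mu - y) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(mu, sigma, gaussian_likelihood(0.0, mu, sigma))
```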
Generic Modeling for Linear Regression: Step 2
Recall that the maximum likelihood estimate is equivalent to minimizing the negative log-likelihood, that is,

(µ, σ) = argmin_{(µ,σ)} ( − Σ_x ln p(y | x) ).

However, for each summand,

C_x = ln p(y | x) = −(1/2) [ ln(2πσ²) + (µ − y)²/σ² ],
∂C_x/∂σ = −σ⁻¹ + (µ − y)² σ⁻³,

so the gradients and errors become unstable when σ is close to 0.
Generic Modeling for Linear Regression: Step 2 (cont.)
To prevent the gradients and errors from becoming unstable, we may substitute v for the term 1/(2σ²); then for each summand,

C_x = (1/2) ln(v/π) − v(µ − y)²,
∂C_x/∂µ = −2v(µ − y),
∂C_x/∂v = 1/(2v) − (µ − y)².

Note that this substitution is valid only when the variance is not too large: as σ² grows, v → 0 and the 1/(2v) term itself becomes unstable.
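A numerical comparison of the two parametrizations near σ = 0 (hypothetical values; plain Python):

```python
mu, y = 0.5, 0.4
sigma = 1e-4                   # near-degenerate variance
v = 1.0 / (2.0 * sigma ** 2)   # substituted parameter

# Gradient in the sigma parametrization: -1/sigma + (mu - y)^2 / sigma^3
g_sigma = -1.0 / sigma + (mu - y) ** 2 / sigma ** 3
# Gradient in the v parametrization: 1/(2v) - (mu - y)^2
g_v = 1.0 / (2.0 * v) - (mu - y) ** 2

print(g_sigma)  # enormous: unstable as sigma -> 0
print(g_v)      # order 1e-2: stable
```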
Generic Modeling for Linear Regression: Step 2 (fixed variance)
If the variance σ is fixed and chosen by the user, then by comparing the negative log-likelihood with MSE we can see that minimizing the NLL is equivalent to minimizing the MSE:

C_mse = (1/m) Σ_{i=1}^{m} ‖y^(i) − ŷ^(i)‖²,
C_nll = − Σ_{i=1}^{m} C_{x^(i)} = (1/2) [ m ln(2πσ²) + Σ_{i=1}^{m} ‖µ_{x^(i)} − y^(i)‖² / σ² ].

With σ fixed, C_nll is an increasing affine function of Σ_i ‖µ_{x^(i)} − y^(i)‖², so the two objectives share the same minimizer.
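A quick check of this equivalence (hypothetical data; NumPy): with σ fixed, C_nll equals a·C_mse + b for constants a > 0 and b:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=100)               # hypothetical targets
mu = y + 0.3 * rng.normal(size=100)    # hypothetical predictions mu_x
sigma = 2.0                            # fixed, user-chosen
m = len(y)

c_mse = np.mean((y - mu) ** 2)
c_nll = 0.5 * (m * np.log(2 * np.pi * sigma ** 2)
               + np.sum((mu - y) ** 2) / sigma ** 2)

# With sigma fixed, C_nll = a * C_mse + b, so both share a minimizer.
a = m / (2 * sigma ** 2)
b = 0.5 * m * np.log(2 * np.pi * sigma ** 2)
print(np.isclose(c_nll, a * c_mse + b))  # True
```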
Generic Modeling for Linear Regression: Step 3
– Generic: learns the full distribution (here, µ and σ).
– Deterministic: learns a single statistic (here, µ).
Experiment (ref): generate random data based on the formula

y = x + 7.0 sin(0.75x) + ε,

where ε is Gaussian noise with µ = 0, σ = 1.
FNN config: #(hidden layers) = 1, width = 20, tanh hidden units.
[Figures: the generic and the deterministic fits.]
More Complicated Cases
– Complicated data distributions.
– In some cases, it is almost impossible to describe the data via deterministic methods.
– Generic methods may perform better in complicated cases.
Mixture Density Network
Generate random data based on the formula

x = y + 7.0 sin(0.75y) + ε,

where ε is Gaussian noise with µ = 0, σ = 1.
First, try using MSE to define the cost function, with one hidden layer of width 20 and tanh hidden units.
The reason this fails is that minimizing MSE is equivalent to minimizing the negative log-likelihood of a single Gaussian, which cannot capture a multimodal conditional distribution.
The mixture density network: the Gaussian mixture with n components is defined by the conditional probability distribution

p(y | x) = Σ_{i=1}^{n} p(c = i | x) N(y; µ^(i)(x), Σ^(i)(x)).  (9)

Network configuration:
1. The number of components n needs to be fine-tuned (trial and error).
2. 3 × n output units.
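Equation (9) can be sketched for a single input x (all output values hypothetical; NumPy; exponentiating the scale outputs to keep them positive is a common choice, not from the slides):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gaussian_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def mdn_density(y, raw):
    # raw: the 3n linear outputs for one input x ->
    # n mixture logits, n means, n log-scales.
    n = len(raw) // 3
    weights = softmax(raw[:n])       # p(c = i | x), sums to 1
    mus = raw[n:2 * n]               # mu^(i)(x)
    sigmas = np.exp(raw[2 * n:])     # keep scales positive
    return np.sum(weights * gaussian_pdf(y, mus, sigmas))   # eq. (9)

raw = np.array([0.2, -0.1, 0.4,    # mixture logits (hypothetical)
                -2.0, 0.0, 2.0,    # means
                0.0, 0.1, -0.2])   # log-scales
print(mdn_density(0.5, raw))
```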
Experiment (ref):
– #(components) = 24,
– two hidden layers with width = 24 and tanh activation,
– #(output units) = 3 × 24, all linear.
Conclusions and Discussions
– In classification problems, cross entropy is a more natural way to evaluate errors than the other methods.
– An improved cross-entropy formulation avoids numerical instability.
  – The MNIST example from TensorFlow.
– To determine whether a cost function is good:
  – Is the cost function analytic?
  – Can the learning progress well?
– Deterministic vs. generic:
  – Deterministic learns a single statistic, while generic learns the full distribution.
  – When the data distribution is not normal (high kurtosis or fat tails), generic might be better.
  – Generic methods are easier to apply to complicated cases.
Thank you.