The University of Manchester
Introduction to the analysis of the neural code with information theory methods
Dr Marcelo A. Montemurro, [email protected]
Information theory
Entropy
Suppose there is a source that produces symbols, taken from a given alphabet
Assume also that there is a certain probability distribution, with support over the alphabet, that determines the outcome of the source (for the moment we assume iid sources).
Probability of observing the outcome i: $p_i$
Normalisation of a probability distribution: $\sum_i p_i = 1$
We define the 'surprise' of event i as $h_i = -\log_2 p_i$ [bits]
Empirical determination of a probability: if there are $n_i$ outcomes of event i in a total of N trials, then for $N \gg 1$, $p_i \approx n_i / N$
Example: a coin with outcomes heads and tails, p(heads) = 0.5, p(tails) = 0.5
What is the average surprise?
Average of a random variable: $\langle x \rangle = \sum_i p_i x_i$
Example: the average surprise is then $H = -\sum_i p_i \log_2 p_i$
Entropy
For our coin, $H = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1$ bit
Frequency of letters in English text
p(a) = 0.082; p(e) = 0.127; p(q) = 0.001
Surprise of letter 'e': $-\log_2 0.127 \approx 2.98$ bits
Surprise of letter 'q': $-\log_2 0.001 \approx 9.97$ bits
Example
If all the letters appeared with the same probability, then $p_i = 1/26$
and $H = \log_2 26 \approx 4.7$ bits,
which is larger than the entropy of the real letter distribution. It can be shown that the entropy attains its maximum value for a uniform distribution.
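As a small illustration (not part of the original slides), here is a Python sketch of these calculations, using only the probability values quoted above:

```python
import math

def surprise(p):
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy H = -sum_i p_i log2 p_i, in bits (terms with p = 0 are dropped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Letter probabilities quoted on the slide
print(surprise(0.127))   # surprise of 'e' ~ 2.98 bits
print(surprise(0.001))   # surprise of 'q' ~ 9.97 bits

# Uniform distribution over the 26 letters: maximum possible entropy
print(entropy([1 / 26] * 26))          # log2(26) ~ 4.70 bits

# Fair coin: H = 1 bit; a source that always gives the same outcome: H = 0
print(entropy([0.5, 0.5]), entropy([1.0]))
```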
Imagine a loaded die that always produces the same outcome
What is the surprise of each outcome? The certain outcome has surprise $-\log_2 1 = 0$ bits
What is the average surprise? $H = 0$ bits
What if the die is fair? Then each outcome has probability 1/6
What is the surprise of each outcome? $-\log_2(1/6) \approx 2.58$ bits
What is the average surprise? $H = \log_2 6 \approx 2.58$ bits
In general, the less uniform a distribution (the less random it is), the lower the entropy
In general, for the independent binary variable case, $H(p) = -p\log_2 p - (1-p)\log_2(1-p)$, which is maximal (1 bit) at $p = 0.5$ and zero at $p = 0$ or $p = 1$
Thus for a noiseless communication system the entropy quantifies the amount of information that can be encoded in the signal
Signal with low entropy -> low information
Signal with high entropy -> high information
[Diagram: noiseless channel mapping input a to output 0 and input b to output 1, with input probabilities p(a) and p(b)]
[Figure: spike rasters over four trials for Stimulus 1 (spike counts 3, 4, 3, 2) and Stimulus 2 (spike counts 5, 6, 5, 4)]
However, many real systems, like neurons, have a noisy output
Because of the noise, a new variability has to be taken into account. On the one hand, we have the variability of the stimulus (good variability); on the other we have the variability created by the noise (bad variability)
How to handle this more complex problem? How can we quantify information in the presence of noise in the channel?
[Diagram: transmitter X → noisy channel p(Y|X) → receiver Y]
[Diagram: in a noiseless channel each input (a or b, with probabilities p(a), p(b)) maps to a single output (0 or 1); in a noisy channel each input can produce either output]
[Diagram: stimulus s → P(r|s) → response r, the probabilistic dictionary]
•The amount of information about the stimulus encoded in the neural response is quantified by the Mutual Information I(S;R)
•In general Mutual Information quantifies how much can be known about one variable by looking at the other.
•It can be computed from real data by characterising the stimulus-response statistics.
Mutual Information
Response entropy: variability of the whole response
Noise entropy: variability of the response at fixed stimulus
$H(R) = -\sum_r P(r)\,\log_2 P(r)$
$H(R|S) = -\sum_s P(s) \sum_r P(r|s)\,\log_2 P(r|s)$
Noisy binary channel
Stimulus = {a, b}, p(S) = {p(a), p(b)}
Response = {0, 1}, p(R) = {p(0), p(1)}
P(R|S): the conditional probability of each response given each stimulus (the probabilistic dictionary mapping stimulus to response)
Simple example
p(S) = {0.5, 0.5}
Let us first find p(R) = {p(0), p(1)}: we must find p(0) and p(1) using $p(r) = \sum_s P(r|s)\,p(s)$
Now we can find the entropies to compute the information
Then, to compute the information we just take the difference between the two entropies: $I(S;R) = H(R) - H(R|S)$
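A minimal Python sketch of this calculation is given below. The actual P(R|S) values of the example were shown only in a figure, so the matrix used here (P(0|a) = 0.9, P(0|b) = 0.2, etc.) is an assumed illustration, not the one from the slide:

```python
import math

def entropy(probs):
    """H = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Stimulus distribution from the slide
p_s = {'a': 0.5, 'b': 0.5}

# Hypothetical conditional probabilities P(r|s); the true values were shown
# in a figure and are not given in the text, so these are only an example.
p_r_given_s = {'a': {0: 0.9, 1: 0.1},
               'b': {0: 0.2, 1: 0.8}}

# Marginal response distribution: p(r) = sum_s P(r|s) p(s)
p_r = {r: sum(p_r_given_s[s][r] * p_s[s] for s in p_s) for r in (0, 1)}

# Response entropy and noise entropy
H_R = entropy(p_r.values())
H_R_given_S = sum(p_s[s] * entropy(p_r_given_s[s].values()) for s in p_s)

# Mutual information is the difference between the two entropies
I = H_R - H_R_given_S
print(f"H(R) = {H_R:.3f} bits, H(R|S) = {H_R_given_S:.3f} bits, I = {I:.3f} bits")
```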
What is the meaning of information?
Response entropy: variability of the whole response
Noise entropy: variability of the response at fixed stimulus
Stimulus entropy: variability of the whole stimulus
Noise entropy (equivocation): variability of the stimulus at fixed response; by symmetry, $I(S;R) = H(R) - H(R|S) = H(S) - H(S|R)$
Meaning 1: Number of yes/no questions to identify the stimulus
a) Deterministic responses
Stimulus 1 → Response 1, Stimulus 2 → Response 2, Stimulus 3 → Response 3, Stimulus 4 → Response 4
P(S) = 1/4 for each stimulus, so H(S) = 2 bits
Before observing the responses, 2 questions need to be asked on average
When a response is observed, the stimulus is fully identified, so 0 further questions need to be asked on average
b) Overlapping responses
Stimulus 1 maps onto several responses (Response 1, Response 2, Response 3) that partly overlap with those of Stimulus 2
P(S) = 1/2 for each stimulus, so H(S) = 1 bit
Before observing the responses, 1 question needs to be asked on average
When a response is observed, H(S|R) > 0 questions still need to be asked on average, because overlapping responses leave residual uncertainty about the stimulus
Information measures the reduction in uncertainty about the stimulus, after the responses are observed
[Figure (repeated): spike rasters over four trials for Stimulus 1 (spike counts 3, 4, 3, 2) and Stimulus 2 (spike counts 5, 6, 5, 4)]
Meaning 2: upper bound to the number of messages that can be transmitted through a communication channel
Question: what is the number of stimuli n that can be encoded in the neural response such that their responses do not overlap?
[Diagram: the space of all responses partitioned into non-overlapping sets of responses to S1, S2, S3 and S4]
Typical sequences
Consider a sequence of n iid symbols, $x^{(n)} = x_1 x_2 \ldots x_n$, where each $x_j$ takes one of k values and $p(x_j = i) = p_i$
What is the probability of a given sequence? If symbol i appears $n_i$ times, $p = p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$
A typical sequence is one in which every symbol i appears a number of times equal to its average, $n_i \approx n p_i$
Then the probability of a typical sequence will be $p = p_1^{n p_1} p_2^{n p_2} \cdots p_k^{n p_k}$
Taking logs, $\log_2 p = n \sum_i p_i \log_2 p_i = -nH$
Then $p = 2^{-nH}$ is the probability of each typical sequence
Example: binary symbols $x_i \in \{0, 1\}$ (k = 2), iid; e.g. $x^{(2)} = 01$, $x^{(2)} = 00$
$p = 2^{-nH}$ is the probability of each typical sequence. What is the probability of all typical sequences?
First, how many typical sequences are there?
If M is the number of all typical sequences, then the total probability is $M\,p$
Example: $x_i \in \{a, b\}$; a typical sequence looks like $x = aa\ldots a\,bb\ldots b$ (in some order), with $n_1$ occurrences of a and $n_2$ of b, $n = n_1 + n_2$
The number of such sequences is
$\dfrac{n!}{n_1!\,n_2!} = \dfrac{n!}{(p_1 n)!\,(p_2 n)!}$
When we have k symbols, the number of typical sequences is
$\dfrac{n!}{(p_1 n)!\,(p_2 n)!\cdots(p_k n)!}$
If the sequences are very long, we can compute the log of this number using Stirling's approximation, $\log(n!) \approx n\log(n) - n$,
and obtain that the number of typical sequences is approximately $2^{nH}$
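A quick numerical check of this result (not from the slides): the sketch below compares the exact log2 of the multinomial count of typical sequences with nH, for an arbitrary choice of symbol probabilities, using log-gamma instead of Stirling's approximation:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def log2_multinomial(n, probs):
    """log2 of n! / prod_i (p_i * n)!, with counts rounded to integers."""
    counts = [round(p * n) for p in probs]
    log2e = math.log2(math.e)
    return log2e * (math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts))

probs = [0.5, 0.25, 0.25]            # example symbol probabilities (k = 3)
H = entropy(probs)
for n in (100, 1000, 10000):
    # Per-symbol log2 of the number of typical sequences approaches H,
    # i.e. there are about 2^(nH) typical sequences.
    print(n, log2_multinomial(n, probs) / n, H)
```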
Question: what is the number of stimuli n that can be encoded in the neural response such that their responses do not overlap?
[Diagram (repeated): the space of all responses partitioned into non-overlapping sets of responses to S1, S2, S3 and S4]
Simple explanation
There are typically $2^{H(R)}$ responses that could be generated by the stimuli
However, due to the 'noise' fluctuations in the response, a number $2^{H(R|S)}$ of different responses can be attributed to the same stimulus
Then, how many stimuli can be reliably encoded in the neural response?
$\dfrac{2^{H(R)}}{2^{H(R|S)}} = 2^{H(R)-H(R|S)} = 2^{I(R,S)}$
Therefore, finding that a neuron transmits n bits of information within a behaviourally relevant time window means that there are potentially $2^n$ different stimuli that can be discriminated only on the basis of the neuron's response.
How do we estimate information in a neural system?
[Diagram: external stimulus → sensory system → spike trains (encoding)]
The spike train in a time window of T [ms] is discretised into L bins of size Δt (T = L Δt), giving a binary response word r = (r1, r2, …, rL), e.g. 10101001 …0010, 11101010 …1101, 00111101 …0110
Each stimulus condition (S1, S2, S3, …) is presented with probability P(s), for Ns trials per stimulus, and from these trials the conditional response distribution P(r|s) is estimated
P(r|t): response probability conditional on the stimulus (at fixed time t)
P(r): unconditional response probability
[Figure: raster of trials; each response word (e.g. 0 0 0 0 0 0 0 0 1 0) is read from a time window T divided into bins of size ∆t]
Response entropy: variability of the whole response
Noise entropy: variability of the response at fixed time
Mutual Information quantifies how much variability is left after subtracting the effect of noise. It is measured in bits (Meaning 3)
$P(r) = \langle P(r|t) \rangle_t$
$H(R) = -\sum_r P(r)\,\log_2 P(r)$
$H(R|S) = \left\langle -\sum_r P(r|t)\,\log_2 P(r|t) \right\rangle_t$
$I(R,S) = H(R) - H(R|S)$
$L = T/\Delta t$
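The sketch below is one possible implementation of this direct ('plugin') estimator, assuming the responses have already been binned into binary words (one word per trial, grouped by stimulus) and that the stimuli are equiprobable; the data format and function names are mine, not from the slides:

```python
import math
from collections import Counter

def entropy_from_counts(counter, total):
    """Plugin entropy in bits from a Counter of word occurrences."""
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def plugin_information(words_by_stimulus):
    """Direct (plugin) estimate of I(S;R) = H(R) - H(R|S).

    words_by_stimulus: dict mapping each stimulus to a list of response words,
    one word per trial (a word is a tuple of 0/1 bin values). Stimuli are
    assumed equiprobable, as when each is presented the same number of times.
    """
    # H(R): pool the responses over all stimuli
    all_words = [w for ws in words_by_stimulus.values() for w in ws]
    H_R = entropy_from_counts(Counter(all_words), len(all_words))

    # H(R|S): average of the per-stimulus entropies, weighted by P(s) = 1/n_stim
    n_stim = len(words_by_stimulus)
    H_R_given_S = sum(entropy_from_counts(Counter(ws), len(ws)) / n_stim
                      for ws in words_by_stimulus.values())
    return H_R - H_R_given_S

# Toy usage with made-up response words (L = 3 bins per word)
data = {'s1': [(0, 0, 1), (0, 1, 1), (0, 0, 1), (0, 0, 1)],
        's2': [(1, 0, 0), (1, 1, 0), (1, 0, 0), (1, 0, 0)]}
print(f"I = {plugin_information(data):.3f} bits")
```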
To measure P(r|s) we need to estimate up to $2^L - 1$ parameters from the data
The statistical errors in the estimation of P(r|s) lead to a systematic bias in the entropies
Bias in the information estimation
For $N_s \gg 1$ we can obtain a first-order approximation to the bias (Miller, G. A., Information Theory in Psychology, 1955), with $N = N_s\,S$ the total number of trials over the $S$ stimuli:
$\mathrm{Bias}[H(R)] \approx -\dfrac{\bar{R}-1}{2N\ln 2}$, where $\bar{R}$ is the number of response 'words' with non-zero probability
$\mathrm{Bias}[H(R|S)] \approx -\dfrac{\sum_s (\bar{R}_s - 1)}{2N\ln 2}$, where $\bar{R}_s$ is the number of response words with non-zero probability at fixed stimulus s
When the response is more random (responses are more uniformly spread over the possible response words), $\bar{R}$ is large, so the bias is large
When the response is less random (responses are more concentrated over a few response words), $\bar{R}$ is small, so the bias is small
Because of the bias the information is overestimated:
$-\mathrm{Bias}[H(R|S)] > -\mathrm{Bias}[H(R)] \;\Rightarrow\; \mathrm{Bias}[H(R) - H(R|S)] > 0$
Adapted from Panzeri et al., J. Neurophysiol. (2007)
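A sketch of the corresponding first-order correction is given below. It approximates the number of response words with non-zero probability by the number of distinct words actually observed, which is itself only a rough proxy; the data format follows the earlier plugin sketch:

```python
import math

def first_order_bias_bits(words_by_stimulus):
    """First-order estimate of the bias of I = H(R) - H(R|S), in bits.

    Uses Bias[H] ~ -(R_bar - 1) / (2 N ln 2), with R_bar approximated by the
    number of distinct observed words and N the total number of trials.
    Stimuli are assumed to be presented equally often, so P(s) = Ns / N.
    """
    n_total = sum(len(ws) for ws in words_by_stimulus.values())
    r_bar_total = len({w for ws in words_by_stimulus.values() for w in ws})

    bias_H_R = -(r_bar_total - 1) / (2 * n_total * math.log(2))
    bias_H_R_given_S = -sum(len(set(ws)) - 1 for ws in words_by_stimulus.values()) \
                       / (2 * n_total * math.log(2))

    # Bias of the information: Bias[H(R)] - Bias[H(R|S)], positive, so I is overestimated
    return bias_H_R - bias_H_R_given_S

# Usage: subtract the estimated bias from the plugin information
# I_corrected = plugin_information(data) - first_order_bias_bits(data)
```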
A lower bound to the information
For words of length L, we need to estimate at least 2^L parameters from the data!
Independent model
In general, $P(r|s) \neq \prod_{i=1}^{L} P(r_i|s)$
Using the independent model, $P_{\mathrm{ind}}(r|s) = \prod_{i=1}^{L} P(r_i|s)$, we can compute the corresponding noise entropy $H_{\mathrm{ind}}(R|S)$
To estimate this probability we need only 2L parameters!
This entropy is much less biased
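A minimal sketch of this computation, using the fact that the entropy of a factorized distribution is the sum of the single-bin entropies; the data format and names follow the earlier sketches and are assumptions, not part of the slides:

```python
import math

def independent_noise_entropy(words_by_stimulus):
    """Noise entropy H_ind(R|S) of the independent (factorized) model, in bits.

    For each stimulus, P_ind(r|s) = prod_i P(r_i|s), so its entropy is the sum
    over bins of the single-bin entropies; stimuli are assumed equiprobable.
    Only one spike probability per bin and stimulus needs to be estimated.
    """
    def binary_entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    n_stim = len(words_by_stimulus)
    h = 0.0
    for ws in words_by_stimulus.values():
        n_trials, L = len(ws), len(ws[0])
        # Marginal probability of a '1' (a spike) in each bin, given this stimulus
        p_bins = [sum(w[i] for w in ws) / n_trials for i in range(L)]
        h += sum(binary_entropy(p) for p in p_bins) / n_stim
    return h
```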
Original data:
          r1 r2 r3 r4
Trial 1    1  0  1  0
Trial 2    0  1  0  1
Trial 3    1  0  0  1

After shuffling (each bin permuted independently across trials):
          r1 r2 r3 r4
Trial 1    1  1  1  1
Trial 2    0  0  0  0
Trial 3    1  0  0  1
There is an alternative way of estimating the entropy of the independent model.
Instead of neglecting the correlations by computing the marginals, we simply destroy them in the original dataset.
Essentially because shuffling creates a larger number of response words with non-zero probability
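A sketch of the shuffling procedure illustrated in the table above (the random seed and function name are mine); the shuffled data can then be fed to the same noise-entropy estimator:

```python
import random

def shuffle_within_bins(words, seed=0):
    """Destroy within-trial correlations: independently permute, across trials,
    the responses in each bin. The single-bin marginals are unchanged."""
    rng = random.Random(seed)
    n_trials, L = len(words), len(words[0])
    columns = []
    for i in range(L):
        col = [w[i] for w in words]        # responses in bin i across trials
        rng.shuffle(col)                   # permute trials for this bin only
        columns.append(col)
    # Reassemble shuffled response words, one per trial
    return [tuple(columns[i][t] for i in range(L)) for t in range(n_trials)]

# Usage: shuffle trials separately for each stimulus, then recompute the
# noise entropy on the shuffled data to obtain H_sh(R|S)
# shuffled = {s: shuffle_within_bins(ws) for s, ws in data.items()}
```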
[Figure: (a) information estimates I and Ish versus log2(number of trials); (b) their standard deviations σI and σIsh versus log2(number of trials); vertical axes in bits. From Montemurro et al., Neural Computation (2007)]
Now we propose the following estimator for the entropy
Quadratic extrapolation
Further improvements can be achieved with extrapolation methods
We have N trials. We then also obtain estimates of the information from subsets of the trials of size N/2 and N/4
This gives three estimates of the information: I1, I2, and I4
Up to 2nd order the bias is a parabola in 1/N, so fitting a parabola through the three estimates (plotted against 1/N) and extrapolating to 1/N → 0 gives the corrected estimate
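A sketch of such a quadratic extrapolation is shown below. How the N/2 and N/4 subsets are drawn (random disjoint parts, averaged) is my assumption of a common choice, not something specified in the slides:

```python
import random
import numpy as np

def quadratic_extrapolation(words_by_stimulus, info_fn, seed=0):
    """Extrapolate an information estimate to infinite data size.

    info_fn(data) returns an information estimate (e.g. the plugin estimator).
    Estimates are computed on the full data, on halves and on quarters of the
    trials (averaged over the sub-partitions); a parabola in 1/N is fitted and
    evaluated at 1/N -> 0.
    """
    rng = random.Random(seed)

    def subsample(parts):
        """Average the information over 'parts' random disjoint parts of the trials."""
        shuffled = {s: rng.sample(ws, len(ws)) for s, ws in words_by_stimulus.items()}
        vals = []
        for part in range(parts):
            sub = {s: ws[part * (len(ws) // parts):(part + 1) * (len(ws) // parts)]
                   for s, ws in shuffled.items()}
            vals.append(info_fn(sub))
        return sum(vals) / len(vals)

    n = sum(len(ws) for ws in words_by_stimulus.values())
    x = np.array([1.0 / n, 2.0 / n, 4.0 / n])      # 1/N for full, halves, quarters
    y = np.array([subsample(1), subsample(2), subsample(4)])
    coeffs = np.polyfit(x, y, 2)                    # parabola in 1/N
    return np.polyval(coeffs, 0.0)                  # extrapolated value at 1/N = 0

# Usage (with the plugin estimator sketched earlier):
# I_inf = quadratic_extrapolation(data, plugin_information)
```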
The practical: efficiency of the neural code of the H1 neuron of the fly
The experiment was done right before sunset, at midday, and right after sunset
The same visual scene was presented 100-200 times
1) Examine the data
2) Generate rasters for the three conditions
3) Compute the time-varying firing rate, allowing for different binnings
4) Compute spike-count information as a function of window size (a rough code sketch follows this list)
5) Compute spike-time information as a function of window size
6) Determine the maximum response word length for which the estimation is accurate
7) Compute the efficiency of the code: e = I(R,S)/H(R) = 1 − H(R|S)/H(R)
8) Discuss
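A rough sketch of step 4 (referenced in the list above): it computes the plugin information carried by the spike count in a single window, assuming spike times come as one sequence per trial, grouped by stimulus; the data format and function names are assumptions, not part of the practical hand-out:

```python
import math
from collections import Counter

def entropy_from_counts(counter, total):
    """Plugin entropy in bits from a Counter of observed values."""
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def spike_count_information(spike_times_by_stimulus, t_start, window_ms):
    """Plugin I(S;R) when the response R is the spike count in one window.

    spike_times_by_stimulus: dict stimulus -> list of trials, each trial a
    sequence of spike times in ms (an assumed data format). Stimuli are
    treated as equiprobable.
    """
    counts_by_stim = {
        s: [sum(t_start <= t < t_start + window_ms for t in trial) for trial in trials]
        for s, trials in spike_times_by_stimulus.items()
    }
    all_counts = [c for cs in counts_by_stim.values() for c in cs]
    H_R = entropy_from_counts(Counter(all_counts), len(all_counts))
    n_stim = len(counts_by_stim)
    H_R_given_S = sum(entropy_from_counts(Counter(cs), len(cs)) / n_stim
                      for cs in counts_by_stim.values())
    return H_R - H_R_given_S

# e.g. evaluate spike_count_information(data, 0.0, T) for T in (2, 4, 8, 16, 32) ms
```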