Page 1

Introduction To Information Theory

Edward Witten

PiTP 2018

Page 2

We will start with a very short introduction to classical information theory (Shannon theory).

Suppose that you receive a message that consists of a string of symbols a or b, say

aababbaaaab · · ·

And let us suppose that a occurs with probability p, and b with probability 1 − p.

How many bits of information can one extract from a long message of this kind, say with N letters?

Page 3

For large N, the message will consist very nearly of pN occurrences of a and (1 − p)N occurrences of b. The number of such messages is

N! / ((pN)! ((1 − p)N)!) ∼ N^N / ((pN)^{pN} ((1 − p)N)^{(1−p)N}) = 1 / (p^{pN} (1 − p)^{(1−p)N}) = 2^{NS}   (1)

where S is the Shannon entropy per letter

S = −p log p − (1 − p) log(1 − p).

(The base of the exponential is 2 because in information theory, one usually uses logarithms in base 2.)

Page 4

The total number of messages of length N, given our knowledge of the relative probability of letters a and b, is roughly

2^{NS}

and so the number of bits of information one gains in actually observing such a message is

NS.

This is an asymptotic formula for large N, since we used only the leading term in Stirling’s formula to estimate the number of possible messages, and we ignored fluctuations in the frequencies of the letters.

Page 5

Suppose more generally that the message is taken from an alphabet with k letters a1, a2, · · · , ak, where the probability to observe ai is pi, for i = 1, · · · , k. We write A for this probability distribution. In a long message with N >> 1 letters, the symbol ai will occur approximately N pi times, and the number of such messages is asymptotically

N! / ((p1 N)! (p2 N)! · · · (pk N)!) ∼ N^N / ∏_{i=1}^{k} (pi N)^{pi N} = 2^{N SA}

where now the entropy per letter is

SA = −∑_{i=1}^{k} pi log pi.

This is the general definition of the Shannon entropy of a probability distribution for a random variable a that takes values a1, . . . , ak with probabilities p1, . . . , pk. The number of bits of information that one can extract from a message with N symbols is again

N SA.
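
For concreteness, here is a minimal Python sketch of the counting argument (not part of the original lecture; the three-letter alphabet and its probabilities are made up): it compares (1/N) log2 of the number of messages N!/∏_i (pi N)! with the entropy SA, and the agreement improves as N grows.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits, S = -sum_i p_i log2 p_i."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def log2_message_count(probs, N):
    """log2 of the multinomial coefficient N! / prod_i (p_i N)!, via lgamma."""
    counts = [round(p * N) for p in probs]          # approximate letter frequencies
    log_count = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_count / math.log(2)                  # convert natural log to base 2

probs = [0.5, 0.3, 0.2]                             # a made-up 3-letter alphabet
S = shannon_entropy(probs)
for N in (100, 10_000, 1_000_000):
    print(N, log2_message_count(probs, N) / N, S)   # per-letter log2 count approaches S
```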

Page 6

From the derivation, since the number 2^{N SA} of possible messages is certainly at least 1, we have

SA ≥ 0

for any probability distribution. To get SA = 0, there has to be only 1 possible message, meaning that one of the letters has probability 1 and the others have probability 0. The maximum possible entropy, for an alphabet with k letters, occurs if the pi are all 1/k and is

SA = −∑_{i=1}^{k} (1/k) log(1/k) = log k.

(Exercise: prove this by using the method of Lagrange multipliers to maximize −∑i pi log pi with the constraint ∑i pi = 1.)

Page 7

In engineering applications, N SA is the number of bits to which a message with N letters can be compressed. In such applications, the message is typically not really random but contains information that one wishes to convey. However, in “lossless encoding,” the encoding program does not understand the message and treats it as random.

It is easy to imagine a situation in which one can make a better model by incorporating short range correlations between the letters. (For instance, the “letters” might be words in a message in the English language, and English grammar and syntax would dictate short range correlations.) A model incorporating such correlations would be a 1-dimensional classical spin chain of some kind with short range interactions. Estimating the entropy of a long message of N letters would be a problem in classical statistical mechanics. But in the ideal gas limit, in which we ignore correlations, the entropy of a long message is just NS where S is the entropy of a message consisting of only one word.

Page 8

Even in the ideal gas model, we are making statements that are only natural in the limit of large N. To formalize the analogy with statistical mechanics, one could introduce a classical Hamiltonian H whose value for the i-th symbol ai is − log pi, so that the probability of the i-th symbol in the thermodynamic ensemble is 2^{−H(ai)} = pi. Notice then that in estimating the number of possible messages for large N, we ignored the difference between the canonical ensemble (defined by probabilities 2^{−H}) and the microcanonical ensemble (in which one specifies the precise numbers of occurrences of different letters). As is usual in statistical mechanics, the different ensembles are equivalent for large N. The equivalence between the different ensembles is important in classical and quantum information theory.

Page 9

Now let us consider the following situation. Alice is trying to communicate with Bob, and she sends a message that consists of many letters, each being an instance of a random variable x whose possible values are x1, · · · , xk. She sends the message over a noisy telephone connection, and what Bob receives is many copies of a random variable y, drawn from an alphabet with letters y1, · · · , yr. (Bob might confuse some of Alice’s letters and misunderstand others.) How many bits of information does Bob gain after Alice has transmitted a message with N letters? To analyze this, let us suppose that p(xi, yj) is the probability that, in a given occurrence, Alice sends x = xi and Bob hears y = yj. The probability that Bob hears y = yj, summing over all choices of what Alice intended, is

p(yj) = ∑i p(xi, yj).

If Bob does hear y = yj, his estimate of the probability that Alice sent xi is the conditional probability

p(xi|yj) = p(xi, yj) / p(yj).

Page 10

From Bob’s point of view, once he has heard y = yj, his estimate of the remaining entropy in Alice’s signal is the Shannon entropy of the conditional probability distribution:

S_{X|y=yj} = −∑i p(xi|yj) log p(xi|yj).

Averaging over all possible values of y, the average remaining entropy, once Bob has heard y, is

∑j p(yj) S_{X|y=yj} = −∑j p(yj) ∑i (p(xi, yj)/p(yj)) log (p(xi, yj)/p(yj))
                    = −∑_{i,j} p(xi, yj) log p(xi, yj) + ∑_{i,j} p(xi, yj) log p(yj)
                    = SXY − SY.

Here SXY is the entropy of the joint distribution p(xi, yj) for the pair x, y and SY is the entropy of the probability distribution p(yj) = ∑i p(xi, yj) for y only.

Page 11

The difference SXY − SY is called the conditional entropy S(X|Y); it is the entropy that remains in the probability distribution X once Y is known. Since it was obtained as a sum of ordinary entropies S_{X|y=yj} with positive coefficients, it is clearly nonnegative:

SXY − SY ≥ 0.

(This is NOT true quantum mechanically!) Since SX is the total information content in Alice’s message, and SXY − SY is the information content that Bob still does not have after observing Y, it follows that the information about X that Bob does gain when he receives Y is the difference

I(X,Y) = SX − SXY + SY.

Here I(X,Y) is called the mutual information between X and Y. It measures how much we learn about X by observing Y.
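
As a concrete illustration (with a made-up joint distribution), the following Python sketch computes SX, SY, SXY, the conditional entropy SXY − SY, and the mutual information I(X,Y) = SX − SXY + SY; both of the last two come out nonnegative.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a probability array (zeros are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical joint distribution p(x_i, y_j); rows = x, columns = y.
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])
assert abs(p_xy.sum() - 1.0) < 1e-12

p_x = p_xy.sum(axis=1)           # p(x_i) = sum_j p(x_i, y_j)
p_y = p_xy.sum(axis=0)           # p(y_j) = sum_i p(x_i, y_j)

S_X, S_Y, S_XY = H(p_x), H(p_y), H(p_xy)
cond_entropy = S_XY - S_Y        # S(X|Y): entropy of X remaining once Y is known
mutual_info = S_X - S_XY + S_Y   # I(X,Y): what observing Y teaches us about X

print("S(X|Y) =", cond_entropy, ">= 0")
print("I(X,Y) =", mutual_info, ">= 0")
```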

Page 12

This interpretation convinces us that I(X,Y) must be nonnegative. One can prove this directly but instead I want to deduce it from the properties of one more quantity – and this will essentially complete our cast of characters. I will motivate it as follows. Suppose we are observing a random variable x, for example the final state in the decays of a radioactive nucleus. We have a theory that predicts a probability distribution Q for the final state, say the prediction is that the probability to observe final state i is qi. But maybe our theory is wrong and the decay is actually described by some different probability distribution P, such that the probability of the i-th final state is pi. After observing the decays of N atoms, how sure could we be that the initial hypothesis is wrong?

Page 13

If the correct probability distribution is P, then after observing N decays, we will see outcome i approximately pi N times. We will judge the probability of what we have seen to be

P = ∏_i qi^{pi N} · N! / ∏_j (pj N)! .

We already calculated that for large N

N! / ∏_j (pj N)! ∼ 2^{−N ∑i pi log pi}

so

P ∼ 2^{−N ∑i pi (log pi − log qi)}.

This is 2^{−N S(P||Q)}, where the relative entropy (per observation) or Kullback-Leibler divergence is defined as

S(P||Q) = ∑i pi (log pi − log qi).
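
A minimal Python sketch of this definition (the distributions P and Q are made up for the example), showing the relative entropy and the corresponding suppression factor 2^{−N S(P||Q)} of the likelihood after N observations:

```python
import numpy as np

def relative_entropy(p, q):
    """Kullback-Leibler divergence S(P||Q) = sum_i p_i (log2 p_i - log2 q_i), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * (np.log2(p[mask]) - np.log2(q[mask]))).sum())

p = np.array([0.5, 0.3, 0.2])    # the "true" distribution (hypothetical)
q = np.array([0.4, 0.4, 0.2])    # the initial hypothesis (hypothetical)

S_pq = relative_entropy(p, q)
print("S(P||Q) =", S_pq, "bits per observation")
for N in (10, 100, 1000):
    # Rough likelihood of the observed frequencies under the wrong hypothesis Q.
    print(N, "observations: likelihood ~ 2^-", N * S_pq)
```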

Page 14

From the derivation, S(P||Q) is clearly nonnegative, and zero only if P = Q, that is if the initial hypothesis is correct. If the initial hypothesis is wrong, we will be sure of this once

NS(P||Q) >> 1.

(Here we’ve ignored noise in the observations, which we could incorporate using what we learned in our discussion of conditional entropy.)

S(P||Q) is an important measure of the difference between two probability distributions P and Q, but notice that it is asymmetric in P and Q. We broke the symmetry by assuming that Q was our initial hypothesis and P was the correct answer.

Page 15

Now we will use this to prove positivity of the mutual information. We consider a pair of random variables x, y and we consider two different probability distributions. One, which we will call P, is defined by a possibly correlated joint probability distribution

p(xi, yj).

Given such a joint probability distribution, the separate probability distributions for x and for y are

p(xi) = ∑j p(xi, yj),   p(yj) = ∑i p(xi, yj).

(I will always use this sort of notation for a reduced probability distribution in which some variable is “integrated out” or summed over.)

Page 16

We define a second probability distribution Q for x, y by ignoring the correlations between them:

q(xi, yj) = p(xi) p(yj).

Now we calculate the relative entropy between these two distributions:

S(P||Q) = ∑_{i,j} p(xi, yj) (log p(xi, yj) − log(p(xi) p(yj)))
        = ∑_{i,j} p(xi, yj) (log p(xi, yj) − log p(xi) − log p(yj))
        = SX + SY − SXY = I(X,Y).

Page 17

Thus I(X,Y) ≥ 0, with equality only if the two distributions are the same, meaning that x and y were uncorrelated to begin with.

The property

SX + SY − SXY ≥ 0

is called subadditivity of entropy.

Page 18

Now there is one more very important property of S(P||Q) that I want to explain, and this will more or less conclude our introduction to classical information theory. Suppose that x and y are two random variables. Let PXY and QXY be two probability distributions, described by functions p(xi, yj) and q(xi, yj). If we start with a hypothesis QXY for the joint probability, then after many trials in which we observe x and y, our confidence that we are wrong is determined by S(PXY||QXY). But suppose that we only observe x and not y. The reduced distributions PX and QX for X only are described by functions

p(xi) = ∑j p(xi, yj),   q(xi) = ∑j q(xi, yj).

If we observe x only, then the confidence after many trials that the initial hypothesis is wrong is controlled by S(PX||QX).

Page 19

It is harder to disprove the initial hypothesis if we observe only X, so

S(PXY||QXY) ≥ S(PX||QX).

This is called monotonicity of relative entropy.

Concretely, if we observe a sequence x_{i1}, x_{i2}, . . . , x_{iN} in N trials, then to estimate how unlikely this is, we will imagine a sequence of y’s that minimizes the unlikelihood of the joint sequence (x_{i1}, y_{i1}), (x_{i2}, y_{i2}), · · · , (x_{iN}, y_{iN}). An actual sequence of y’s that we might observe can only be more unlikely than this. So observing y as well as x can only increase our estimate of how unlikely the outcome was, given the sequence of the x’s. Thus, the relative entropy only goes down upon “integrating out” some variables and not observing them.

Page 20

I think if you understand what I’ve said, you should regard it as a proof. However, I will also give a proof in formulas.

Page 21

The inequality S(PXY||QXY) − S(PX||QX) ≥ 0 can be written

∑_{i,j} p(xi, yj) ( log (p(xi, yj)/q(xi, yj)) − log (p(xi)/q(xi)) ) ≥ 0.

Equivalently

∑i p(xi) ∑j (p(xi, yj)/p(xi)) log ( (p(xi, yj)/p(xi)) / (q(xi, yj)/q(xi)) ) ≥ 0.

The left hand side is a sum of positive terms, since it is

∑i p(xi) S(P_{Y|x=xi} || Q_{Y|x=xi}),

where we define probability distributions P_{Y|x=xi}, Q_{Y|x=xi} conditional on observing x = xi:

p(yj)|_{x=xi} = p(xi, yj)/p(xi),   q(yj)|_{x=xi} = q(xi, yj)/q(xi).
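
As a numerical sanity check of this monotonicity statement (an illustration only, with randomly generated joint distributions), the relative entropy of the marginal distributions never exceeds the relative entropy of the joint distributions:

```python
import numpy as np

def rel_ent(p, q):
    """S(P||Q) = sum p (log2 p - log2 q) for arrays of the same shape."""
    p, q = p.ravel(), q.ravel()
    mask = p > 0
    return float((p[mask] * (np.log2(p[mask]) - np.log2(q[mask]))).sum())

rng = np.random.default_rng(0)
for _ in range(5):
    # Random joint distributions p(x_i, y_j) and q(x_i, y_j) on a 4 x 3 alphabet.
    p = rng.random((4, 3))
    p /= p.sum()
    q = rng.random((4, 3))
    q /= q.sum()
    p_x, q_x = p.sum(axis=1), q.sum(axis=1)             # marginals over y
    print(rel_ent(p, q) >= rel_ent(p_x, q_x) - 1e-12)   # always True
```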

Page 22

So this establishes monotonicity of relative entropy. An important special case is strong subadditivity of entropy. For this, we consider three random variables x, y, z. The combined system has a joint probability distribution PXYZ described by a function p(xi, yj, zk). Alternatively, we could forget the correlations between X and YZ, defining a probability distribution QXYZ for the system XYZ by

q(xi, yj, zk) = p(xi) p(yj, zk)

where as usual

p(xi) = ∑_{j,k} p(xi, yj, zk),   p(yj, zk) = ∑i p(xi, yj, zk).

The relative entropy is S(PXYZ||QXYZ).

Page 23

But what if we only observe the subsystem XY? Then we replace PXYZ and QXYZ by probability distributions PXY, QXY with

p(xi, yj) = ∑k p(xi, yj, zk),   q(xi, yj) = ∑k q(xi, yj, zk) = p(xi) p(yj)

and we can define the relative entropy S(PXY||QXY).

Page 24

Monotonicity of relative entropy tells us that

S(PXYZ||QXYZ) ≥ S(PXY||QXY).

But the relation between relative entropy and mutual information that we discussed a moment ago gives

S(PXYZ||QXYZ) = I(X,YZ) = SX − SXYZ + SYZ

and

S(PXY||QXY) = I(X,Y) = SX − SXY + SY.

So

SX − SXYZ + SYZ ≥ SX − SXY + SY

or

SXY + SYZ ≥ SY + SXYZ,

which is called strong subadditivity. Remarkably, the same statement turns out to be true in quantum mechanics, where it is both powerful and surprising.

Page 25

One final comment before we get to the quantum mechanical case. We repeatedly made use of the ability to define a conditional probability distribution, conditional on some observation. There is not a good analog of this in the quantum mechanical case and it is a bit of a miracle that many of the conclusions nonetheless have quantum mechanical analogs. The most miraculous is strong subadditivity.

Page 26

Now we turn to quantum information theory. Quantum mechanics always deals with probabilities, but the real quantum analog of a classical probability distribution is not a quantum state but a density matrix. Depending on one’s view of quantum mechanics, one might believe that the whole universe is described by a quantum mechanical pure state that depends on all the available degrees of freedom. Even if this is true, one usually studies a subsystem that cannot be described by a pure state.

Page 27

For an idealized case, let A be a subsystem of interest, with Hilbert space HA and let B be everything else of relevance, or possibly all of the rest of the universe, with Hilbert space HB. The combined Hilbert space is the tensor product HAB = HA ⊗ HB. The simple case is that a state vector ψAB of the combined system is the tensor product of a state vector ψA ∈ HA and another state vector ψB ∈ HB:

ψAB = ψA ⊗ ψB.

If ψAB is a unit vector, we can choose ψA and ψB to also be unit vectors. In the case of such a product state, measurements of the A system can be carried out by forgetting about the B system and using the state vector ψA. If OA is any operator on HA, then the corresponding operator on HAB is OA ⊗ 1B, and its expectation value in a factorized state ψAB = ψA ⊗ ψB is

〈ψAB|OA ⊗ 1B|ψAB〉 = 〈ψA|OA|ψA〉〈ψB|1B|ψB〉 = 〈ψA|OA|ψA〉.

Page 28

However, a generic pure state ψAB ∈ HAB is not a product state; instead it is “entangled.” If HA and HB have dimensions N and M, then a generic state in HAB is an N × M matrix, for example in the 2 × 3 case

ψAB = ( ∗ ∗ ∗
        ∗ ∗ ∗ ).

By unitary transformations on HA and on HB, we can transform ψAB to

ψAB → U ψAB V

where U and V are N × N and M × M unitaries. The canonical form of a matrix under that operation is a diagonal matrix, with positive numbers on the diagonal, and extra rows or columns of zeroes, for example

( √p1   0    0
   0   √p2   0 ).

Page 29

A slightly more invariant way to say this is that any pure state can be written

ψAB = ∑i √pi ψ^i_A ⊗ ψ^i_B,

where we can assume that the ψ^i_A and ψ^i_B are orthonormal,

〈ψ^i_A, ψ^j_A〉 = 〈ψ^i_B, ψ^j_B〉 = δij,

and that pi > 0. This is called the Schmidt decomposition. (The ψ^i_A and ψ^i_B may not be bases of HA or HB, because there may not be enough of them.) The condition for ψAB to be a unit vector is that

∑i pi = 1,

so we can think of the pi as probabilities.
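
In practice, the Schmidt decomposition of a pure state is just the singular value decomposition of its N × M matrix of coefficients. A minimal numpy sketch (illustration only, using a randomly chosen state):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 2, 3
# A random pure state of AB, written as an N x M matrix of coefficients.
psi = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
psi /= np.linalg.norm(psi)

# SVD: psi = U diag(sqrt(p_i)) V^dagger, i.e. psi_AB = sum_i sqrt(p_i) psi^i_A (x) psi^i_B.
U, s, Vh = np.linalg.svd(psi)
p = s**2                               # Schmidt coefficients p_i
print("sum of p_i =", p.sum())         # ~1 for a unit vector

# The reduced density matrices of A and B share the same nonzero eigenvalues p_i.
rho_A = psi @ psi.conj().T
rho_B = psi.T @ psi.conj()
print(np.sort(np.linalg.eigvalsh(rho_A))[::-1][:len(p)])
print(np.sort(np.linalg.eigvalsh(rho_B))[::-1][:len(p)])
```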

Page 30

What is the expectation value in such a state of an operator OA that only acts on A? It is

〈ψAB|OA ⊗ 1B|ψAB〉 = ∑_{i,j} √(pi pj) 〈ψ^i_A|OA|ψ^j_A〉〈ψ^i_B|1B|ψ^j_B〉 = ∑i pi 〈ψ^i_A|OA|ψ^i_A〉.

This is the same as

Tr_{HA} ρA OA,

where ρA is the density matrix

ρA = ∑i pi |ψ^i_A〉〈ψ^i_A|.

Thus, if we are only going to make measurements on system A, we do not need a wavefunction of the universe: it is sufficient to have a density matrix for system A.

Page 31

From the definition,

ρA = ∑i pi |ψ^i_A〉〈ψ^i_A|,   (∗)

we see that ρA is hermitian and positive semi-definite. Because ∑i pi = 1, ρA has trace 1:

Tr_{HA} ρA = 1.

Conversely, every matrix with those properties can be “purified,” meaning that it is the density matrix of some pure state on some “bipartite” (or two-part) system AB. For this, we first observe that any hermitian matrix ρA can be diagonalized, meaning that in a suitable basis it takes the form (∗); moreover, if ρA ≥ 0, then the pi are likewise positive (if one of the pi vanishes, we omit it from the sum).

Page 32

Having gotten this far, to realize ρA as a density matrix we simply introduce another Hilbert space HB with orthonormal states ψ^i_B and observe that ρA is the density matrix of the pure state

ψAB = ∑i √pi ψ^i_A ⊗ ψ^i_B ∈ HA ⊗ HB.

Here ψAB is not unique (even after we choose B) but it is unique up to a unitary transformation of HB.

Page 33

In this situation, ψAB is called a “purification” of the density matrix ρA. The existence of purifications is a nice property of quantum mechanics that has no classical analog: the classical analog of a density matrix is a probability distribution, and there is no notion of purifying a probability distribution.
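
Here is a minimal numpy sketch of the purification construction (an illustration with a randomly generated ρA): diagonalize ρA, form ψAB = ∑i √pi ψ^i_A ⊗ ψ^i_B with an auxiliary system B, and check that tracing out B returns ρA.

```python
import numpy as np

rng = np.random.default_rng(2)

# A random 3 x 3 density matrix rho_A (hermitian, positive semi-definite, trace 1).
M = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
rho_A = M @ M.conj().T
rho_A /= np.trace(rho_A).real

# Diagonalize: rho_A = sum_i p_i |psi_A^i><psi_A^i|.
p, vecs = np.linalg.eigh(rho_A)
p = np.clip(p, 0, None)

# Purification psi_AB = sum_i sqrt(p_i) psi_A^i (x) psi_B^i, with psi_B^i = |i>,
# written as a 3 x 3 matrix whose rows index A and columns index B.
dB = len(p)
psi_AB = sum(np.sqrt(p[i]) * np.outer(vecs[:, i], np.eye(dB)[i]) for i in range(dB))

# Tracing out B recovers rho_A.
print(np.allclose(psi_AB @ psi_AB.conj().T, rho_A))   # True
```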

Page 34

If there is more than one term in the expansion

ψAB = ∑i √pi ψ^i_A ⊗ ψ^i_B ∈ HA ⊗ HB,

we say that systems A and B are entangled in the state ψAB. If there is only one term, the expansion reduces to

ψAB = ψA ⊗ ψB,

an “unentangled” tensor product state. Then system A can be described by the pure state ψA and the density matrix is of rank 1:

ρA = |ψA〉〈ψA|.

If ρA has rank higher than 1, we say that system A is in a mixed state. If ρA is invertible, we say that A is fully mixed (this will be the situation in quantum field theory) and if ρA is a multiple of the identity, we say that A is maximally mixed.

Page 35

In the general case

ρA = ∑i pi |ψ^i_A〉〈ψ^i_A|,   (∗)

you will describe all measurements of system A correctly if you say that system A is in the state ψ^i_A with probability pi. However, one has to be careful here because the decomposition (∗) is not unique. It is unique if the pi are all distinct and one wants the number of terms in the expansion to be as small as possible, or equivalently if one wants the ψ^i_A to be orthonormal. But if one relaxes those conditions, then (except for a pure state) there are many ways to make the expansion (∗). This means that if Alice prepares a quantum system to be in the pure state ψi with probability pi, then by measurements of that system, there is no way to determine the pi or the ψi. Any measurement of the system will depend only on

ρ = ∑i pi |ψi〉〈ψi|.

Beyond that, there is no way to get information about how the system was prepared.

Page 36

Before going on, perhaps I should give a simple example of a concrete situation in which it is impractical to not use density matrices. Consider an atom in a cavity illuminated by photons. A photon entering the cavity might be scattered, or it might be absorbed and reemitted.

After a certain time, the atom is again alone in the cavity. After the atom has interacted with n photons, to give a pure state description, we need a joint wavefunction for the atom and all n photons. The mathematical machinery gets bigger and bigger, even though (assuming we observe only the atom) the physical situation is not changing. By using a density matrix, we get a mathematical framework that does not change regardless of how many photons have interacted with the atom in the past (and what else those photons might have interacted with). All we need is a density matrix for the atom.

Page 37

The von Neumann entropy of a density matrix ρA is defined by a formula analogous to the Shannon entropy of a probability distribution:

S(ρA) = −Tr ρA log ρA.

If

ρA = ∑i pi |ψ^i_A〉〈ψ^i_A|,

with the ψ^i_A orthonormal, then

ρA log ρA = diag(p1 log p1, p2 log p2, p3 log p3, · · · )

and so

S(ρA) = −∑i pi log pi,

the same as the Shannon entropy of a probability distribution {pi}.
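
For concreteness, here is a small numpy helper (illustration only) that evaluates S(ρ) = −Tr ρ log ρ by diagonalizing ρ, so that it reduces to the Shannon entropy of the eigenvalues:

```python
import numpy as np

def von_neumann_entropy(rho, base=2):
    """S(rho) = -Tr rho log rho, computed from the eigenvalues of rho."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]                       # drop zero (or numerically tiny) eigenvalues
    return float(-(p * np.log(p)).sum() / np.log(base))

# Examples: a pure state has S = 0; the maximally mixed qubit has S = 1 bit.
pure = np.array([[1, 0], [0, 0]], dtype=float)
mixed = np.eye(2) / 2
print(von_neumann_entropy(pure), von_neumann_entropy(mixed))
```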

Page 38

An immediate consequence is that, just as for the Shannon entropy,

S(ρA) ≥ 0,

with equality only for a pure state (one of the p’s being 1 and the others 0). The formula S(ρA) = −∑i pi log pi also implies the same upper bound that we had classically for a system with k states

S(ρA) ≤ log k,

with equality only if ρA is a multiple of the identity:

ρA = (1/k) diag(1, 1, 1, · · · ).

In this case, we say that A is in a maximally mixed state. In fact, the von Neumann entropy has many properties analogous to the Shannon entropy, but the explanations required are usually more subtle and there are key differences.

Page 39

Here is a nice property of the von Neumann entropy that does not have a classical analog. If a bipartite system AB is in a pure state

ψAB = ∑i √pi ψ^i_A ⊗ ψ^i_B ∈ HA ⊗ HB,

then the density matrices of systems A and B are

ρA = ∑i pi |ψ^i_A〉〈ψ^i_A|,

and likewise

ρB = ∑i pi |ψ^i_B〉〈ψ^i_B|.

The same constants pi appear in each, so clearly

S(ρA) = S(ρB).

Thus a system A and a purifying system B always have the same entropy.

Page 40

An important property of the von Neumann entropy is concavity. Suppose that ρ1 and ρ2 are two density matrices, and set ρ(t) = t ρ1 + (1 − t) ρ2, for 0 ≤ t ≤ 1. Then

d²/dt² S(ρ(t)) ≤ 0.

To show this, we first compute that

d/dt S(ρ(t)) = −Tr ρ̇ log ρ.

(Exercise!) Then as

log ρ = ∫_0^∞ ds ( 1/(s + 1) − 1/(s + ρ(t)) )

and ρ̈ = 0, we have

d²/dt² S(ρ(t)) = −∫_0^∞ ds Tr ρ̇ (1/(s + ρ(t))) ρ̇ (1/(s + ρ(t))).

The integrand is positive, as it is Tr B², where B is the self-adjoint operator (s + ρ(t))^{−1/2} ρ̇(t) (s + ρ(t))^{−1/2}. So d²/dt² S(ρ(t)) ≤ 0.

Page 41

In other words, the function S(ρ(t)) is concave. So the straight line connecting two points on its graph lies below the graph:

t S(ρ1) + (1 − t) S(ρ2) ≤ S(t ρ1 + (1 − t) ρ2) = S(ρ(t)).

More generally, let ρi, i = 1, . . . , n be density matrices and pi, i = 1, . . . , n nonnegative numbers with ∑i pi = 1. Then by induction from the above, or because this is a general property of concave functions, we have

∑i pi S(ρi) ≤ S(ρ),   ρ = ∑i pi ρi.

This may be described by saying that entropy can only increase under mixing. The nonnegative quantity that appears here is known as the Holevo information or Holevo χ:

χ = S(ρ) − ∑i pi S(ρi).
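
A quick numerical check (illustration only, with randomly generated density matrices and made-up mixing probabilities) that mixing can only increase the entropy, i.e. that the Holevo χ is nonnegative:

```python
import numpy as np

def S(rho):
    """von Neumann entropy in bits."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]
    return float(-(p * np.log2(p)).sum())

def random_density_matrix(d, rng):
    M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = M @ M.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(3)
rhos = [random_density_matrix(3, rng) for _ in range(4)]
p = rng.random(4)
p /= p.sum()                                # mixing probabilities

rho_mixed = sum(pi * ri for pi, ri in zip(p, rhos))
chi = S(rho_mixed) - sum(pi * S(ri) for pi, ri in zip(p, rhos))
print("Holevo chi =", chi, ">= 0")          # entropy increases under mixing
```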

Page 42

Here is an example. Let ρ be any density matrix and (in some basis) let ρD be the corresponding diagonal density matrix. Let ρ(t) = (1 − t) ρD + t ρ. Then

d/dt S(ρ(t))|_{t=0} = −Tr ρ̇(0) log ρD = 0

since ρD is diagonal and ρ̇(0) = dρ/dt|_{t=0} is purely off-diagonal. Also by concavity S̈(ρ(t)) ≤ 0. So S(ρ(1)) ≤ S(ρ(0)) or

S(ρ) ≤ S(ρD)

with equality only if ρ = ρD.

Page 43

Just as for classical probability distributions, for density matrices we can always “integrate out” an unobserved system and get a reduced density matrix for a subsystem. Classically, given a joint probability distribution p(xi, yj) for a bipartite system XY, we “integrated out” y to get a probability distribution for x only:

p(xi) = ∑j p(xi, yj).

The quantum analog of that is a partial trace. Suppose that AB is a bipartite system with Hilbert space HA ⊗ HB and a density matrix ρAB.

Page 44

Concretely, if |ai〉, i = 1, . . . , n are an orthonormal basis of HA and |bα〉, α = 1, . . . , m are an orthonormal basis of HB, then a density matrix for AB takes the general form

ρAB = ∑_{i,i′,α,α′} c_{i i′ α α′} |ai〉 ⊗ |bα〉 〈ai′| ⊗ 〈bα′|.

The reduced density matrix for measurements of system A only is obtained by setting α = α′ and summing:

ρA = ∑_{i,i′,α} c_{i i′ α α} |ai〉〈ai′|.

This is usually written as a partial trace:

ρA = Tr_{HB} ρAB,

the idea being that one has “traced out” HB, leaving a density operator on HA. Likewise (summing over i to eliminate HA)

ρB = Tr_{HA} ρAB.

Page 45

It is now possible to formally imitate some of the other definitions that we made in the classical case. For example, if AB is a bipartite system, we define what is called quantum conditional entropy

S(A|B) = SAB − SB.

I need to warn you, though, that the name is a little deceptive because there is not a good quantum notion of conditional probabilities. A fundamental difference from the classical case is that S(A|B) can be negative. In fact, suppose that system AB is in an entangled pure state. Then SAB = 0 but as system B is in a mixed state, SB > 0. So in this situation S(A|B) < 0. Nevertheless, one can give a reasonable physical interpretation to S(A|B). I will come back to that when we discuss quantum teleportation.
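
The following numpy sketch (an illustration, not part of the lecture) implements a two-qubit partial trace of the kind just described and evaluates S(A|B) for a maximally entangled Bell pair, where it comes out negative:

```python
import numpy as np

def S(rho):
    """von Neumann entropy in bits."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]
    return float(-(p * np.log2(p)).sum())

def partial_trace(rho_AB, dA, dB, keep):
    """Trace out one factor of H_A (x) H_B; keep='A' or 'B'."""
    rho = rho_AB.reshape(dA, dB, dA, dB)
    if keep == 'A':
        return np.einsum('ijkj->ik', rho)   # sum over the B indices
    return np.einsum('ijil->jl', rho)       # sum over the A indices

# Bell state (|00> + |11>)/sqrt(2) of two qubits.
psi = np.zeros(4)
psi[0] = psi[3] = 1 / np.sqrt(2)
rho_AB = np.outer(psi, psi)

rho_B = partial_trace(rho_AB, 2, 2, keep='B')
print("S_AB =", S(rho_AB))                  # 0: the joint state is pure
print("S_B  =", S(rho_B))                   # 1: B alone is maximally mixed
print("S(A|B) =", S(rho_AB) - S(rho_B))     # -1: negative, unlike the classical case
```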

Page 46

Another classical definition that is worth imitating is the mutual information. Given a bipartite system AB with density matrix ρAB, the mutual information is defined just as it is classically:

I(A,B) = SA − SAB + SB.

Here, however, we are more fortunate, and the quantum mutual information is nonnegative:

I(A,B) ≥ 0.

Moreover, I(A,B) = 0 if and only if the density matrix factorizes, in the sense that

ρAB = ρA ⊗ ρB.

Positivity of mutual information is also called subadditivity of entropy.

Page 47

Before proving positivity of mutual information, I will explain an interesting corollary. Although conditional entropy S(A|B) can be negative, the possibility of “purifying” a density matrix gives a lower bound on S(A|B). Let C be such that ABC is in a pure state. Remember that in general if XY is in a pure state then SX = SY. So if ABC is in a pure state then SAB = SC and SB = SAC. Thus

SAB − SB = SC − SAC ≥ −SA,

where the last step is positivity of mutual information. So

S(A|B) = SAB − SB ≥ −SA.

This is the Araki-Lieb inequality; it is saturated if SAB = 0, which implies SB = SA. This has been a typical argument exploiting the existence of purifications.

Page 48

Just as in the classical case, to understand positivity of the mutual information, it helps to first define the relative entropy. Suppose that ρ and σ are two density matrices on the same Hilbert space H. The relative entropy can be defined by imitating the classical formula:

S(ρ||σ) = Tr ρ (log ρ − log σ).

S(ρ||σ) turns out to have the same interpretation that it does classically: if your hypothesis is that a quantum system is described by σ, and it is actually described by ρ, then to learn that you are wrong, you need to observe N copies of the system where

NS(ρ||σ) >> 1.

Page 49

Just as classically, it turns out that S(ρ||σ) ≥ 0 for all ρ, σ, with equality precisely if ρ = σ. We will prove that in a moment, but first let us use it to prove that I(A,B) ≥ 0 for any density matrix ρAB. Imitating the classical proof, we define

σAB = ρA ⊗ ρB,

and we observe that

log σAB = log ρA ⊗ 1B + 1A ⊗ log ρB,

so

S(ρAB||σAB) = Tr_{AB} ρAB (log ρAB − log σAB)
            = Tr_{AB} ρAB (log ρAB − log ρA ⊗ 1B − 1A ⊗ log ρB)
            = SA + SB − SAB = I(A,B).

So just as classically, positivity of the relative entropy implies positivity of the mutual information. (This is also called subadditivity of entropy.)
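
A numerical check of this identity for a randomly generated two-qubit density matrix (illustration only; the matrix logarithm is computed by diagonalization):

```python
import numpy as np

def logm_herm(rho):
    """Matrix log (base 2) of a positive hermitian matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(rho)
    w = np.clip(w, 1e-15, None)
    return v @ np.diag(np.log2(w)) @ v.conj().T

def S(rho):
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]
    return float(-(w * np.log2(w)).sum())

def partial_trace(rho, dA, dB, keep):
    r = rho.reshape(dA, dB, dA, dB)
    return np.einsum('ijkj->ik', r) if keep == 'A' else np.einsum('ijil->jl', r)

rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
rho_AB = M @ M.conj().T
rho_AB /= np.trace(rho_AB).real                # a random two-qubit density matrix

rho_A = partial_trace(rho_AB, 2, 2, 'A')
rho_B = partial_trace(rho_AB, 2, 2, 'B')
sigma_AB = np.kron(rho_A, rho_B)

rel_ent = np.trace(rho_AB @ (logm_herm(rho_AB) - logm_herm(sigma_AB))).real
mutual_info = S(rho_A) + S(rho_B) - S(rho_AB)
print(rel_ent, mutual_info)                    # the two numbers agree up to roundoff
```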

Page 50

To prove positivity of the relative entropy, first observe that we can pick a basis in which σ is diagonal. If ρ is diagonal in the same basis, what we want reduces to the classical case. For if (say) σ = diag(q1, . . . , qn), ρ = diag(p1, . . . , pn), then

S(ρ||σ) = ∑i pi (log pi − log qi),

which can be interpreted as a classical relative entropy and so is nonnegative.

Page 51

In the general case, let ρD be obtained from ρ by dropping off-diagonal matrix elements in the basis in which σ is diagonal. Then straight from the definitions we have

S(ρ||σ) = Tr ρ (log ρ − log σ) = S(ρD||σ) + S(ρD) − S(ρ).

We just proved that S(ρD||σ) ≥ 0, and earlier we used concavity to prove S(ρD) − S(ρ) ≥ 0, so

S(ρ||σ) ≥ 0,

with equality only if ρ = σ.

Page 52

So relative entropy is positive, just as it is classically. Do we dare to hope that relative entropy is also monotonic, as classically? YES, as first proved by E. Lieb and M. B. Ruskai (1971). I consider this a miracle, because, as there is no such thing as a joint probability distribution for general quantum observables, the intuition behind the classical statement is not applicable in any obvious way. Rather, strong subadditivity is ultimately used to prove that quantities such as quantum conditional entropy and quantum relative entropy and quantum mutual information do have properties somewhat similar to the classical case.

Page 53

There are different statements of monotonicity of relative entropy, but a basic one is monotonicity under partial trace. If AB is a bipartite system with two density matrices ρAB and σAB, then we can also take a partial trace on B to get reduced density matrices on A:

ρA = TrB ρAB,   σA = TrB σAB.

Monotonicity of relative entropy under partial trace is the statement that taking a partial trace can only reduce the relative entropy:

S(ρAB||σAB) ≥ S(ρA||σA).

Page 54

One of our main goals in lecture 2 will be to describe how this can be proved, so we will not discuss that now. I will just say that monotonicity of relative entropy has “strong subadditivity” as a corollary. We can prove it by imitating what we said classically. We consider a tripartite system ABC with density matrix ρABC. There are reduced density matrices such as ρA = TrBC ρABC, ρBC = TrA ρABC, etc., and we define a second density matrix

σABC = ρA ⊗ ρBC.

The reduced density matrices of ρABC and σABC, obtained by tracing out C, are

ρAB = TrC ρABC,   σAB = TrC σABC = ρA ⊗ ρB.

Page 55

Monotonicity of relative entropy under partial trace says that

S(ρABC||σABC) ≥ S(ρAB||σAB).   (∗)

But (as in our discussion of positivity of mutual information)

S(ρABC||σABC) = S(ρABC||ρA ⊗ ρBC) = I(A,BC) = SA + SBC − SABC

and similarly

S(ρAB||σAB) = S(ρAB||ρA ⊗ ρB) = I(A,B) = SA + SB − SAB.

So (∗) becomes monotonicity of mutual information

I(A,BC) ≥ I(A,B)

or equivalently strong subadditivity

SAB + SBC ≥ SB + SABC.

Note that these steps are the same as they were classically.
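
Here is a numerical check of strong subadditivity for randomly generated three-qubit density matrices (an illustration only):

```python
import numpy as np

def S(rho):
    """von Neumann entropy in bits."""
    w = np.linalg.eigvalsh(rho)
    w = w[w > 1e-12]
    return float(-(w * np.log2(w)).sum())

def ptrace(rho, dims, keep):
    """Partial trace over the subsystems NOT listed in `keep`.
    `dims` are the subsystem dimensions; axes 0..n-1 are kets, n..2n-1 are bras."""
    n = len(dims)
    r = rho.reshape(dims + dims)
    for i in sorted((j for j in range(n) if j not in keep), reverse=True):
        r = np.trace(r, axis1=i, axis2=i + r.ndim // 2)
    d = int(np.prod([dims[i] for i in keep]))
    return r.reshape(d, d)

rng = np.random.default_rng(5)
dims = [2, 2, 2]                                 # three qubits A, B, C
for _ in range(3):
    M = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
    rho_ABC = M @ M.conj().T
    rho_ABC /= np.trace(rho_ABC).real            # a random tripartite density matrix

    ssa = (S(ptrace(rho_ABC, dims, [0, 1])) + S(ptrace(rho_ABC, dims, [1, 2]))
           - S(ptrace(rho_ABC, dims, [1])) - S(rho_ABC))
    print(ssa >= -1e-9)                          # strong subadditivity: always True
```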

Page 56

Using purifications, one can find various equivalent statements. If ABCD is in a pure state then SAB = SCD, SABC = SD so the inequality becomes

SCD + SBC ≥ SB + SD.

So for instance S(C|D) = SCD − SD can be negative, or S(C|B) = SBC − SB can be negative, but

S(C|D) + S(C|B) ≥ 0.

(This is related to “monogamy of entanglement”: a given qubit in C can be entangled with D, reducing SCD, or with B, reducing SBC, but not both.)

Page 57

Classically, the intuition behind monotonicity of mutual information

I(A,BC) ≥ I(A,B)

is that knowing both B and C will tell you at least as much about A as you would learn from knowing B only. Quantum mechanically, it is just not obvious that the formal definition I(A,B) = SA − SAB + SB fits that intuition. It takes the deep result of monotonicity of relative entropy – or strong subadditivity – to show that it does.

Page 58

In general, strong subadditivity (or monotonicity of relative entropy) is the key to most of the interesting statements in quantum information theory. Most useful statements, apart from the more elementary ones, are deduced from strong subadditivity.

Page 59

Once we start using density matrices, there are a few more tools we should add to our toolkit. First let us discuss measurements. Textbooks begin with “projective measurements,” which involve projection onto orthogonal subspaces of a Hilbert space H of quantum states. We pick positive commuting hermitian projection operators πs, s = 1, · · · , k obeying

∑s πs = 1,   πs² = πs,   πs πs′ = πs′ πs.

A measurement of a state ψ involving these projection operators has outcome s with probability

ps = 〈ψ|πs|ψ〉.

These satisfy ∑s ps = 1 since ∑s πs = 1. If instead of a pure state ψ the system is described by a density matrix ρ, then the probability of outcome s is

ps = Tr_H πs ρ.

Page 60

But Alice can make a more general type of measurement using an auxiliary system C with Hilbert space C. We suppose that C is k-dimensional with a basis of states |s〉, s = 1, · · · , k. Alice initializes C in the state |1〉. Then she acts on the combined system C ⊗ H with a unitary transformation U, which she achieves by suitably adjusting a time-dependent Hamiltonian. She chooses U so that for any ψ ∈ H

U(|1〉 ⊗ ψ) = ∑_{s=1}^{k} |s〉 ⊗ Es ψ

for some linear operators Es. (She doesn’t care about what U does on other states.) Unitarity of U implies that

∑_{s=1}^{k} Es† Es = 1,

but otherwise the Es are completely arbitrary.

Page 61

Then Alice makes a projective measurement of the system C ⊗ H, using the commuting projection operators

πs = |s〉〈s| ⊗ 1,

which have all the appropriate properties. The probability of outcome s is

ps = |Es|ψ〉|² = 〈ψ|Es† Es|ψ〉.

More generally, if the system H is described initially by a density matrix ρ, then the probability of outcome s is

ps = Tr Es† Es ρ.

The ps are nonnegative because Es† Es is nonnegative, and ∑s ps = 1 because ∑s Es† Es = 1. But the Es† Es are not commuting projection operators; they are just nonnegative hermitian operators that add to 1. What we’ve described is a more general kind of quantum mechanical measurement of the original system. (In the jargon, this is a “positive operator-valued measurement” or POVM.)

Page 62

Now let us view this process from another point of view. How can a density matrix evolve? The usual Hamiltonian evolution of a state ψ is ψ → Uψ for a unitary operator U, and on the density matrix it corresponds to

ρ → U ρ U^{−1}.

But let us consider Alice again with her extended system C ⊗ H. She initializes the extended system with density matrix

ρ̂ = |1〉〈1| ⊗ ρ

where ρ is a density matrix on H. Then she applies the same unitary U as before, mapping ρ̂ to

ρ̂′ = U ρ̂ U^{−1} = ∑_{s,s′=1}^{k} |s〉〈s′| ⊗ Es ρ Es′†.

The induced density matrix on the original system H is obtained by a partial trace and is:

ρ′ = TrC ρ̂′ = ∑_{s=1}^{k} Es ρ Es†.

Page 63

We have found a more general way that density matrices can evolve. The operation

ρ → ∑_{s=1}^{k} Es ρ Es†,   ∑s Es† Es = 1

is called a “quantum channel,” and the Es are called Kraus operators.

This is actually the most general physically sensible evolution of a density matrix (and a POVM is the most general possible measurement of a quantum system).
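
As a simple illustration of a quantum channel in this form (a standard example, not specifically the one in the lecture), the sketch below applies a one-qubit dephasing channel with Kraus operators √(1 − λ) 1, √λ |0〉〈0|, √λ |1〉〈1|, and checks that ∑s Es† Es = 1 and that the trace is preserved; the value of λ and the input density matrix are made up.

```python
import numpy as np

lam = 0.3   # dephasing strength (hypothetical value)
E = [np.sqrt(1 - lam) * np.eye(2),
     np.sqrt(lam) * np.diag([1.0, 0.0]),
     np.sqrt(lam) * np.diag([0.0, 1.0])]

# Kraus operators must satisfy sum_s E_s^dagger E_s = 1.
print(np.allclose(sum(e.conj().T @ e for e in E), np.eye(2)))

def channel(rho):
    """Apply the quantum channel rho -> sum_s E_s rho E_s^dagger."""
    return sum(e @ rho @ e.conj().T for e in E)

rho = np.array([[0.6, 0.3], [0.3, 0.4]], dtype=complex)   # a sample density matrix
rho_out = channel(rho)
print(rho_out)                        # off-diagonal elements are suppressed by (1 - lam)
print(np.trace(rho_out).real)         # trace is preserved (= 1)
```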

Page 64

Now let ρ and σ be two different density matrices on H. Let us ask what happens to the relative entropy S(ρ||σ) when we apply a quantum channel, mapping ρ and σ to

ρ′ = ∑s Es ρ Es†,   σ′ = ∑s Es σ Es†.

The first step of initialization, replacing ρ and σ by |1〉〈1| ⊗ ρ and |1〉〈1| ⊗ σ, doesn’t change anything. The second step, conjugating by a unitary matrix U, also doesn’t change anything since the definition of relative entropy is invariant under conjugation. Finally the last step was a partial trace, which can only reduce the relative entropy. So relative entropy can only go down under a quantum channel:

S(ρ||σ) ≥ S(ρ′||σ′).

This is the most general statement of monotonicity of relative entropy.

Page 65

As an example of this, let us suppose that σ is a thermal density matrix at some temperature T = 1/β

σ = (1/Z) exp(−βH).

So log σ = −βH − log Z and therefore the relative entropy between any density matrix ρ and σ is

S(ρ||σ) = Tr ρ (log ρ − log σ) = −S(ρ) + Tr ρ (βH + log Z)
        = β(E(ρ) − T S(ρ)) + log Z   (2)

where the average energy computed in the density matrix ρ is

E(ρ) = Tr ρ H.

We define the free energy

F(ρ) = E(ρ) − T S(ρ).

Note that the log Z term is independent of ρ and gives a constant that ensures that S(σ||σ) = 0. So

S(ρ||σ) = β(F(ρ) − F(σ)).

Page 66

Now consider any evolution of the system, that is any quantum channel, that preserves thermal equilibrium at temperature T = 1/β. Thus, this channel maps σ to itself, but it maps ρ to a generally different density matrix ρ′. The relative entropy can only go down under a quantum channel, so

S(ρ||σ) ≥ S(ρ′||σ),

and therefore

F(ρ) ≥ F(ρ′).

In other words, the free energy can only go down under a quantum channel that preserves thermal equilibrium. This is an aspect of the second law of thermodynamics. If you stir a system in a way that maps thermal equilibrium at temperature T to thermal equilibrium at the same temperature, then it moves any density matrix closer to thermal equilibrium at temperature T.

Page 67

Exercises:

(1) Let ψ be an arbitrary pure state. Try to find Kraus operators of a quantum channel that maps any density matrix to |ψ〉〈ψ|. Can you describe a physical realization?

(2) Same question for a quantum channel that maps any density matrix to a maximally mixed density matrix, a multiple of the identity.

(3) Same question for a quantum channel that maps any ρ = (ρij) to the corresponding diagonal density matrix ρD = diag(ρ11, ρ22, · · · , ρnn).

Page 68

Our next topic will be quantum teleportation. For a first example, imagine that Alice has in her possession a “qubit” A0, a quantum system with a two-dimensional Hilbert space. Alice would like to help Bob create in his lab a qubit in a state identical to A0. However, she doesn’t have the technology to actually send a qubit; she can only communicate by sending a classical message over the telephone. If Alice knows the state of her qubit, there is no problem: she tells Bob the state of her qubit and he creates one like it in his lab. If, however, Alice does not know the state of her qubit, she is out of luck. All she can do is make a measurement, which will give some information about the prior state of qubit A0. She can tell Bob what she learns, but the measurement will destroy the remaining information about A0 and it will never be possible for Bob to recreate A0.

Page 69

Suppose, however, that Alice and Bob have previously shared a qubit pair A1B1 (Alice has A1, Bob has B1) in a known entangled state, for example

ΨA1B1 = (1/√2)(|0 0〉 + |1 1〉).

Maybe Alice created this pair in her lab and then Bob took B1 on the road with him, leaving A1 in Alice’s lab. In this case, Alice can solve the problem. To do so she makes a joint measurement of her system A0A1 in a basis that is chosen so that no matter what the answer is, Alice learns nothing about the prior state of A0. In that case, she also loses no information about A0. But after getting her measurement outcome, she knows the full state of the system and she can tell Bob what to do to create A0.

Page 70

To see how this works, let us describe a specific measurement that Alice can make on A0A1 that will shed no light on the state of A0. She can project A0A1 on the basis of four states

(1/√2)(|0 0〉 ± |1 1〉)   and   (1/√2)(|0 1〉 ± |1 0〉).

To see the result of a measurement, suppose the unknown state of qubit A0 is α|0〉 + β|1〉. So the initial state of A0A1B1 is

ΨA0A1B1 = (1/√2)(α|0 0 0〉 + α|0 1 1〉 + β|1 0 0〉 + β|1 1 1〉).

Suppose that the outcome of Alice’s measurement is to learn that A0A1 is in the state (1/√2)(|0 0〉 − |1 1〉). After the measurement B1 will be in the state α|0〉 − β|1〉. (Exercise!) Knowing this, Alice can tell Bob that he can recreate the initial state by acting on his qubit by

ΨB1 → ( 1   0
        0  −1 ) ΨB1

in the basis |0〉, |1〉. The other cases are similar (Exercise!).
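
A numerical check of this example (illustration only): project A0A1 of the three-qubit state onto (1/√2)(|0 0〉 − |1 1〉) and verify that B1 is left in α|0〉 − β|1〉, which Bob’s correction diag(1, −1) maps back to α|0〉 + β|1〉.

```python
import numpy as np
from functools import reduce

kron = lambda *vs: reduce(np.kron, vs)

rng = np.random.default_rng(6)
alpha, beta = rng.normal(size=2) + 1j * rng.normal(size=2)
nrm = np.sqrt(abs(alpha)**2 + abs(beta)**2)
alpha, beta = alpha / nrm, beta / nrm            # unknown state of qubit A0

ket0, ket1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Initial state of A0 A1 B1: (alpha|0> + beta|1>) (x) (|00> + |11>)/sqrt(2).
psi_A0 = alpha * ket0 + beta * ket1
bell = (kron(ket0, ket0) + kron(ket1, ket1)) / np.sqrt(2)
psi = kron(psi_A0, bell)                         # 8 components, qubit order A0 A1 B1

# Alice finds A0 A1 in the state (|00> - |11>)/sqrt(2).
outcome = (kron(ket0, ket0) - kron(ket1, ket1)) / np.sqrt(2)
proj = np.kron(np.outer(outcome, outcome.conj()), np.eye(2))   # acts trivially on B1

post = proj @ psi
post /= np.linalg.norm(post)                     # state after the measurement

# B1 is left in alpha|0> - beta|1> ...
chi_B1 = post.reshape(4, 2).T @ outcome.conj()

# ... and Bob's correction diag(1, -1) restores the original unknown state.
recovered = np.diag([1.0, -1.0]) @ chi_B1
print(np.allclose(recovered, psi_A0))            # True
```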

Page 71

I want to explain a generalization, but first it is useful to formalize in a different way the idea that Alice is trying to teleport an arbitrary unknown quantum state. For this, we add another system R, to which Alice does not have access. We assume that R is maximally entangled with A0 in a known state, say

ΨRA0 = (1/√2)(|0 0〉 + |1 1〉).

In this version of the problem, Alice’s goal is to manipulate her system A0A1 in some way, and then tell Bob what to do so that in the end the system RB1 will be in the same state

ΨRB1 = (1/√2)(|0 0〉 + |1 1〉)

that RA0 was previously – with R never being touched. In this version of the problem, the combined system RAB = RA0A1B1 starts in a pure state ΨRAB = ΨRA0 ⊗ ΨA1B1. The solution of this version of the problem is the same as the other one: Alice makes the same measurements and sends the same instructions as before.

Page 72

We can understand better what is happening if we take a look at the conditional entropy of the system AB = A0A1B1. Since A1B1 is in a pure state, it does not contribute to SAB and SAB = SA0 = 1 (A0 is maximally mixed, since it is maximally entangled with R). Also SB = 1 since B = B1 is maximally entangled with A1. So

S(A|B) = SAB − SB = 1 − 1 = 0.

It turns out that this is the key to quantum teleportation: teleportation is possible when and only when

S(A|B) ≤ 0.

Page 73

Let me explain why this is a necessary condition. We start with an arbitrary system RAB in a pure state ΨRAB; Alice has access to A, Bob has access to B, and no one has access to R. For teleportation, Alice will measure her system A using some rank 1 orthogonal projection operators πi. No matter what answer she gets, after the measurement, system A is in a pure state and therefore RB is also in a pure, generally entangled state. For teleportation, Alice has to choose the πi so that, no matter what the outcome she gets, the density matrix ρR of R is the same as before. If this is so, then after her measurement, system RB is in an entangled pure state with ρR unchanged. Any two such states can be converted into each other by a unitary transformation of system B (Exercise!) which Bob can implement. Since she knows her measurement outcome, Alice knows which entangled state RB is in and knows what instructions to give Bob.

Page 74

But do projection operators of Alice’s system with the necessary properties exist? The initial state ΨABR is pure so it has

SAB = SR.

Bob’s density matrix at the beginning is

ρB = TrRA ρRAB

where ρRAB is the initial pure state density matrix. By definition

SB = S(ρB).

If Alice gets measurement outcome i, then Bob’s density matrix after the measurement is

ρ^i_B = (1/pi) TrRA πi ρRAB.

Note that

ρB = ∑i pi ρ^i_B,

since ∑i πi = 1.

Page 75

After the measurement, since A is in a pure state, RB is also in a pure state Ψ^i_RB, so S(ρ^i_B) = SR. But by hypothesis, the measurement did not change ρR, so SR is unchanged and so equals the original SAB. Hence

S(ρ^i_B) = SAB.

If all this is possible

SAB = S(ρ^i_B) = ∑i pi S(ρ^i_B).

But earlier we found, using concavity, a general entropy inequality for mixing; if as here ρB = ∑i pi ρ^i_B then

S(ρB) ≥ ∑i pi S(ρ^i_B).

So if teleportation can occur,

SAB = ∑i pi S(ρ^i_B) ≤ S(ρB) = SB

and hence S(A|B) = SAB − SB ≤ 0.

Actually, S(A|B) ≤ 0 is sufficient as well as necessary for teleportation, in the following sense. One has to consider the problem of teleporting not a single system but N copies of the system for large N. (This is a standard simplifying device in information theory, which unfortunately we haven’t had time to explore. It generalizes the idea of a long classical message that we started with.) So one takes N copies of system RAB for large N, thus replacing RAB by R⊗NA⊗NB⊗N. This multiplies all the entropies by N, so it preserves the condition S(A|B) ≤ 0. Now Alice tries to achieve teleportation by making a complete projective measurement on her system A⊗N. It is very hard to find an explicit set of projection operators πi with the right properties, but it turns out, remarkably, that for large N, a random choice will work (in the sense that with a probability approaching 1, the error in teleportation vanishes as N → ∞). This statement actually has strong subadditivity as a corollary. (This is not the approach that I will take tomorrow.) What I’ve described is the “state merging” protocol of Horodecki, Oppenheim, and Winter; see also Preskill’s notes.

We actually can now give a good explanation of the meaning of quantum conditional entropy S(A|B). Remember that classically S(A|B) measures how many additional bits of information Alice has to send to Bob after he has already received B, so that he will have full knowledge of A. Quantum mechanically, suppose that S(A|B) > 0 and Alice nevertheless wants to share her state with Bob, minimizing the quantum communication required. To do this, she first creates some maximally entangled qubit pairs and sends half of each pair to Bob. Each time she sends Bob half of a pair, SAB is unchanged but SB goes up by 1 (check!), so S(A|B) = SAB − SB goes down by 1. So S(A|B), if positive, is the number of such qubits that Alice must send to Bob to make S(A|B) negative and so make teleportation possible.
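The “check!” can be done directly; the following sketch (a toy five-qubit example of my own) starts from S(A|B) = 1 and verifies that handing Bob half of one fresh Bell pair leaves SAB alone, raises SB by 1, and so drives S(A|B) down to 0.

import numpy as np

def entropy(rho, tol=1e-12):
    w = np.linalg.eigvalsh(rho)
    w = w[w > tol]
    return float(-np.sum(w * np.log2(w)))

def reduced(psi, dims, keep):
    """Reduced density matrix of a pure state on the subsystems in keep (sorted)."""
    t = psi.reshape(dims)
    drop = [i for i in range(len(dims)) if i not in keep]
    m = np.tensordot(t, t.conj(), axes=(drop, drop))
    d = int(np.prod([dims[i] for i in keep]))
    return m.reshape(d, d)

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
ket0 = np.array([1.0, 0.0])

# qubit order: R, A, B, A', B'.  R-A is a Bell pair and B starts in |0>,
# so S(A|B) = 1 - 0 = 1 > 0.  Alice then makes the A'-B' Bell pair and ships B' to Bob.
psi = np.kron(np.kron(bell, ket0), bell)
dims = (2, 2, 2, 2, 2)

S_AB = entropy(reduced(psi, dims, [1, 2, 3, 4]))   # everything Alice and Bob hold together
S_B_before = entropy(reduced(psi, dims, [2]))      # Bob with only B
S_B_after = entropy(reduced(psi, dims, [2, 4]))    # Bob with B and B'
print(S_AB - S_B_before, S_AB - S_B_after)         # S(A|B): 1.0 before, 0.0 after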

If S(A|B) is negative, teleportation is possible to begin with and −S(A|B) is the number of entangled qubit pairs that Alice and Bob can be left with after teleportation.

Now we are going to address the following question: how many bits of information can Alice send to Bob by sending him a quantum system X with an N-dimensional Hilbert space H? One thing Alice can do is to send one of N orthogonal basis vectors in H. Bob can find which one she sent by making a measurement. So in that way Alice can send log N bits of information. We will see that in fact it is not possible to do better.

We suppose that Alice wants to encode a random variable x that takes the values xi, i = 1, . . . , n, with probability pi. When x = xi, she writes down this fact in her notebook C and creates a density matrix ρi on system X. If |i〉 is the state of the notebook when Alice has written that x = xi, then on the combined system CX, Alice has created the density matrix

ρ̂ = ∑i pi |i〉〈i| ⊗ ρi .

Then Alice sends the system X to Bob. Bob’s task is to somehow extract information by making a measurement.

Before worrying about what Bob can do, let us observe that the density matrix ρ̂ of the system CX is the one that I mentioned earlier in discussing the entropy inequality for mixing, so the mutual information between C and X is

I(C,X) = S(ρ) − ∑i pi S(ρi),

where ρ = ∑i pi ρi is the density matrix of X by itself.

Since S(ρi) ≥ 0 and S(ρ) ≤ log N, it follows that

I(C,X) ≤ log N.

If we knew that quantum mutual information has a similar interpretation to classical mutual information, we would stop here and say that since I(C,X) ≤ log N, at most log N bits of information about the contents of Alice’s notebook have been encoded in X. The problem with this is that, a priori, quantum mutual information is a formal definition. Our present goal is to show that this formal definition does behave as we might hope.
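As a sanity check on the bound itself, here is a sketch (the three-state qubit ensemble is just an illustrative choice of mine) that evaluates I(C,X) = S(ρ) − ∑i pi S(ρi) for non-orthogonal pure states in a two-dimensional Hilbert space and confirms it stays below log N = 1 bit.

import numpy as np

def entropy(rho, tol=1e-12):
    w = np.linalg.eigvalsh(rho)
    w = w[w > tol]
    return float(-np.sum(w * np.log2(w)))

def qubit_state(theta):
    """Pure qubit density matrix for cos(theta)|0> + sin(theta)|1>."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.outer(v, v)

# Alice encodes x_i (i = 1, 2, 3) with probabilities p_i into non-orthogonal qubit states.
p = np.array([0.5, 0.3, 0.2])
states = [qubit_state(t) for t in (0.0, np.pi / 3, 2 * np.pi / 3)]

rho = sum(pi * r for pi, r in zip(p, states))      # the density matrix of X by itself
I_CX = entropy(rho) - sum(pi * entropy(r) for pi, r in zip(p, states))
print(I_CX, "<=", np.log2(2))                      # strictly below one bit for this ensemble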

What can Bob do on receiving system X? The best he can do is to combine it with some other system which may include a quantum system Y and a measuring apparatus C′. He acts on the combined system XYC′ with some sort of quantum channel. Then he forgets X and Y and looks at C′ – that is, he takes a partial trace over X and Y, another quantum channel. The output of the resulting quantum channel is a density matrix of the form

ρC′ = ∑α qα |α〉〈α|, α = 1, . . . , r,

where |α〉 are distinguished states of C′ – the states that one reads in a classical sense. The outcome of Bob’s measurement is a probability distribution {qα} for a random variable whose values are labeled by α. What Bob learns about the contents of Alice’s notebook is the classical mutual information between Alice’s probability distribution {pi} and Bob’s probability distribution {qα}. Differently put, what Bob learns is the mutual information I(C,C′).

To analyze this, we note that before Bob does anything, I(C,X) is the same as I(C,XYC′) because YC′ (Bob’s auxiliary quantum system Y and his measuring apparatus C′) is not coupled to CX. Bob then acts on XYC′ with a quantum channel, which can only reduce I(C,XYC′), and then he takes a partial trace over XY, which also can only reduce the mutual information since monotonicity of mutual information tells us that

I(C,XYC′) ≥ I(C,C′).

So

log N ≥ I(C,X) = I(C,XYC′)before ≥ I(C,XYC′)after ≥ I(C,C′)after,

where “before” and “after” mean before and after Bob’s manipulations. Thus Alice cannot encode more than log N bits of classical information in an N-dimensional quantum state, though it takes strong subadditivity (or its equivalents) to prove this.
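Since the chain of inequalities is a statement about channels, it can be probed numerically. In the sketch below (a depolarizing channel standing in for Bob's manipulations is my own toy choice), the quantity S(ρ) − ∑i pi S(ρi), which is the quantum mutual information I(C,X) of the state ρ̂, never increases when a channel acts on X.

import numpy as np

def entropy(rho, tol=1e-12):
    w = np.linalg.eigvalsh(rho)
    w = w[w > tol]
    return float(-np.sum(w * np.log2(w)))

def I_CX(p, states):
    """I(C,X) = S(sum_i p_i rho_i) - sum_i p_i S(rho_i)."""
    rho = sum(pi * r for pi, r in zip(p, states))
    return entropy(rho) - sum(pi * entropy(r) for pi, r in zip(p, states))

def depolarize(rho, lam):
    """Depolarizing channel on a qubit: mix rho with the maximally mixed state."""
    return (1 - lam) * rho + lam * np.eye(2) / 2

rng = np.random.default_rng(1)
for _ in range(50):
    p = rng.dirichlet(np.ones(3))
    vs = [rng.normal(size=2) + 1j * rng.normal(size=2) for _ in range(3)]
    states = [np.outer(v, v.conj()) / np.vdot(v, v).real for v in vs]
    before = I_CX(p, states)
    after = I_CX(p, [depolarize(r, 0.4) for r in states])
    assert after <= before + 1e-9
print("I(C,X) never increased when a channel acted on X")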

Finally, we can give a physical meaning to the relative entropy S(ρ||σ) between two density matrices ρ, σ. Recall that classically, if we believe a random variable is governed by a probability distribution Q but it is actually governed by a probability distribution P, then after N trials the ability to disprove the wrong hypothesis is controlled by

2^{−NS(P||Q)}.

A similar statement holds quantum mechanically: if our initial hypothesis is that a quantum system X has density matrix σ, and the actual answer is ρ, then after N trials with an optimal measurement used to test the initial hypothesis, the confidence that the initial hypothesis was wrong is controlled in the same sense by

2^{−NS(ρ||σ)}.
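For concreteness, here is a minimal sketch (the two qubit density matrices are arbitrary full-rank examples of mine) that evaluates S(ρ||σ) = Tr ρ(log ρ − log σ); by the statement above, this single number sets the best achievable error exponent per copy.

import numpy as np

def mat_log2(a, tol=1e-12):
    """Matrix logarithm (base 2) of a positive matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, tol, None)                      # assumes full-rank inputs
    return v @ np.diag(np.log2(w)) @ v.conj().T

def relative_entropy(rho, sigma):
    """S(rho||sigma) = Tr rho (log rho - log sigma), in bits."""
    return float(np.trace(rho @ (mat_log2(rho) - mat_log2(sigma))).real)

rho = np.array([[0.8, 0.3], [0.3, 0.2]])
sigma = np.array([[0.7, -0.2], [-0.2, 0.3]])

print(relative_entropy(rho, sigma))                # optimal exponent per copy, in bits
print(relative_entropy(sigma, rho))                # note the asymmetry of S(.||.)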

Let us first see that monotonicity of relative entropy implies that one cannot do better than that. A measurement is a special case of a quantum channel, in the following sense. To measure a system X, you let it interact quantum mechanically with some other system YC where Y is any quantum system and C is the measuring device. After they interact, you look at the measuring device and forget the rest. Forgetting the rest is a partial trace that maps a density matrix βXYC to βC = TrXY βXYC. If C is a good measuring device, this means that in a distinguished basis |α〉 its density matrix βC will have a diagonal form

βC = ∑α bα |α〉〈α|.

The “measurement” converts the original density matrix into the probability distribution {bα}.

So when we try to distinguish ρ from σ, we use a quantum channel plus partial trace that maps ρ and σ into density matrices for C

ρC = ∑α rα |α〉〈α|,    σC = ∑α sα |α〉〈α|,

and thereby into classical probability distributions R = {rα} and S = {sα}. The only way to learn that ρ and σ are different is by observing that R and S are different, a process controlled by

2^{−NScl(R||S)},

where Scl(R||S) is the classical relative entropy between R and S. This is the same as the relative entropy between ρC and σC :

S(ρC ||σC ) = Scl(R||S).

And monotonicity of relative entropy gives

S(ρ||σ) ≥ S(ρC ||σC ).

So finally S(ρ||σ) gives an upper bound on how well we can do:

2^{−NScl(R||S)} ≥ 2^{−NS(ρ||σ)}.
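This inequality is easy to watch in action; the sketch below (same toy qubit states as before, with random orthonormal measurement bases) confirms that none of the sampled projective measurements produces outcome distributions R, S with Scl(R||S) above S(ρ||σ).

import numpy as np

def mat_log2(a, tol=1e-12):
    w, v = np.linalg.eigh(a)
    w = np.clip(w, tol, None)
    return v @ np.diag(np.log2(w)) @ v.conj().T

def S_quantum(rho, sigma):
    return float(np.trace(rho @ (mat_log2(rho) - mat_log2(sigma))).real)

def S_classical(r, s):
    return float(np.sum(r * (np.log2(r) - np.log2(s))))

rho = np.array([[0.8, 0.3], [0.3, 0.2]])
sigma = np.array([[0.7, -0.2], [-0.2, 0.3]])
bound = S_quantum(rho, sigma)

rng = np.random.default_rng(2)
for _ in range(200):
    g = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    U, _ = np.linalg.qr(g)                                 # columns = an orthonormal basis |alpha>
    r = np.einsum('ia,ij,ja->a', U.conj(), rho, U).real    # r_alpha = <alpha|rho|alpha>
    s = np.einsum('ia,ij,ja->a', U.conj(), sigma, U).real
    assert S_classical(r, s) <= bound + 1e-9
print("Scl(R||S) never exceeded S(rho||sigma) over the sampled measurements")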

In the limit of large N, it is actually possible to saturate this bound, as follows.

If ρ is diagonal in the same basis in which σ is diagonal, then by making a measurement that involves projecting on 1-dimensional eigenspaces of σ, we could convert the density matrices ρ, σ into classical probability distributions R, S with S(ρ||σ) = Scl(R||S). The quantum problem would be equivalent to a classical problem, even without taking N copies. As usual the subtlety comes because the matrices are not simultaneously diagonal. By dropping from ρ the off-diagonal matrix elements in some basis in which σ is diagonal, we can always construct a diagonal density matrix ρD. Then a measurement projecting on 1-dimensional eigenspaces of σ will give probability distributions R, S satisfying

S(ρD ||σ) = Scl(R||S).

But it is hard to compare S(ρD||σ) to S(ρ||σ). That is why it is necessary to take N copies, which does make it possible to compare S(ρD||σ) to S(ρ||σ), as we will see.
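A single-copy sketch (same example states as above, my own choice) makes the difficulty visible: dephasing ρ in the eigenbasis of σ produces ρD with S(ρD||σ) = Scl(R||S), but when ρ and σ fail to commute this falls strictly short of S(ρ||σ), so one copy alone does not saturate the bound.

import numpy as np

def mat_log2(a, tol=1e-12):
    w, v = np.linalg.eigh(a)
    w = np.clip(w, tol, None)
    return v @ np.diag(np.log2(w)) @ v.conj().T

def S_rel(rho, sigma):
    return float(np.trace(rho @ (mat_log2(rho) - mat_log2(sigma))).real)

rho = np.array([[0.8, 0.3], [0.3, 0.2]])
sigma = np.array([[0.7, -0.2], [-0.2, 0.3]])

# Drop the off-diagonal elements of rho in the eigenbasis of sigma.
w, v = np.linalg.eigh(sigma)
rho_in_sigma_basis = v.conj().T @ rho @ v
rho_D = v @ np.diag(np.diag(rho_in_sigma_basis)) @ v.conj().T

r = np.diag(rho_in_sigma_basis).real               # outcome distribution R
s = w                                              # outcome distribution S
S_cl = float(np.sum(r * (np.log2(r) - np.log2(s))))

print(S_rel(rho_D, sigma), S_cl)                   # equal: rho_D and sigma commute
print(S_rel(rho, sigma))                           # strictly larger: the single-copy gap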

Taking N copies replaces the Hilbert space H of system X by H⊗N, and replaces the density matrices ρ, ρD, and σ by ρ⊗N, ρ⊗ND, and σ⊗N. Let us recall the definition of relative entropy:

S(ρ⊗N ||σ⊗N) = Tr ρ⊗N log ρ⊗N − Tr ρ⊗N log σ⊗N .

The second term Tr ρ⊗N log σ⊗N is unchanged if we replace ρ⊗N by its counterpart ρ⊗ND that is diagonal in the same basis as σ⊗N. So

S(ρ⊗N ||σ⊗N) − S(ρ⊗ND ||σ⊗N) = Tr ρ⊗N log ρ⊗N − Tr ρ⊗ND log ρ⊗ND .

The reason that we will get simplification for large N is that ρ⊗N commutes with the group of permutations of the N factors of H⊗N. Therefore, ρ⊗N is block diagonal in a basis of irreducible representations of the permutation group. Although H⊗N has an exponentially large dimension k^N for large N (where k = dim H), the irreducible representations have much smaller dimensions, with an upper bound of the form aN^b, where a and b depend on k but not on N.
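For qubits (k = 2) this dimension counting can be made explicit with the standard decomposition of N spin-1/2 systems into total-spin sectors (a sketch of mine, assuming only that familiar SU(2) result): ρ⊗N acts within the spin-j sector as a matrix of size 2j + 1 ≤ N + 1 repeated with some multiplicity mj, so the blocks carrying nontrivial matrix structure grow only linearly in N even though dim H⊗N = 2^N.

from math import comb

def c(n, k):
    return comb(n, k) if 0 <= k <= n else 0

def spin_sectors(N):
    """(block dimension 2j+1, multiplicity m_j) for N qubits, with 2j = N mod 2, ..., N."""
    sectors = []
    for twoj in range(N % 2, N + 1, 2):
        m = c(N, (N - twoj) // 2) - c(N, (N - twoj) // 2 - 1)
        sectors.append((twoj + 1, m))
    return sectors

for N in (4, 8, 16, 32):
    sectors = spin_sectors(N)
    assert sum(dim * mult for dim, mult in sectors) == 2 ** N   # sectors fill the space
    largest_block = max(dim for dim, _ in sectors)
    print(N, largest_block, 2 ** N)    # largest block is N+1, versus total dimension 2^N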

In a basis of irreducible representations of the permutation group, ρ⊗N is block diagonal:

ρ⊗N = p1ρ1 ⊕ p2ρ2 ⊕ p3ρ3 ⊕ · · ·

where in the ith block ρi is a density matrix on the ith irreducible representation, and ∑i pi = 1. In a basis in which σ⊗N is diagonalized within each block, ρ⊗ND is obtained by replacing each of the ρi with its diagonal version ρi,D :

ρ⊗ND = p1ρ1,D ⊕ p2ρ2,D ⊕ p3ρ3,D ⊕ · · ·

One finds then

Tr ρ⊗N log ρ⊗N − Tr ρ⊗ND log ρ⊗ND = ∑i pi (S(ρi,D) − S(ρi)).    (∗)

Any density matrix on an n-dimensional space has an entropy S bounded by 0 ≤ S ≤ log n. Because the sizes of the blocks are bounded by aN^b, and ∑i pi = 1, the right hand side of (∗) is bounded in absolute value by log(aN^b), which for large N is negligible compared to N.

Putting these facts together, for large N, a measurement that projects onto 1-dimensional eigenspaces of σ⊗N within each block is good enough to saturate the bound

2^{−NScl(R||S)} ≥ 2^{−NS(ρ||σ)}

within an error of order log N in the exponent. This confirms that quantum relative entropy has the same interpretation as classical relative entropy: it controls the ability to show, by a measurement, that an initial hypothesis is incorrect.

(I followed a paper by M. Hayashi.)