Some Intuition Around Entropy

[February 23, 2018]

(Only interesting if you already know some things about information theory, probably)
(Disclaimer: Notes. Don’t trust me, I’m not, like, a mathematician.)

I have been reviewing concepts from Information Theory this week, and I’ve realized that I never quite really understood what (Shannon) Entropy was all about.

Specifically: I have finally understood how entropy is not a property of probability distributions per se, but a property of streams of information. When we talk about ‘the entropy of a probability distribution’, we’re implicitly talking about the stream of information produced by sampling from that distribution. Some of the equations make a lot more sense when you keep this in mind.

1 Entropy

Information in the information theoretic sense is a numerical quantity: ‘the number of bits required to store data’. For example, an 8-bit value (an integer between 0 and 255) takes 8 bits of data to store or transmit. A single binary digit, or the outcome of a coin flip encoded as a binary digit, takes 1 bit of data.

Information is typically measured in bits, which is just a unit, like meters or radians. If a single variable can take \(n\) possible values, it takes \(\log_2 (n)\) bits to represent. ‘Bits’ is just a unit, and you can easily use others, which corresponds to using \(\ln (n)\) (‘nats’) or \(\log_{10} (n)\) (‘dits’). As with most other values in math, it turns out to be totally possible to talk about fractional bits, so a value can take \(1.5\) bits to represent, although making sense of that takes a bit of work.

Entropy is often confused with Information but it’s slightly different: it is the average amount of Information which is learned by learning an unknown value, such as the outcome of a random variable that’s selected from a probability distribution. It’s a functional1 which maps a probability distribution to a quantity of Information. It usually takes the symbol \(H\) in information theory or \(S\) in physics. I’ll use \(H\).

The formula for the entropy of a discrete probability distribution \(X\) is:

\[H[p] = -\sum_{x} p(x) \log p(x)\]

This is, however, obtuse. It is better thought of as the ‘expected value of \(-\log p(x)\)’:

\[H[p] = \bb{E}[- \log p(x)]\]

\(- \log p(x)\) is a way of writing the amount of information that comes from seeing a value \(x\) with probability \(p(x)\). It’s better written \(\log \frac{1}{p(x)}\). This somehow captures the amount of information that we learn if we get the result \(x\), and so entropy somehow measures the expected value of information when sampling from this distribution.

Note that the negative sign is misleading: since \(p(x)\) is always less than 1, \(\log p(x) < 0\). Also, since \(0 \log 0 = 0\), because \(\lim_{p\ra 0^{+}} p \log p = \lim_{N \ra \infty} \frac{-\log N}{N} = 0\), 0-probability events never happen and contribute nothing to entropy.

Some examples: the entropy of a fair coin flip is, as expected, 1 bit:

\[H = - p(\tt{H}) \log p(\tt{H})- p(\tt{T}) \log p(\tt{T}) = -\frac{1}{2} \log \frac{1}{2} -\frac{1}{2} \log \frac{1}{2} = 1\]

The entropy of a biased coin that has \(p(\tt{H}) = \frac{3}{4}\) is:

\[H = - \frac{3}{4} \log \frac{3}{4} - \frac{1}{4} \log \frac{1}{4} = 2 - \frac{3}{4} \log 3 \approx 1.642\]

I’m not going to go into it in detail, but the meaining of fractional bits in this case is best interpreted like this: as we make a sequence of \(N \ra \infty\) outcomes from this unfair coin (with \(\ra \frac{3}{4}N\) heads), it will be possible to compress that sequence into a sequence of length \(\ra 1.642N\) bits, since that is another way to communicate the same volume of information. I don’t know how you would encode it in this way, but roughly you would use some kind of alternate representation than heads+tails that builds in the fact that heads occur more often and so saves space by assuming the next value will be a head.


2 Microstates vs Macrostates

The discrete entropy formula \(H[p] = \bb{E}[- \log p]\) is usually derived from some axioms about how it should behave, which I find useless for intuition. Instead, here’s the easiest way to see way to see why it’s defined that way.

Imagine a space of outcomes of some event, like “the results of \(X\) fair coin flips”, which we will describe using two different descriptions, one of macrostates which describes how many \(\tt{H}\) and \(\tt{T}\) were seen, and one of microstates, which describes the actual underlying sequence of results.

  • At the microstate level, the system has \(M = 2^{X}\) equiprobable outcomes, which takes \(X\) bits of data to specify an outcome: the exact value of each coin flip.
  • At the macrostate level, there is a probability distribution that captures the chance \(p(x \tt{ Heads})\) of getting exactly \(x\) heads, given by \(p(x) = \binom{X}{x}/M\).

Specifically, the chance of \(x\) Heads is:

\[p(x \tt{ Heads}) = \frac{\langle \tt{number of ways to get exactly x Heads} \rangle}{\langle \tt{possible outcomes}\rangle}\]

Let’s write the numerator as \(\binom{X}{x} = m_x\), so \(p(x) = \binom{X}{x}/M = \frac{m_{x}}{M}\). Thus, each macrostate “\(x\) Heads” corresponds to \(m_{x}\) possible microstates, out of the \(M\) total possible microstates, with \(p(x \; \tt{Heads}) = \frac{m_{x}}{M}\). It’s clear that \(\sum m_{x} = \sum \binom{X}{x}\) has to equal \(M = 2^{X}\) for this to be a probability distribution, and it does.

In this case we can clearly write down the entropy of learning a specific microstate (it’s \(\log M\)) or any particular macro state (it’s \(\log m_x\)). Thus, if we learn a macrostate \(x\), we have learned \(\log m_x\) bits of information, and there are \(\log M - \log m_x\) bits left which we still do not know. So the entropy of the macrostate distribution is the difference:

\[\begin{aligned} H[p(x)] &= \sum_{x} p(x) (\log M - \log m_x) \\ &= \sum_x p(x) \log \frac{M}{m_x} \\ &= - \sum_x p(x) \log \frac{m_x}{M} \\ &= - \sum_x p(x) \log p(x) \\ &= \bb{E}[ - \log p(x) ]\end{aligned}\]

This is basically the idea. The underlying distrubution takes \(\log M\) bits to specify, and after various outcomes we are left with, on average, \(\log m_x\) bits left to learn, so we learn, on average, \(\bb{E}[\log M - \log m_x]\) bits. Basically:

\[H[p(x)] = \bb{E}[\langle \tt{information learned} \rangle] \\ = \langle \tt{total information} \rangle - \bb{E}[\langle \tt{remaining unknown information} \rangle]\]

We see that the quantity \(I(x) = - \log p(x) = \log \frac{1}{p(x)}\) term can be understood as: “if you tell me we’re in state x, I still have to learn \(\log m_{x}\) bits of data to get to the exact microstate, meaning that you just told me \(\log M - \log m_{x}\) bits of data.” (This quantity is called the self-entropy or “surprisal” of outcome \(x\).)


This shows that the two \(p\)s in the usual expression of entropy mean different things. Really, the usual formula is just a particular expression for \(\bb{E}[I(x)]\), where \(x \in X\) is an event that can happen with probability \(p\). I think it should be viewed as incidental that the expression for \(I(x)\) happens to include \(p(x)\) also: there’s nothing about \(H = \bb{E}[\tt{information}]\) that requires the information come in the form of probabilistic events. There’s a probability distribution for how much information we learn, but it doesn’t matter where that information comes from.

For instance: if I just decide to tell you 1 bit of information with probability \(\frac{1}{2}\) and 100 bits with probability \(\frac{1}{2}\)”, you can still calculate an entropy \(H = \frac{1}{2}(1 + 100) = 50.5\) bits. It doesn’t matter that those bits themselves come from occurences of events with probabilities. It’s just any information.

For a more interesting example: there’s no requirement that the macrostate descriptions be exclusive, as probability distributions are. If I flip 100 coins, maybe I’ll tell you “whether I got more than 50 heads” (1 bit of information) and “the exact sequence of results” (100 bits of information) with equal probability. That’s still \(50.5\) bits of data, but it’s not a probability distribution over the set of possible macrostate descriptions – some of the macrostates overlap.

(This is well-known but I had not really understood it before today, and I think it could use more emphasis. In fact it was Shannon’s original model anyway.)

I think it is appropriate to view entropy not as “the data required to encode a stream of events”, but “the data required to re-encode a stream into another language”:

  • \(H[p]\) tells us how much information we learn from moving from the ‘trivial language’ (“something happened!”) to another language (“there are \(x\) heads”).
  • \(I(x)\) is the remaining information required to move from a macrostate language to a microstate one (“the exact sequence was…”).
  • there is no reason we couldn’t change together more description languages.
  • the unit of “bits” means “the data required to re-encode this sequence of labeled events into binary code”.

This interpretation becomes important when we try to generalize our equation for entropy to continuous distributions, because we can’t encode a stream of infinite-precision events perfectly.


3 The Continuous Analog of Entropy

The naive way to generalize \(- \sum_x p(x) \log p(x)\) to a continuous distribution \(p(x)\) would be

\[H[p] \stackrel{?}{=} - \int_{\bb{X}} p(x) \log p(x) dx\]

But we realize that this can’t be right, using the intuition from above. The real definition is \(\bb{E}[I(x)]\), and it’s no longer true that \(I(x) = - \log p(x)\) when \(x\) is continuous. The probability that \(x\) equals an exact value is \(0\), and so we would get that \(I(x) = - \log 0 = \infty\)! Specifying the exact value would take ‘infinite’ information.

The problem is that \(p(x)\) would be promoted to a density function, and we need to use \(p(x) dx\) to compute any actual probabilities. Suppose we take as our ‘events’ the occurence of \(x\) in a range \((a,b)\):

\[P[x \in (a,b)] = \int_{(a,b)} p(x) dx\]

Thus:

\[I(x \in (a,b)) = - \log \int_{(a,b)} p(x) dx\]

Which seems reasonable. But what if \((a,b) \ra (a,a)\)?

\[\lim_{\e \ra 0} - \log \int{(a, a + \e)} p(x) dx\]

if \(p(x)\) has antiderivative \(P(x)\), then \(\int_{(x,x+\e)} p(x') dx' = P(x + \e) - P(x) \approx \e p(x)\), giving:

\[I(x \in (a,a)) \stackrel{?}{=} - \log \e p(x) = \log \frac{1}{\e} - \log p(x) = \infty - \log p(x)\]

What does that mean?

Actually, it’s pretty reasonable. Of course it takes infinite information to zoom in completely on a point \(a\) – we have measured a value with infinite granularity, and thus gained infinite information! It seems as if there are two parts to the information: an infinite part, due to the “information of the continuum”, and a finite part, due to the non-uniformity of the probability distribution, which has the same \(-\log p(x)\) form that discrete information has.

Consider a uniform distribution \(\cal{U}(0, 1)\) on \((0,1)\), with density function \(1_{x \in (0,1)}\), it makes sense that “dividing the range in half” takes exactly 1 bit of data (to specify which half):

\[H[\cal{U}(0, \frac{1}{2})] = H[\cal{U}(0, 1)] - \log \frac{1}{2} = H[\cal{U}(0, 1)] + 1\]

And formulas like this one work even if specifying the exact value would take infinite data due to the continuum.


Since we can’t write down a finite value for \(U = H[\cal{U}(0, 1)]\), perhaps we can instead just reduce the distribution for \(P\) to a function of \(U\). We measure “how different \(p(x)\) is from uniform”, without fully reducing it the information required to specify exact values for \(x\). Suppose we partition the space into tiny buckets of width \(\D x\), and then later we will let the bucket size go to 0.

The uniform distribution \(\cal{U}(0,1)\) will contain \(\frac{1}{\D x}\) buckets, and thus have \(U(\D x) = - \log \D x = H_{\D x}[\cal{U}(0,1)]\) entropy. A uniform distribution \(\mathcal{U}(a,b)\) would contain \(\frac{b-a}{\D x}\) buckets, so specifying an exact bucket would take \(\log (b-a) - \log(\D x)\) information. Even if \(b-a < 1\), this is true for some sufficiently small bucket size. The first term is unaffected by the partitioning, and the second is \(U(\D x)\). As \(\D x \ra 0\), this becomes “entropy of the continuum”.

Thus, for a non-uniform distribution \(p(x)\), we just need to zoom in until \(p(x) \approx p(x + \D x)\), so that the distribution over each bucket is basically uniform. The information to specify a bucket of with \(\D x\) is:

\[\begin{aligned} I(x, x+\D x) &= - \log P[x, x+\D x] \\ &= - \log \int_{(x, x + \D x)} p(x') dx' \\ &\approx - \log (p(x) \D x) \\ &= - \log p(x) + U(\D x) \end{aligned}\]

And therefore the entropy of a continuous distribution can be given by:

\[\begin{aligned} H[p] &= \bb{E}[I(x)] \\ &= \int p(x) (- \log p(x) + U(\D x)) dx \\ &= - \int p(x) \log p(x) dx + U \\ &= - \int p(x) \log p(x) dx + H[\cal{U}(0,1)] \end{aligned}\]

The first term is the “naive” entropy formula. The second is the “entropy of the continuum”, and must be ignored. Put differently, we can’t truly compute \(H[p]\); what we actually compute is:

\[- \int p(x) \log p(x) dx = H[p] - H[\cal{U}(0,1)]\]

This is the entropy associated with “changing languages” into a bunch of tiny buckets of arbitrarily small width. Or, if you prefer, it’s the change of variables which makes \(p(x)\) by distorting the \(x\) axis. The same argument seems to work if our variable isn’t continuous, but is just granular at a much smaller length scale than we’re can see. Then we could find a meaningful value for the second term, but we’ll still want to discard it.)

This value \(- \int p(x) \log p(x) dx\) is called differential entropy, and it’s not the limit of discrete formula, but as we see it is still meaningful.

Note that differential entropy can be negative, because it’s easy to write down a probability distribution on a continuous space that has less entropy than \(\cal{U}(0,1)\): for example, any \(\cal{U}(a,b)\) with \((b-a) < 1\). I assume that saying distribution \(A\) has negative entropy relative to distribution \(B\) means that you encode \(B\) into \(A\) efficiently, rather than the other way around!

An important point: the differential entropy is not invariant under changing variables for \(p(x)\). Why? Because it’s relative to a uniform distribution on the variable \(x\). The differential entropy on another variable \(u = f(x)\) may rescale \(\cal{U}(0,1)\). We would need to also compute the entropy due to rescaling \(H[\cal{U}(0,1)]\). The most we can say is that:

\[H[p(y)] - H[\mathcal{U}(y)] = H[p(x)] - H[\mathcal{U}(x)]\]

Which does not mean that \(H[p(y)] = H[p(x)]\).

If this wasn’t so long, I’d go on and talk about relative entropy, which generalizes this idea of the entropy of a distribution relative to another. I also need to learn it first. Maybe another time!


4 Final Thoughts

I think it’s funny that entropy seems so important to us. The entire theory is just, like, ‘the logarithm of probability theory’. If you have two probability spaces that take \(A\) and \(B\) possible states, you could write that the compression ratio is \(R = \frac{A}{B}\), or you could write that the entropy is \(\log R = \log A - \log B\). If you have two encodings which multiply information density by \(X\) and \(Y\), they compose to multiply by \(XY\), and thus reduce information by \(-\log X - \log Y\). Etc. They really are the same thing.

A lot of this suggests that the the right way to combine probability distributions (or re-encodings) is to multiply their probability distributions, and the reason this uses so many logarithms is that we are pretty good at adding numbers together in expectations (cf integrals, expectation values), and not very good at multiplying them. We’d really like some sort of ‘multiplicative’ expectation value, but we aren’t used to using that so we take logarithms to make everything easier to write!

In an expectation \(\bb{E}[x]\) we compute \(\sum x p(x)\). The ‘multiplicative expectation’ operation, \(\bb{G}[p]\), would presumably multiply everything together weighted by its probability:

\[\bb{G}[x] = \prod x^{p}\]

On a discrete uniform distribution this is just the geometric mean. For a continuous analog would would need a sort of ‘multiplicative integral’, which are a thing, albeit an obscure one. One strong reason for this is that Bayes’ rule in probability is already a multiplication (of probabilities expressed as odds \((a:b) \times (c:d) = (ab : cd)\)), so there are other places where it seems like probability is more naturally expressed in multiplicative terms than additive ones.

Entropy in terms of \(\bb{G}\) is just \(\log \bb{G}[p^{-1}] = \bb{E}[-\log p]\), and we could define everything in this page using \(\bb{G}\), it would just look less familiar. Moreover, I assume there are similar versions for any other generalized mean.

Basically I suspect that we use logarithms to write things in bits, flattening all our ratios out into sums, only because we happen to be better at addition than multiplication. Something to think about.


Also, I wonder if more can be said about exactly how we map languages into each other. When we say a space \(P\) can be represented by \(H[P]\) bits, we literally mean it: there exists a function with the property of mapping strings of bits to elements with an average ratio of \(H[P]\), and no function can exist that does better than that. But I’m curious about trying to implement these mappings on, say, different kinds of formal languages. For instance – the space of sequences of coin flips until tails \(\{ \tt{T}, \tt{HT}, \tt{HHT}, \ldots\}\) clearly is represented by a regular language H*T. Maybe we can make some kind of more correspondence between a state machine of binary strings to a state machine of this sequence as the encoding.


Also, I’m annoyed that the seemingly coherent concept of differential entropy makes yet another example where it seems that our fear of infinities is a real problem. Like, can’t we find a better way to handle them than carefully trodding through limits and trying to cancel them out?


Also, I wonder if the quantum mechanical version of entropy is more easily understood in terms of being a description ‘relative to a uniform distribution’, like I did above. Because, uh, everyone seems pretty happy just throwing their hands up when they see negative values for differential entropy without trying to interpret them. Ah well.

  1. A function that acts on functions. It’s common to write them with square brackets, \(H[]\) instead of \(H()\), to remind of this. The expectation value \(\bb{E}[]\) is another functional.