Some Intuition For Entropy

February 23, 2018

(Only interesting if you already know some things about information theory, probably)
(Disclaimer: Notes. Don’t trust me, I’m not, like, a mathematician.)

I have been reviewing concepts from Information Theory this week, and I’ve realized that I never quite understood what Shannon Entropy was all about.

Specifically: entropy is not a property of probability distributions per se, but a property of translating between different representations of a stream of information. When we talk about “the entropy of a probability distribution”, we’re implicitly talking about the entropy of a stream of information that is produced by sampling from that distribution. Some of the equations make a lot more sense when you keep this in mind.

1. Entropy

Information in the information theoretic sense is a numerical quantity: ‘the number of bits required to store or transmit some data’. For example, an integer between 0 and 255 takes 8 bits of data to store or transmit, because 8 binary digits can identify which integer in that range you’re talking about. A single binary digit, or the outcome of a coin flip encoded as a binary digit, takes 1 bit of data.

If a single variable can take \(n\) possible values, it takes \(\log_2 (n)\) bits to represent. ‘Bits’ is just a unit, and you can easily use others, which corresponds to using \(\ln (n)\) (‘nats’) or \(\log_{10} (n)\) (‘dits’ or ‘hartleys’ for some reason). (In this article all logarithms will be base \(2\), though). As with most other values in math, it turns out to be totally possible to talk about fractional bits, even if that doesn’t initially seem sensible. A value can take \(1.5\) bits to represent, although making sense of that takes a bit of work. A value that takes an integer number of bits will take an irrational number of dits or nats, though, so it is at least clear that the integer/non-integer status of the number has more to do with the choice of units than the data itself.
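As a quick sanity check on the units (a minimal sketch; \(n = 256\) is just an example), the same count of possibilities can be expressed in bits, nats, or dits, and the units differ only by constant factors:

```python
import math

n = 256  # number of equally likely values

bits = math.log2(n)   # 8.0 bits
nats = math.log(n)    # ~5.545 nats
dits = math.log10(n)  # ~2.408 dits (hartleys)

# Same quantity in different units: 1 nat = log2(e) bits, 1 dit = log2(10) bits.
print(bits, nats * math.log2(math.e), dits * math.log2(10))  # all ~8.0
```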

Entropy is often confused with information but it’s slightly different: it is the average amount of information gained by learning an unknown value from a source, such as the outcome of a random variable sampled from a probability distribution. It’s a functional which maps a probability distribution to a quantity of information (so it is written with square brackets, like \(H[p(x)]\)). It usually takes the symbol \(H\) in information theory or \(S\) in physics. I’ll use \(H\).

The standard formula for the entropy of a discrete probability distribution \(p(x)\) is:

\[H[p] = -\sum_{x} p(x) \log p(x)\]

This is, however, obtuse. It is better thought of as the “expected value of \(-\log p(x)\)”:

\[H[p] = \bb{E}[- \log p(x)]\]

The quantity \(- \log p(x)\) captures the amount of information that comes from seeing a value \(x\) that occurs with probability \(p(x)\), so entropy measures the expected amount of information gained by sampling from the distribution. I prefer to get rid of the negative sign by folding it into the logarithm:

\[H[p] = \bb{E}[\log \frac{1}{p(x)}]\]

Either way you express it, it is not that obvious why this \(\log \frac{1}{p}\) quantity represents the information in an event. I’ll go into that in the next section. In the meantime, note that since \(p(x)\) is always \(\leq 1\), \(\frac{1}{p(x)}\) is \(\geq 1\) and \(\log \frac{1}{p(x)}\) is always \(\geq 0\). For the case where \(p(x) = 0\), that is, for events that never happen, the contribution to entropy is zero because the limit ends up canceling it out (due to \(\log N\) growing slower than \(N\)):

\[0 \log 0 = \lim_{p\ra 0^{+}} p \log p = \lim_{N \ra \infty} \frac{1}{N} \log \frac{1}{N} = \lim_{N \ra \infty} \frac{-\log N}{N} = 0\]

Here are a couple examples:

The entropy of a fair coin flip is, as expected, \(1\) bit:

\[H = - p(\tt{H}) \log p(\tt{H})- p(\tt{T}) \log p(\tt{T}) = -\frac{1}{2} \log \frac{1}{2} -\frac{1}{2} \log \frac{1}{2} = 1\]

The entropy of a biased coin that has \(p(\tt{H}) = \frac{3}{4}\) is:

\[\begin{aligned} H &= - \frac{3}{4} \log \frac{3}{4} - \frac{1}{4} \log \frac{1}{4} \\ &= - \frac{3}{4} (\log 3 - \log 4) - \frac{1}{4} (\log 1 - \log 4) \\ &= - \frac{3}{4} (\log 3) + 2 \\ &\approx 0.81 \text{ bits} \end{aligned}\]
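If you want to check these numbers yourself, here’s a minimal sketch of the formula in Python (the `entropy` helper is just my own, nothing standard):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: the expected value of log2(1/p)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))  # 1.0 bit    (fair coin)
print(entropy([3/4, 1/4]))  # ~0.811 bits (biased coin)
```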

The simplest interpretation of the meaning of a “fractional bit” in this case is this: as we record a sequence of \(N \ra \infty\) flips of this unfair coin (with roughly \(\frac{3}{4}N\) heads), it becomes possible to compress that sequence into roughly \(0.81N\) binary digits, since that is another way to communicate the same information. Roughly, you would use some alternate representation than raw heads+tails, one that builds in the fact that heads occur more often and saves space by giving the likelier outcomes shorter descriptions (a concrete, if crude, scheme is sketched below). In any case, asymptotically there is a way to store the results of 1000 flips of a \(p(\tt{H}) = \frac{3}{4}\) coin in approximately \(810\) binary digits.
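Here’s one concrete (and deliberately crude) way to beat 1 bit per flip, just to make the claim plausible: encode pairs of flips with a short prefix code that gives the likeliest pair the shortest codeword. This is my own illustration, not the optimal scheme; coding over longer blocks (or using arithmetic coding) gets arbitrarily close to the 0.81-bit limit.

```python
import random

# Prefix code over *pairs* of flips of a p(H) = 3/4 coin: likelier pairs
# get shorter codewords, so the average cost drops below 1 bit per flip.
code = {"HH": "0", "HT": "10", "TH": "110", "TT": "111"}

random.seed(0)
flips = "".join(random.choices("HT", weights=[3, 1], k=100_000))
encoded = "".join(code[flips[i:i + 2]] for i in range(0, len(flips), 2))

print(len(encoded) / len(flips))  # ~0.84 bits per flip, vs. the 0.81-bit entropy
```

The expected cost of this code works out to \(\frac{9}{16} \cdot 1 + \frac{3}{16} \cdot 2 + \frac{3}{16} \cdot 3 + \frac{1}{16} \cdot 3 = \frac{27}{16} \approx 1.69\) bits per pair, or about \(0.84\) bits per flip.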

The entropy of a uniform distribution over \(N\) values is \(\log_2 N \text{ bits}\):

\[H = \sum_{x=1}^{N} p(x) \log_2 \frac{1}{p(x)} = \sum_{x=1}^{N} \frac{1}{N} \log_2 N = \log_2 N\]

For instance a uniform distribution over \(32\) values has entropy \(\log_2 32 = 5\) bits, corresponding to the fact that the \(32\) values can be enumerated with five binary digits.


2. Microstates and Macrostates

The discrete entropy formula \(H[p] = \bb{E}[\log \frac{1}{p}] = \sum p \log \frac{1}{p}\) is often derived from some axioms about how it should behave, which I find useless for intuition. I prefer instead to define entropy as “expected information”, and then the thing we need to explain is why the information of an event with probability \(p\) would be \(\log \frac{1}{p}\). Here’s what I think is the easiest way to see why it’s defined that way.

To begin, I think the fact that the entropy of a discrete uniform distribution over \(N\) possibilities (where \(N\) is a power of two) is \(H[\mathcal{U}(N)] = \log N\) is unobjectionable. There are \(N\) possible values so the total information to label one is \(\log N\) bits. Fine.

So what we do is we build the non-uniform distribution on top of a much larger uniform distribution \(\mathcal{U}(N)\). Suppose we have some non-uniform probability distribution \(p(x)\) of \(K\) total events \(E_1, E_2, \ldots E_K\). We imagine that there is really some huge number \(N \gg K\) of total possible events \(y \in \{1, 2, \ldots N\}\) which are uniformly distributed, with probability \(q(y) = \frac{1}{N}\) each, where we call the first \(p(E_1) N\) of them \(E_1\), the second \(p(E_2) N\) of them \(E_2\), etc.

For instance, flipping a coin is a distribution over two values, \(\{ \tt{H}, \tt{T} \}\), but under the hood there are however many gazillion possible ways to flip it at a physical level (up to the precise motion of the coin and the air and the vibrations at the end, etc) which give those two results. So we think of the non-uniform distribution \(\{ p, 1 - p \}\) on the two values as “really” being implemented by a uniform distribution over \(N\) values with roughly \(pN\) of them turning out \(\tt{H}\) and roughly \((1-p)N\) of them turning out \(\tt{T}\).

Of course the values \(p(x) N\) may not necessarily be integers, but if we pick \(N\) large enough then it won’t matter. We can treat them all as being carefully rounded to nearby integers, such that \(\sum p(x) N = N\) (instead of being off by one or something), and for large enough \(N\) it won’t make a difference.

We call the \(p(x)\) distribution the “macrostates”, which take values in \(E_1, E_2, \ldots\), and the \(q(y)\) distribution the “microstates”, which take any of the \(N\) possible underlying values (to borrow some terminology from physics). Within each macrostate \(x\) there are \(p(x) N\) uniformly distributed microstates, so identifying one microstate within a macrostate takes

\[\log (p(x) N) = \log p(x) + \log N \text{ bits}\]

Note that the \(p(x)\) are less than \(1\) so the \(\log p(x)\) are negative.

Now we reason:

Identifying one microstate out of the \(N\) is \(\log N\) bits of information. Once we know we’re in macrostate \(x\), identifying a microstate takes another \(\log (p(x) N) = \log p(x) + \log N\) bits of information. Hence, learning that we’re in macrostate \(x\), which has probability \(p(x)\), ought to be \(\log N - (\log p(x) + \log N) = - \log p(x)\) bits of information. This is called the “self-information” or “surprisal” \(I(x)\) of the state \(x\): learning that particular state conveys \(I(x)\) bits of information. The entropy is the average information learned from any of the \(x\), so it’s

\[H[p(x)] = \bb{E}[I(x)] = \bb{E}[- \log p(x)] = \sum p(x) \log \frac{1}{p(x)}\]

which gives the entropy of a non-uniform distribution, or a uniform distribution over a non-power-of-two number of values.
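Here’s a small numerical check of that reasoning (the three-event distribution is just a made-up example, and for simplicity I don’t round the counts \(p(x)N\) to integers, which for large \(N\) only matters up to rounding error):

```python
import math

p = {"A": 0.5, "B": 0.3, "C": 0.2}  # macrostate probabilities (made-up example)
N = 2 ** 20                         # number of underlying microstates

# Surprisal of a macrostate: (bits to name a microstate out of all N)
# minus (bits to name a microstate once the macrostate is known).
surprisal = {x: math.log2(N) - math.log2(px * N) for x, px in p.items()}

H_micro = sum(px * surprisal[x] for x, px in p.items())
H_formula = sum(px * math.log2(1 / px) for px in p.values())
print(H_micro, H_formula)  # both ~1.485 bits
```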

Now, the trick here is that we glossed over the details of what happens if the distribution doesn’t divide evenly into \(N\). But I think that’s fine. In reality you can’t tell the difference between a non-uniform distribution over a few values and a uniform distribution over some massive number of underlying microstates. If there are \(2^{50}\) ways that a coin flip can turn out in the low-level physics, you don’t really care if the number of them that count as heads is precisely \(2^{49}\) versus off by a bit: you can always pick \(N\) large enough that the resulting ratios are within rounding errors. Fine by me.

So at a high-level the derivation looks like this:

\[\begin{aligned} H[p(x)] &= \bb{E}[\langle \text{information learned} \rangle] \\ &= \langle \text{total information} \rangle - \bb{E}[\langle \text{remaining unknown information} \rangle] \end{aligned}\]

Which you can also write as

\[H[p(x)] = \bb{E}[\langle \text{total information} \rangle - \langle \text{remaining unknown information} \rangle] = \bb{E}[\text{self-information}]\]

This shows that the two \(p\)s in the usual expression of entropy mean different things. Really, the usual formula is just a particular expression for \(\bb{E}[I(x)]\), where \(x \in X\) is an event that happens with probability \(p(x)\). I think it should be viewed as incidental that the expression for \(I(x)\) happens to include \(p(x)\) also: there’s nothing about \(H = \bb{E}[\text{information}]\) that requires the information come in the form of probabilistic events. There’s a probability distribution for how much information we learn, but it doesn’t matter where that information comes from.

For instance: if I just decide to tell you 1 bit of information with probability \(\frac{1}{2}\) and 100 bits with probability \(\frac{1}{2}\), you can still calculate an entropy \(H = \frac{1}{2}(1 + 100) = 50.5\) bits. It doesn’t matter that those bits themselves come from occurrences of events with probabilities. It’s just any information.

For a more interesting example: there’s no requirement that the macrostate descriptions be exclusive, as probability distributions are. If I flip 100 coins, maybe I’ll tell you “whether I got more than 50 heads” (1 bit of information) or “the exact sequence of results” (100 bits of information), with equal probability. That’s still \(50.5\) bits of data on average, but it’s not a probability distribution over the set of possible macrostate descriptions—some of the macrostates overlap.

(This is well-known but I had not really understood it before today, and I think it could use more emphasis. In fact it was Shannon’s original model anyway.)

I think it is appropriate to view entropy not as “the data required to encode a stream of events”, but “the data required to re-encode a stream into another language”:

  • \(H[p(x)]\) tells us how much information we learn from moving from the ‘trivial language’ (“\(M\) coin flips happened”) to another language (“\(M\) coins were flipped and \(x\) were heads”).
  • \(I(x)\) is the remaining information required to move from a macrostate language to a microstate one (“the exact sequence was…”).
  • there is no reason we couldn’t chain together more description languages.
  • the unit of “bits” means “the data required to re-encode this sequence of labeled events into binary code”.

This interpretation is especially helpful when we try to understand what entropy means for continuous distributions, because we can’t encode a stream of infinite-precision events perfectly.


3. The Continuous Analog of Entropy

The naive way to generalize \(- \sum_x p(x) \log p(x)\) to a continuous distribution \(p(x)\) would be

\[H[p] \stackrel{?}{=} - \int_{\bb{X}} p(x) \log p(x) dx\]

But using the intuition from above, we can see that this can’t be right. The real definition is \(\bb{E}[I(x)]\), and it’s no longer true that \(I(x) = - \log p(x)\) when \(x\) is continuous. The probability that \(x\) equals an exact value is \(0\), and so we would get that \(I(x) = - \log 0 = \infty\)! Specifying the exact value would take ‘infinite’ information.

The problem is that \(p(x)\) has been promoted to a density function, and we need to use \(p(x) dx\) to compute any actual probabilities. Suppose we take as our ‘events’ the occurrence of \(x\) in a range \((a,b)\):

\[P[x \in (a,b)] = \int_{(a,b)} p(x) dx\]

Thus:

\[I(x \in (a,b)) = - \log \int_{(a,b)} p(x) dx\]

Which seems reasonable. But what if \((a,b) \ra (a,a)\)?

\[\lim_{\e \ra 0} - \log \int_{(a, a + \e)} p(x) dx\]

Since \(\int_{(a,a+\e)} p(x) dx \approx \e \, p(a)\) for small \(\e\), we get:

\[I(x \in (a,a)) \stackrel{?}{=} - \log (\e \, p(a)) = \log \frac{1}{\e} - \log p(a) = \infty - \log p(a)\]

What does that mean?

Actually, it’s pretty reasonable. Of course it takes infinite information to zoom in completely on a point \(a\)—we have measured a value with infinite granularity, and thus gained infinite information (I guess it’s encoded in all the decimal places)! It seems as if there are two parts to the information: an infinite part, due to the “information of the continuum”, and a finite part, due to the non-uniformity of the probability distribution, which has the same \(-\log p(x)\) form that discrete information has.

Consider a uniform distribution \(\cal{U}(0, 1)\) on \((0,1)\), with density function \(1_{x \in (0,1)}\). It makes sense that “dividing the range in half” takes exactly 1 bit of data (to specify which half you’re in), which leaves 1 bit less to specify afterwards:

\[H[\cal{U}(0, \frac{1}{2})] = H[\cal{U}(0, 1)] + \log \frac{1}{2} = H[\cal{U}(0, 1)] - 1\]

And formulas like this one work even if specifying the exact value would take infinite data due to the continuum.


Since we can’t write down a finite value for \(U = H[\cal{U}(0, 1)]\), perhaps we can instead express the entropy of \(p(x)\) as a function of \(U\). We measure “how different \(p(x)\) is from uniform”, rather than the full information required to specify exact values of \(x\).

We partition the space into tiny buckets of width \(\D x\), and then later we will let the bucket size go to 0. The uniform distribution \(\cal{U}(0,1)\) will contain \(\frac{1}{\D x}\) buckets, and thus have entropy \(U(\D x) = - \log \D x = H_{\D x}[\cal{U}(0,1)]\). A uniform distribution \(\cal{U}(a,b)\) would contain \(\frac{b-a}{\D x}\) buckets, so specifying an exact bucket would take \(\log (b-a) - \log(\D x)\) bits of information; even if \(b-a < 1\), so that the first term is negative, the total is positive for a sufficiently small bucket size. The first term is unaffected by the partitioning, and the second is \(U(\D x)\). As \(\D x \ra 0\), the second term becomes the “entropy of the continuum”.

Thus, for a non-uniform distribution \(p(x)\), we just need to zoom in until \(p(x) \approx p(x + \D x)\), so that the distribution over each bucket is basically uniform. The information to specify a bucket of width \(\D x\) is:

\[\begin{aligned} I(x, x+\D x) &= - \log P[x, x+\D x] \\ &= - \log \int_{(x, x + \D x)} p(x') dx' \\ &\approx - \log (p(x) \D x) \\ &= - \log p(x) + U(\D x) \end{aligned}\]

And therefore the entropy of a continuous distribution can be given by:

\[\begin{aligned} H[p] &= \bb{E}[I(x)] \\ &= \int p(x) (- \log p(x) + U(\D x)) dx \\ &= - \int p(x) \log p(x) dx + U \\ &= - \int p(x) \log p(x) dx + H[\cal{U}(0,1)] \end{aligned}\]

The first term is the “naive” entropy formula. The second is the “entropy of the continuum”, and must be ignored. Put differently, we can’t truly compute \(H[p]\); what we actually compute is:

\[- \int p(x) \log p(x) dx = H[p] - H[\cal{U}(0,1)]\]

This is the entropy associated with “changing languages” into a bunch of tiny buckets of arbitrarily small width. Or, if you prefer, it’s the entropy of the change of variables that produces \(p(x)\) from a uniform distribution by distorting the \(x\) axis. The same argument seems to work if our variable isn’t truly continuous, but is just granular at a much smaller length scale than we can see. In that case we could find a meaningful value for the second term, but we’ll still want to discard it.

This value \(- \int p(x) \log p(x) dx\) is called differential entropy, and it’s not the limit of the discrete formula, but as we see it is still meaningful.
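To see this concretely, here’s a sketch (mine; a standard normal is just a convenient example): chop the density into buckets of width \(\D x\) and compute the ordinary discrete entropy of the bucket probabilities. The discrete entropy grows without bound as \(\D x \ra 0\), but subtracting the \(\log \frac{1}{\D x}\) continuum term leaves something that converges to the differential entropy, \(\frac{1}{2}\log_2 (2\pi e) \approx 2.05\) bits for a standard normal:

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

for dx in (0.1, 0.01, 0.001):
    xs = [i * dx for i in range(int(-10 / dx), int(10 / dx))]
    probs = [normal_pdf(x) * dx for x in xs]     # p(x) dx for each bucket
    H_dx = sum(q * math.log2(1 / q) for q in probs if q > 0)
    print(dx, H_dx, H_dx - math.log2(1 / dx))    # last column -> ~2.05
```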

Note that differential entropy can be negative, because it’s easy to write down a probability distribution on a continuous space that has less entropy than \(\cal{U}(0,1)\): for example, any \(\cal{U}(a,b)\) with \((b-a) < 1\). I assume that saying distribution \(A\) has negative entropy relative to distribution \(B\) means that you encode \(B\) into \(A\) efficiently, rather than the other way around!

An important point: the differential entropy is not invariant under changing variables for \(p(x)\). Why? Because it’s relative to a uniform distribution on the variable \(x\). The differential entropy in another variable \(y = f(x)\) is measured relative to a uniform distribution in \(y\), which is a rescaled (or distorted) version of \(\cal{U}(0,1)\); we would also need to account for the entropy of that rescaling. The most we can say is that:

\[H[p(y)] - H[\mathcal{U}(y)] = H[p(x)] - H[\mathcal{U}(x)]\]

Which does not mean that \(H[p(y)] = H[p(x)]\).
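As a quick worked example of my own: take \(x \sim \cal{U}(0,1)\) and change variables to \(y = x/2\), which is uniform on \((0, \frac{1}{2})\) with density \(2\). Then

\[- \int_0^1 1 \log 1 \, dx = 0 \qquad \text{while} \qquad - \int_0^{1/2} 2 \log 2 \, dy = -1\]

so the differential entropy drops by a bit, even though nothing about the underlying information has changed; the uniform reference distribution got squeezed along with \(p(x)\).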

If this wasn’t so long, I’d go on and talk about relative entropy, which generalizes this idea of the entropy of a distribution relative to another. I also need to learn it first. Maybe another time!


4. Final Thoughts

I think it’s funny that entropy seems so important to us. The entire theory is just, like, “the logarithm of probability theory”. If you have two probability spaces that take \(A\) and \(B\) possible states, you could write that the compression ratio is \(R = \frac{A}{B}\), or you could write that the entropy difference is \(\log R = \log A - \log B\). If you have two encodings which multiply information density by \(X\) and \(Y\), they compose to multiply it by \(XY\), and thus change the bit count by \(-\log X - \log Y\). Etc. They really are the same thing.

A lot of this suggests that the right way to combine probability distributions (or re-encodings) is to multiply them, and the reason this uses so many logarithms is that we are pretty good at adding numbers together in expectations (cf. integrals, expectation values), and not very good at multiplying them. We’d really like some sort of ‘multiplicative’ expectation value, but we aren’t used to using that, so we take logarithms to make everything easier to write!

In an expectation \(\bb{E}[x]\) we compute \(\sum x p(x)\). The ‘multiplicative expectation’ operation, \(\bb{G}[x]\), would presumably multiply everything together weighted by its probability:

\[\bb{G}[x] = \prod x^{p(x)}\]

On a discrete uniform distribution this is just the geometric mean. For a continuous analog we would need a sort of ‘multiplicative integral’, which is a thing, albeit an obscure one. Another reason to suspect this framing: Bayes’ rule in probability is already a multiplication (of probabilities expressed as odds, \((a:b) \times (c:d) = (ac : bd)\)), so there are other places where it seems like probability is more naturally expressed in multiplicative terms than additive ones.

Entropy in terms of \(\bb{G}\) is just \(\log \bb{G}[p^{-1}] = \bb{E}[-\log p]\), and we could define everything in this page using \(\bb{G}\), it would just look less familiar. Moreover, I assume there are similar versions for any other generalized mean.
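A quick numerical check of that identity (just a sketch; `E` and `G` are my own helper names):

```python
import math

def E(values, probs):  # ordinary (additive) expectation
    return sum(p * v for p, v in zip(probs, values))

def G(values, probs):  # 'multiplicative expectation': probability-weighted geometric mean
    return math.prod(v ** p for p, v in zip(probs, values))

probs = [3/4, 1/4]
print(math.log2(G([1 / p for p in probs], probs)))  # ~0.811
print(E([math.log2(1 / p) for p in probs], probs))  # ~0.811, the entropy H
```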

Basically I suspect that we use logarithms to write things in bits, flattening all our ratios out into sums, only because we happen to be better at addition than multiplication. Something to think about.


Also, I wonder if more can be said about exactly how we map languages into each other. When we say a space \(P\) can be represented by \(H[P]\) bits, we literally mean it: there exists an invertible mapping between elements and strings of bits with an average length of \(H[P]\) bits per element, and no mapping can exist that does better than that. But I’m curious about trying to implement these mappings on, say, different kinds of formal languages. For instance: the space of sequences of coin flips until tails \(\{ \tt{T}, \tt{HT}, \tt{HHT}, \ldots\}\) is clearly the regular language H*T. Maybe we can make some kind of correspondence between a state machine over binary strings and a state machine over these sequences, and use that as the encoding.


Also, I’m annoyed that the seemingly coherent concept of differential entropy is yet another example where it seems that our fear of infinities is a real problem. Like, can’t we find a better way to handle them than carefully treading through limits and trying to cancel them out?


Also, I wonder if the quantum mechanical version of entropy is more easily understood in terms of being a description “relative to a uniform distribution”, like I did above. Although maybe not. In QM there is a weird thing that happens for e.g. particles in a box where there actually is a “floor” to the entropy: it’s not relative to the continuum at all, because of the Heisenberg limit that says you can’t perfectly know a particle’s position and momentum at the same time. As you get infinite precision in the position coordinate you lose precision in the momentum coordinate, and there turns out to be a floor which is a sort of “absolute unit” of entropy, like a quantum system really only has a certain number of bits of data in it, not infinitely many (which is also why the entropies in statistical mechanics work out to be fixed values). Weird.