Some Intuition Around Entropy

[February 23, 2018]

(Only interesting if you already know some things about information theory, probably)
(Disclaimer: Notes. Don’t trust me, I’m not, like, a mathematician.)

I have been reviewing concepts from Information Theory this week, and I’ve realized that I never quite really understood what (Shannon) Entropy was all about.

Specifically: I have finally understood how entropy is not a property of probability distributions per se, but a property of streams of information. When we talk about ‘the entropy of a probability distribution’, we’re implicitly talking about the stream of information produced by sampling from that distribution. Some of the equations make a lot more sense when you keep this in mind.

1 Entropy

Recall that Information measures ‘the number of bits required to store data’. For example, an 8-bit number takes 8 bits of data to store or transmit. A binary digit, or the outcome of a coin flip encoded as a binary digit, takes 1 bit of data. The unit of bits is just a unit (corresponding to logarithms base-2); you can easily use others.

Meanwhile, Entropy is the average amount of Information which is learned by learning an unknown value, such as the outcome of a random variable that’s selected from a probability distribution. It’s a functional1 which maps a probability distribution to a quantity of Information. It usually takes the symbol in information theory or in physics. I’ll use .

The formula for the entropy of a discrete probability distribution is:

This is better thought of as the ‘expected value of ’:

I claim that somehow captures the amount of information that we learn if we get the result , and so entropy somehow measures the expected value of information when sampling from this distribution.

Note that this equation for entropy is always positive. The negative sign is misleading: since is always less than 1, . It would probably be better to always write to emphasize this, but that’s not usually done. Also, since , because , 0-probability events never happen and contribute nothing to entropy.

Some examples: the entropy of a fair coin flip is, as expected, 1 bit:

The entropy of a biased coin that has is:

This means, intuitively, that as we make a sequence of outcomes from this unfair coin (with Hs), it will be possible to compress that sequence into a sequence of bits, since that is another way to communicate the same volume of information.

2 Microstates vs Macrostates

The discrete entropy formula is usually derived from some axioms about how it should behave, which I find useless for intuition. Instead, here’s the easiest way to see way to see why it’s defined that way.

Imagine a space of outcomes of some event, like “the results of fair coin flips”, which we will describe using two different descriptions, one of macrostates which describes how many and were seen, and one of microstates, which describes the actual underlying sequence of results.

  • At the microstate level, the system has equiprobable outcomes, which takes bits of data to specify an outcome: the exact value of each coin flip.
  • At the macrostate level, there is a probability distribution that captures the chance of getting exactly heads, given by .

Specifically, the chance of Heads is:

Let’s write the numerator as , so . Thus, each macrostate “ Heads” corresponds to possible microstates, out of the total possible microstates, with . It’s clear that has to equal for this to be a probability distribution, and it does.

In this case we can clearly write down the entropy of learning a specific microstate (it’s ) or any particular macro state (it’s ). Thus, if we learn a macrostate , we have learned bits of information, and there are bits left which we still do not know. So the entropy of the macrostate distribution is the difference:

This is basically the idea. The underlying distrubution takes bits to specify, and after various outcomes we are left with, on average, bits left to learn, so we learn, on average, bits. Basically:

We see that the quantity term can be understood as: “if you tell me we’re in state x, I still have to learn bits of data to get to the exact microstate, meaning that you just told me bits of data.” (This quantity is called the self-entropy or “surprisal” of outcome .)

This shows that the two s in the usual expression of entropy mean different things. Really, the usual formula is just a particular expression for , where is an event that can happen with probability . I think it should be viewed as incidental that the expression for happens to include also: there’s nothing about that requires the information come in the form of probabilistic events. There’s a probability distribution for how much information we learn, but it doesn’t matter where that information comes from.

For instance: if I just decide to tell you 1 bit of information with probability and 100 bits with probability ”, you can still calculate an entropy bits. It doesn’t matter that those bits themselves come from occurences of events with probabilities. It’s just any information.

For a more interesting example: there’s no requirement that the macrostate descriptions be exclusive, as probability distributions are. If I flip 100 coins, maybe I’ll tell you “whether I got more than 50 heads” (1 bit of information) and “the exact sequence of results” (100 bits of information) with equal probability. That’s still bits of data, but it’s not a probability distribution over the set of possible macrostate descriptions – some of the macrostates overlap.

(This is well-known but I had not really understood it before today, and I think it could use more emphasis. In fact it was Shannon’s original model anyway.)

I think it is appropriate to view entropy not as “the data required to encode a stream of events”, but “the data required to re-encode a stream into another language”:

  • tells us how much information we learn from moving from the ‘trivial language’ (“something happened!”) to another language (“there are heads”).
  • is the remaining information required to move from a macrostate language to a microstate one (“the exact sequence was…”).
  • there is no reason we couldn’t change together more description languages.
  • the unit of “bits” means “the data required to re-encode this sequence of labeled events into binary code”.

This interpretation becomes important when we try to generalize our equation for entropy to continuous distributions, because we can’t encode a stream of infinite-precision events perfectly.

3 The Continuous Analog of Entropy

The naive way to generalize to a continuous distribution would be

But we realize that this can’t be right, using the intuition from above. The real definition is , and it’s no longer true that when is continuous. The probability that equals an exact value is , and so we would get that ! Specifying the exact value would take ‘infinite’ information.

The problem is that would be promoted to a density function, and we need to use to compute any actual probabilities. Suppose we take as our ‘events’ the occurence of in a range :


Which seems reasonable. But what if ?

if has antiderivative , then , giving:

What does that mean?

Actually, it’s pretty reasonable. Of course it takes infinite information to zoom in completely on a point – we have measured a value with infinite granularity, and thus gained infinite information! It seems as if there are two parts to the information: an infinite part, due to the “information of the continuum”, and a finite part, due to the non-uniformity of the probability distribution, which has the same form that discrete information has.

Consider a uniform distribution on , with density function , it makes sense that “dividing the range in half” takes exactly 1 bit of data (to specify which half):

And formulas like this one work even if specifying the exact value would take infinite data due to the continuum.

Since we can’t write down a finite value for , perhaps we can instead just reduce the distribution for to a function of . We measure “how different is from uniform”, without fully reducing it the information required to specify exact values for . Suppose we partition the space into tiny buckets of width , and then later we will let the bucket size go to 0.

The uniform distribution will contain buckets, and thus have entropy. A uniform distribution would contain buckets, so specifying an exact bucket would take information. Even if , this is true for some sufficiently small bucket size. The first term is unaffected by the partitioning, and the second is . As , this becomes “entropy of the continuum”.

Thus, for a non-uniform distribution , we just need to zoom in until , so that the distribution over each bucket is basically uniform. The information to specify a bucket of with is:

And therefore the entropy of a continuous distribution can be given by:

The first term is the “naive” entropy formula. The second is the “entropy of the continuum”, and must be ignored. Put differently, we can’t truly compute ; what we actually compute is:

This is the entropy associated with “changing languages” into a bunch of tiny buckets of arbitrarily small width. Or, if you prefer, it’s the change of variables which makes by distorting the axis. The same argument seems to work if our variable isn’t continuous, but is just granular at a much smaller length scale than we’re can see. Then we could find a meaningful value for the second term, but we’ll still want to discard it.)

This value is called differential entropy, and it’s not the limit of discrete formula, but as we see it is still meaningful.

Note that differential entropy can be negative, because it’s easy to write down a probability distribution on a continuous space that has less entropy than : for example, any with . I assume that saying distribution has negative entropy relative to distribution means that you encode into efficiently, rather than the other way around!

An important point: the differential entropy is not invariant under changing variables for . Why? Because it’s relative to a uniform distribution on the variable . The differential entropy on another variable may rescale . We would need to also compute the entropy due to rescaling . The most we can say is that:

Which does not mean that .

If this wasn’t so long, I’d go on and talk about relative entropy, which generalizes this idea of the entropy of a distribution relative to another. I also need to learn it first. Maybe another time!

4 Final Thoughts

I think it’s funny that entropy seems so important to us. The entire theory is just, like, ‘the logarithm of probability theory’. If you have two probability spaces that take and possible states, you could write that the compression ratio is , or you could write that the entropy is . If you have two encodings which multiply information density by and , they compose to multiply by , and thus reduce information by . Etc. They really are the same thing.

A lot of this suggests that the the right way to combine probability distributions (or re-encodings) is to multiply their probability distributions, and the reason this uses so many logarithms is that we are pretty good at adding numbers together in expectations (cf integrals, expectation values), and not very good at multiplying them. We’d really like some sort of ‘multiplicative’ expectation value, but we aren’t used to using that so we take logarithms to make everything easier to write!

In an expectation we compute . The ‘multiplicative expectation’ operation, , would presumably multiply everything together weighted by its probability:

On a discrete uniform distribution this is just the geometric mean. For a continuous analog would would need a sort of ‘multiplicative integral’, which are a thing, albeit an obscure one. One strong reason for this is that Bayes’ rule in probability is already a multiplication (of probabilities expressed as odds ), so there are other places where it seems like probability is more naturally expressed in multiplicative terms than additive ones.

Entropy in terms of is just , and we could define everything in this page using , it would just look less familiar. Moreover, I assume there are similar versions for any other generalized mean.

Basically I suspect that we use logarithms to write things in bits, flattening all our ratios out into sums, only because we happen to be better at addition than multiplication. Something to think about.

Also, I wonder if more can be said about exactly how we map languages into each other. When we say a space can be represented by bits, we literally mean it: there exists a function with the property of mapping strings of bits to elements with an average ratio of , and no function can exist that does better than that. But I’m curious about trying to implement these mappings on, say, different kinds of formal languages. For instance – the space of sequences of coin flips until tails clearly is represented by a regular language H*T. Maybe we can make some kind of more correspondence between a state machine of binary strings to a state machine of this sequence as the encoding.

Also, I’m annoyed that the seemingly coherent concept of differential entropy makes yet another example where it seems that our fear of infinities is a real problem. Like, can’t we find a better way to handle them than carefully trodding through limits and trying to cancel them out?

Also, I wonder if the quantum mechanical version of entropy is more easily understood in terms of being a description ‘relative to a uniform distribution’, like I did above. Because, uh, everyone seems pretty happy just throwing their hands up when they see negative values for differential entropy without trying to interpret them. Ah well.

  1. A function that acts on functions. It’s common to write them with square brackets, instead of , to remind of this. The expectation value is another functional.