<h1>All the Exterior Algebra Operations</h1>
<p><em>Alex Kritchevsky, 2020-10-15 — <a href="http://alexkritchevsky.com/2020/10/15/ea-operations">alexkritchevsky.com</a></em></p>
<p>I’m returning to exterior algebra notes. This time, a reference for all of the many operations that I am aware of throughout the subject. It’s easier to define them all in one place than to spread the definitions over articles that use them.</p>
<p>I will make a point of giving explicit algorithms and an explicit example of each, in the lowest dimension that can still be usefully illustrative.</p>
<p>Warning: very long.</p>
<!--more-->
<hr />
<h2 id="background-on-terminology-and-notations">Background on terminology and notations</h2>
<p>As far as I can tell, the same ideas underlying what I call ‘exterior’ algebra have been developed at least four separate times in four notations. A rough history is:</p>
<p>Grassmann developed the original ideas in the ~1840s, particularly in his <em>Ausdehnungslehre</em>, which unfortunately was never very well popularized, particularly because linear algebra hadn’t really been developed yet. Grassmann’s goal was, roughly, to develop ‘synthetic geometry’: geometry without any use of coordinates, where all of the operations act on abstract variables.</p>
<p>Some of Grassmann’s ideas made it into projective geometry, where multivectors are called ‘flats’ (at least in one book I have, by Stolfi) and typically handled in projective coordinates (in which the point \((x,y)\) is represented by any value \((\lambda x, \lambda y, \lambda)\)). Some ideas also made it into algebraic geometry, and there is some overlap with ‘algebraic varieties’; I don’t know much about this yet.</p>
<p>Cartan and others developed the theory of differential forms in the 1920s and included a few parts of Grassmann’s exterior algebra, which got the basics included in most algebra texts thereafter. Physicists adopted the differential forms notation for handling curved spaces in general relativity, so they got used to wedge products there. But most of vector calculus was eventually based on Hamilton’s quaternions from the ~1840s, simplified into its modern form by Heaviside in the ~1880s.</p>
<p>In the 1870s Clifford combined Hamilton and Grassmann’s ideas into ‘Clifford Algebras’, but they were largely forgotten in favor of quaternions and later vector analysis. Dirac accidentally re-invented Clifford algebras in the 1920s with the Dirac/gamma matrices in relativistic QM. Hestenes eventually figured this out and did a lot of work to popularize his ‘Geometric Algebra’ starting in the 1960s, and a small but vocal group of mostly physicists has been pushing for increased use of multivectors / GA since then. More on this later.</p>
<p>Rota and his students also discovered Grassmann at some point (the 1960s as well, I think?) and developed the whole theory again as part of what they called ‘invariant theory’, in which they called multivectors ‘extensors’. They have a lot of good ideas but their notations largely suck. Rota and co. also overlapped into ‘matroid’ theory, which deals with the abstract notion of linear dependence and so ends up using a lot of the same ideas.</p>
<p>So “multivectors”, “extensors”, and “flats” (and “matroids” in the context of real vector spaces) (and “varieties” in some cases?) basically are all the same thing. “Exterior product”, “wedge product”, “progressive product”, and “join” are all the same operation.</p>
<p>For the most part I greatly prefer notations and terminology based on vector algebra, so I stick with “multivector” and translate other things where possible. However, it is undeniable that the best name for the exterior product is the <strong>join</strong>, and its dual is the <strong>meet</strong>.</p>
<p>Everyone also picks their choice of scalar coefficients differently. I always pick the one that involves the fewest factorial terms, and I don’t care about making sure the choices generalize to finite fields.</p>
<p>Unfortunately, Cartan and the vector analysis folks definitely got the symbol \(\^\) for the exterior product wrong. Projective geometers and Rota got it right: it should be \(\vee\), rather than \(\^\). Join is to vector spaces what union is to sets, and union is \(\cup\). Meet (discussed below) is analogous to \(\cap\). (And linear subspaces form a lattice, which already uses the symbols \(\^\) and \(\v\) this way, plus the terminology ‘join’ and ‘meet’!)</p>
<p>I’m going to keep using \(\^\) for join here for consistency with most of the literature, but it’s definitely wrong, so here’s an open request to the world:</p>
<p><strong>If you ever write a textbook using exterior algebra that’s going to be widely-read, please fix this notation for everyone by swapping \(\^\) and \(\v\) back. Thanks.</strong></p>
<hr />
<h2 id="note-on-duality">Note on duality</h2>
<p>Since I am mostly concerned with eventually using this stuff for physics, I can’t ignore the way physicists handle vector space duality. The inner product of vectors is defined only between a vector and its dual, and contraction is performed using a metric tensor, so \(g: V \o V^* \ra \bb{R}\). In index notation this means you always pair a lower index with an upper one: \(\b{u} \cdot \b{v} = u_i v^i\).</p>
<p>However, I think most of this should be intuitive even on plain Euclidean space with an identity metric, so I prefer first presenting each equation with no attention paid to duality, then a version with upper and lower indices. I’ll mostly avoid including a metric-tensor version for space, but it can be deduced from the index-notation version.</p>
<p>An added complication is that there is an argument to be made that use of the dual vector space to define the inner product is a <em>mistake</em>. I am not exactly qualified to say if this is correct or not, but after everything I’ve read I suspect it is. The alternative to vector space duality is to define everything in terms of the volume form, so the inner product is defined by the relation:</p>
\[\alpha \^ \star \beta = \< \alpha, \beta \> \omega\]
<p>With \(\omega\) a choice of pseudoscalar. This means that the choice of metric becomes a choice of <em>volume form field</em>, which is actually pretty compelling. \(\< \alpha, \_ \>\) <em>is</em> a linear functional \(\in V^* \simeq V \ra \bb{R}\), and so counts as the dual vector space. But this can also make it tricky to define \(\star\), since some people think it should map vectors to dual vectors and vice versa.</p>
<p>Another idea is to interpret \(V^*\) as a “-1”-graded vector space relative to \(V\), such that \(\underset{-1}{a} \^ \underset{1}{b} = \underset{0}{(a \cdot b)}\). ‘Dual multivectors’ then have negative grades in general. This often seems like a good idea but I’m not sure about it yet.</p>
<p>Rota’s Invariant Theory school uses yet another definition of the inner product. They define the wedge product in terms of another operation, called a ‘bracket’ \([, ]\), so that \(\alpha \^ \star \beta = [\alpha, \beta] \omega\), but they also seem to treat the pseudoscalar as a regular scalar and so call this an inner product. I don’t think this is the right approach because I’m not comfortable forgetting the difference between \(\^^n \bb{R}\) and \(\bb{R}\), although as above I do like the idea of the volume form as defining the inner product. (They call the whole space equipped with such a bracket a ‘Peano space’. I don’t think the name caught on.)</p>
<hr />
<h2 id="1-the-tensor-product-o">1. The Tensor Product \(\o\)</h2>
<p>We should briefly mention the tensor product first. \(\o\) is the ‘free multilinear product’ on vector spaces. Multilinear means that \(u \o v\) is linear in both arguments: \((c_1 u_1 + c_2 u_2) \o v = c_1 (u_1 \o v) + c_2 (u_2 \o v)\), etc. <a href="https://en.wikipedia.org/wiki/Free_object">Free</a> means that any other multilinear product defined on vector spaces factors through \(\o\). Skipping some technicalities, this means if we have some other operation \(\ast\) on vectors which is multilinear in its arguments, then there is a map \(f\) with \(a \ast b = f(a \otimes b)\).</p>
<p>‘Free’-ness is generally a useful concept. \(\^\) happens to be the free <em>antisymmetric</em> multilinear product, so any other antisymmetric operation on the tensor algebra factors through \(\^\). There are ‘free’-r products than \(\o\) as well, if you let go of multilinearity and associativity.</p>
<p>\(\o\) acting on \(V\) (a vector space over \(\bb{R}\)) produces the ‘tensor algebra’ \(\o V = \bb{R} \oplus V \oplus V^{\o 2} \oplus \ldots\), with \(\o\) as the multiplication operation. There is a canonical inner product on any \(V^{\o n}\) inherited from \(V\)’s: \(\< \b{a} \o \b{b}, \b{c} \o \b{d} \> = \< \b{a}, \b{c} \> \< \b{b} , \b{d} \>\).</p>
<hr />
<h2 id="2-the-exterior-product-">2. The Exterior Product \(\^\)</h2>
<p>The basic operation of discussion is the exterior product \(\alpha \^ \beta\). Its most general definition is via the quotient of the tensor algebra by the relation \(x \o x \sim 0\) for all \(x\). Specifically, the exterior <em>algebra</em> is the algebra you get under this quotient; the exterior <em>product</em> is the behavior of \(\o\) under this algebra homomorphism.</p>
<p>Given a vector space \(V\) and tensor algebra \(\o V\), we define \(I\) as the ideal generated by elements of the form \(x \o x\) (so, for instance, any tensor which contains the same vector twice). Then:</p>
\[\^ V \equiv \o V / I\]
<p>Elements in this quotient space are multivectors like \(\alpha \^ \beta\), and \(\o\) maps to the \(\^\) operation. If \(\pi\) is the canonical projection \(\o V \ra \o V / I\):</p>
\[\pi(\alpha) \^ \pi(\beta) \equiv \pi(\alpha \o \beta)\]
<p>In practice, you compute the wedge product of multivectors by just appending them, as the product inherits associativity from \(\o\) (with \(\| \alpha \| = m, \| \beta \| = n\)):</p>
\[\alpha \^ \beta = \alpha_1 \^ \ldots \^ \alpha_{m} \^ \beta_1 \^ \ldots \^ \beta_n\]
<p>There are several standard ways to map a wedge product back to a tensor product (reversing \(\pi\), essentially, so we’ll write it as \(\pi^{-1}\) although it is not an inverse). One is to select <em>any</em> valid tensor:</p>
\[\pi^{-1}(\alpha_1 \^ \ldots \^ \alpha_n) \stackrel{?}{=} \alpha_1 \o \ldots \o \alpha_n\]
<p>More useful, however, is to map the wedge product to a totally antisymmetrized tensor:</p>
\[\pi^{-1} \alpha = K \sum_{\sigma \in S_{m}} \sgn(\sigma) \alpha_{\sigma(1)} \o \ldots \o \alpha_{\sigma(m)}\]
<p>Where \(\sigma\) ranges over the permutations of \(m\) elements. This has \(m!\) terms for a basis vector \(\in \^^m \bb{R}^n\) (a more complicated formula with \({n \choose m}\) terms is needed for general elements of \(\^^m \bb{R}^n\) – but you can basically apply the above for every component). It is impractical for algorithms but good for intuition. \(K\) is a constant that is chosen to be either \(1\), \(\frac{1}{m!}\), or \(\frac{1}{\sqrt{m!}}\), depending on the source. I prefer \(K=1\) to keep things simple. Here’s an example:</p>
\[\pi^{-1}(\b{x} \^ \b{y}) = \b{x} \o \b{y} - \b{y} \o \b{x}\]
<p>Antisymmetric tensors that appear in other subjects are usually supposed to be multivectors. Antisymmetrization is a familiar operation in Einstein notation:</p>
\[\b{a} \^ \b{b} \^ \b{c} \equiv a_{[i} b_j c_{k]} = \sum_{\sigma \in S_3} \sgn(\sigma) a_{\sigma(i)} b_{\sigma(j)} c_{\sigma(k)}\]
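To make the \(K = 1\) antisymmetrization concrete, here’s a quick NumPy sketch (the helper names <code>perm_sign</code> and <code>wedge_tensor</code> are made up for illustration):

```python
from itertools import permutations
import numpy as np

def perm_sign(p):
    """Sign of a permutation given as a tuple of 0..k-1."""
    p, sign = list(p), 1
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]
            p[i], p[j] = p[j], p[i]
            sign = -sign
    return sign

def wedge_tensor(*vectors):
    """pi^{-1}(v1 ^ ... ^ vk): the fully antisymmetrized tensor, with K = 1."""
    out = 0
    for p in permutations(range(len(vectors))):
        term = vectors[p[0]]
        for i in p[1:]:
            term = np.tensordot(term, vectors[i], axes=0)  # tensor product
        out = out + perm_sign(p) * term
    return out

x, y = np.eye(2)
# pi^{-1}(x ^ y) = x (x) y - y (x) x:
assert np.array_equal(wedge_tensor(x, y), np.outer(x, y) - np.outer(y, x))
# Wedging a repeated vector kills the whole thing:
assert np.array_equal(wedge_tensor(x, x), np.zeros((2, 2)))
```

Note the \(m!\) terms: this is exactly why the formula is impractical as an algorithm but fine for sanity checks.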
<p>Other names:</p>
<ul>
<li>“Wedge product”, because it looks like a wedge</li>
<li>“Progressive Product” (by Grassmann and Gian-Carlo Rota). ‘Progressive’ because it increases grades.</li>
<li>“Join”, in projective geometry and lattice theory. So-called because the wedge product of two vectors gives the linear subspace spanned by them, if it is non-zero.</li>
</ul>
<p>As mentioned above, the symbol for ‘join’ in other fields is \(\vee\). Exterior algebra has it backwards. It’s definitely wrong: these operations in a sense generalize set-theory operations, and \(\^\) should correspond to \(\cup\).</p>
<hr />
<h2 id="3-the-inner-product--">3. The Inner Product \(\<, \>\)</h2>
<p>The multivector inner product is written \(\alpha \cdot \beta\) or \(\< \alpha, \beta \>\), and is defined when \(\alpha\) and \(\beta\) have the same grade.</p>
<p>There are several definitions that disagree on whether it should have any scaling factors like \(\frac{1}{k!}\), depending on the definition of \(\^\). I think the only reasonable definition is that \((\b{x \^ y}) \cdot (\b{x \^ y}) = 1\). This means that this is <em>not</em> the same operation as the <em>tensor</em> inner product, applied to antisymmetric tensors:</p>
\[(\b{x \^ y}) \cdot (\b{x \^ y}) \neq (\b{x \o y} - \b{y \o x}) \cdot (\b{x \o y} - \b{y \o x}) = 2\]
<p>But it’s just too useful to normalize the magnitudes of all basis multivectors. It avoids a lot of \(k!\) factors that would otherwise appear everywhere.</p>
<p>To compute, either antisymmetrize <em>both</em> sides in the tensor representation and divide by \(k!\), or just antisymmetrize one side (either one):</p>
\[\begin{aligned}
(\b{a \^ b}) \cdot (\b{c \^ d}) &= \frac{1}{2!}(\b{a \o b} - \b{b \o a}) \cdot (\b{c \o d} - \b{d \o c}) \\
&= (\b{a \o b}) \cdot (\b{c \o d} - \b{d \o c}) \\
&= (\b{a \cdot c}) (\b{b \cdot d}) - (\b{a \cdot d}) (\b{b \cdot c})
\end{aligned}\]
<p>This also gives the coordinate form:</p>
\[(\b{a \^ b}) \cdot (\b{c \^ d}) = a_i b_j c^{[i} d^{j]} = a_i b_j (c^i d^j - c^j d^i)\]
<p>Or in general:</p>
\[\< \alpha, \beta \> = \< \bigwedge \alpha_i, \bigwedge \beta_j \> = \det(\alpha_i \cdot \beta_j)\]
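For decomposable multivectors, this determinant formula is the easiest thing to check numerically. A sketch (<code>blade_inner</code> is a made-up name):

```python
import numpy as np

def blade_inner(alphas, betas):
    """<a1 ^ ... ^ ak, b1 ^ ... ^ bk> = det of the Gram matrix (a_i . b_j)."""
    gram = np.array([[np.dot(a, b) for b in betas] for a in alphas])
    return np.linalg.det(gram)

x, y = np.eye(2)
assert np.isclose(blade_inner([x, y], [x, y]), 1.0)  # <x^y, x^y> = 1

# (a^b) . (c^d) = (a.c)(b.d) - (a.d)(b.c):
a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
c, d = np.array([0.0, 1.0]), np.array([1.0, 1.0])
assert np.isclose(blade_inner([a, b], [c, d]),
                  a @ c * (b @ d) - a @ d * (b @ c))
```

The \(2 \times 2\) case of the determinant is exactly the \((\b{a \cdot c})(\b{b \cdot d}) - (\b{a \cdot d})(\b{b \cdot c})\) expansion above.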
<hr />
<h2 id="4-the-interior-product-cdot">4. The Interior Product \(\cdot\)</h2>
<p>The interior product is the ‘curried’ form of the inner product:</p>
\[\< \b{a} \^\alpha, \beta \> = \< \alpha, \b{a} \cdot \beta \>\]
<p>This is written as either \(\b{a} \cdot \beta\) or \(\iota_{\b{a}} \beta\). Computation is done by antisymmetrizing the side with the larger grade, then contracting:</p>
\[\b{a} \cdot (\b{b \^ c}) = \b{a} \cdot (\b{b \o c} - \b{c \o b}) = (\b{a} \cdot \b{b}) \b{c} - (\b{a} \cdot \b{c}) \b{b}\]
<p>In index notation:</p>
\[\b{a} \cdot (\b{b \^ c}) = a_i b^{[i} c^{j]} = a_i (b^{i} c^{j} - b^j c^i)\]
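The vector-into-bivector case is small enough to verify directly; a sketch with a hypothetical helper name:

```python
import numpy as np

def interior_vec_biv(a, b, c):
    """a . (b ^ c) = (a.b) c - (a.c) b, for plain Euclidean vectors."""
    return np.dot(a, b) * c - np.dot(a, c) * b

x, y, z = np.eye(3)
assert np.array_equal(interior_vec_biv(x, x, y), y)           # x . (x^y) = y
assert np.array_equal(interior_vec_biv(y, x, y), -x)          # y . (x^y) = -x
assert np.array_equal(interior_vec_biv(z, x, y), np.zeros(3)) # z . (x^y) = 0
```

The ‘insertion’ picture: \(\b{a}\) gets dotted into each slot of \(\b{b \^ c}\) in turn, with alternating signs.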
<p>Other names: the “contraction” or “insertion” operator, because it inserts its left argument into some of the ‘slots’ in the inner product of the right argument.</p>
<p><strong>The two-sided interior product</strong></p>
<p>Normally, in the notation \(\alpha \cdot \beta\), it’s understood that the lower grade is on the left, and the operation isn’t defined otherwise. But some people ignore this restriction, and I’m warming up to doing away with it entirely. I can’t see any reason not to define it to work either way.</p>
<p>When tracking dual vectors we need to be careful about which side ends up ‘surviving’. To be explicit, let’s track which ones we are considering as dual vectors:</p>
\[\b{x}^* \cdot (\b{x} \^ \b{y}) = \b{y} \\
(\b{x}^* \^ \b{y}^*) \cdot \b{x} = \b{y}^*\]
<p>Note that in both cases the vectors contract <em>left-to-right</em>. One vector / dual-vector is inserted into the ‘slots’ of the other dual-vector/vector. In coordinates, these are:</p>
\[\b{a}^* \cdot (\b{b \^ c}) = a_i (b^{[i} c^{j]})\]
\[(\b{b}^* \^ \b{c}^*) \cdot \b{a} = a^i(b_{[i} c_{j]})\]
<hr />
<h2 id="5-the-hodge-star-star">5. The Hodge Star \(\star\)</h2>
<p>\(\star\) produces the ‘complementary subspace’ to the subspace denoted by a multivector. It is only defined relative to a choice of pseudoscalar \(\omega\) – usually chosen to be all of the basis vectors in lexicographic order, like \(\b{x \^ y \^ z}\) for \(\bb{R}^3\). Then:</p>
\[\star \alpha = \alpha \cdot \omega\]
<p>A more common but less intuitive definition:</p>
\[\alpha \^ (\star \beta) = \< \alpha, \beta \> \omega\]
<p>The inner product and Hodge star are defined in terms of each other in various sources. For my purposes, it makes sense to assume the form of the inner product.</p>
<p>In practice, I compute \(\star \alpha\) in my head by finding a set of basis vectors such that \(\alpha \^ \star \alpha = \omega\) (up to a scalar). Explicit example in \(\bb{R}^4\):</p>
\[\star(\b{w} \^ \b{y}) = - \b{x \^ z}\]
<p>because</p>
\[\b{(w \^ y) \^ x \^ z} = - \b{w \^ x \^ y \^ z} = - \omega\]
<p>In Euclidean coordinates, \(\omega\) is given by the Levi-Civita symbol \(\epsilon_{ijk}\), and \(\star \alpha = \alpha \cdot \omega\) works as expected:</p>
\[\star(\b{a} \^ \b{b})_k = \epsilon_{ijk} a^i b^j\]
<p>This is using the convention that the \(\star\) of a vector is a lower-index dual vector. I’ve seen both conventions: some people would additionally map it back to a vector using the metric:</p>
\[\star(\b{a} \^ \b{b})^k = \epsilon_{ij}^k a^i b^j = g^{kl} \epsilon_{ijl} a^i b^j\]
<p>Either convention seems fine as long as you keep track of what you’re doing. They’re both valid in index notation, anyway; the only difference is choosing which is meant by \(\star \alpha\).</p>
<p>It is kinda awkward that \(\omega\) is the usual symbol for the pseudoscalar object but \(\e\) is the symbol with indices. It is amusing, though, that \(\e\) looks like a sideways \(\omega\). I’ll stick with this notation here but someday I hope we could just use \(\omega\) everywhere, since \(\e\) is somewhat overloaded.</p>
<p>\(\star\) is sometimes written \(\ast\), but I think that’s uglier. In other subjects it’s written as \(\star \alpha \mapsto \alpha^{\perp}\) which I do like.</p>
<p>We need a bit of notation to handle \(\star\) in arbitrary dimensions. We index with multi-indices of whatever grade is needed – for the Levi-Civita symbol, we write \(\e_{I}\) where \(I\) ranges over the one value, \(\omega\), of \(\^^n V\) (note: this is different from ranging over <em>every</em> choice of \(I\), with \(n!\) terms. Instead, we index by a single multivector term. It’s a lot easier.) To express contraction with this, we split the index into two multi-indices: \(\e_{I \^ J}\), so \(\star \alpha\) is written like this:</p>
\[(\star \alpha)_{K} = \alpha^I \e_{I K}\]
<p>The implicit sum is over every value of \(I \in \^^{\| \alpha \|} V\).</p>
<p>Note that in general \(\star^2 \alpha = (-1)^{k(n-k)} \alpha\), so \(\star^{-1} \alpha = (-1)^{k(n-k)} \star \alpha\).</p>
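For basis multivectors the whole computation reduces to finding the complementary indices plus a permutation sign. A sketch (hypothetical <code>hodge_star</code> helper; orthonormal basis, lexicographic \(\omega\)):

```python
def perm_sign(p):
    """Sign of a permutation of (0, ..., n-1)."""
    p, sign = list(p), 1
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]
            p[i], p[j] = p[j], p[i]
            sign = -sign
    return sign

def hodge_star(blade, n):
    """star(e_blade) = sign * e_comp, with sign chosen so that
    blade ^ (sign * comp) = +omega.  Blades are sorted tuples of axes."""
    comp = tuple(i for i in range(n) if i not in blade)
    return perm_sign(blade + comp), comp

# R^4 with (w, x, y, z) = (0, 1, 2, 3): star(w^y) = -(x^z),
# because (w^y) ^ (x^z) = -(w^x^y^z) = -omega.
assert hodge_star((0, 2), 4) == (-1, (1, 3))
# R^3: star(x^y) = z and star(y^z) = x.
assert hodge_star((0, 1), 3) == (1, (2,))
assert hodge_star((1, 2), 3) == (1, (0,))
```

This is exactly the ‘compute it in my head’ procedure above, mechanized: pick the complement, then fix the sign.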
<hr />
<h2 id="6-the-cross-product-times">6. The Cross Product \(\times\)</h2>
<p>The cross-product is only defined in \(\bb{R}^3\) and is given by:</p>
\[\b{a} \times \b{b} = \star (\b{a} \^ \b{b})\]
<p>Some people say there is a seven-dimensional generalization of \(\times\), but they’re misguided: the formula above generalizes to every dimension.</p>
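A quick numerical check of \(\b{a} \times \b{b} = \star(\b{a} \^ \b{b})\) against NumPy’s built-in cross product (the \(\e\) tensor here is built by hand):

```python
import numpy as np

# Levi-Civita symbol in R^3:
eps = np.zeros((3, 3, 3))
for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
    eps[i, j, k] = 1.0   # even permutations
    eps[j, i, k] = -1.0  # odd permutations

def cross_via_star(a, b):
    """star(a ^ b)_k = eps_{ijk} a^i b^j."""
    return np.einsum('ijk,i,j->k', eps, a, b)

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
assert np.allclose(cross_via_star(a, b), np.cross(a, b))
```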
<hr />
<h2 id="7-the-partial-trace-cdot_k">7. The Partial Trace \(\cdot_k\)</h2>
<p>In index notation it is common to take a ‘partial trace’ of a tensor: \(c_i^k = a_{ij} b^{jk}\), and sometimes we see a partial trace of an antisymmetric tensor:</p>
\[c_i^k = a_{[ij]} b^{[jk]} = (a_{ij} - a_{ji})(b^{jk} - b^{kj}) = a_{ij} b^{jk} - a_{ji} b^{jk} - a_{ij} b^{kj} + a_{ji} b^{kj}\]
<p>For whatever reason I have never seen a coordinate-free notation for this for multivectors. But it’s actually an important operation, because if we treat bivectors as rotation operators on vectors, it’s how they compose:</p>
\[[(a \b{x} + b \b{y}) \cdot (\b{x \^ y})] \cdot (\b{x \^ y} ) = (a \b{y} - b \b{x}) \cdot (\b{x \^ y}) = - (a \b{x} + b \b{y})\]
<p>Which means that apparently</p>
\[R_{xy}^2 = (\b{x} \^ \b{y}) \circ (\b{x} \^ \b{y}) = -(\b{x \o x} + \b{y \o y})\]
<p>Note that the result <em>isn’t</em> a multivector. In general it’s an element of \(\^ V \o \^ V\).</p>
<p>But it’s still useful. What’s the right notation, though? Tentatively, I propose we write \(\cdot_k\) to mean contracting \(k\) terms together. The choice of <em>which terms</em> is a bit tricky. The geometric product, discussed later, suggests that we should do inner-to-outer. But the way we already handle inner products suggests left-to-right. For consistency let’s go with the latter, and insert \(-1\) factors as necessary.</p>
<p>The partial trace of two multivectors is implemented like this:</p>
\[\alpha \cdot_k \beta = \sum_{\gamma \in \^^k V} (\gamma \cdot \alpha) \o (\gamma \cdot \beta) \in \^ V \o \^ V\]
<p>Where the sum is over unit-length basis multivectors \(\gamma\). Note that this use of \(\o\) is <em>not</em> the multiplication operation in the tensor algebra we constructed \(\^ V\) from; rather, it is the \(\o\) of \(\^ V \o \^ V\). This translates to:</p>
\[[\alpha \cdot_k \beta]_{J K} = \alpha_{IJ} \beta^I_{K} = \delta^{IH} \alpha_{IJ} \beta_{HK}\]
<p>(That \(\delta\) is the identity matrix; recall that indexing it by multivectors \(I, H \in \^^k V\) means to take elements of \(\delta^{\^^k}\) which is the identity matrix on \(\^^k V\).)</p>
<p>This construction gives \((\b{x \^ y})^{(\cdot_1) ^2} = (\b{x \o x + y \o y}) = I_{xy}\), because we contracted the first indices together. When used on a vector as a rotation operator, we need a rule like this:</p>
\[R_{xy}^2 = - (\b{x \^ y})^{(\cdot_1)^2}\]
<p>In general, contracting operators that are going to act on grade-\(k\) objects gives \(O \circ O = (-1)^k O^{\cdot 2}\). But I don’t think it’s worth thinking too hard about this: the behavior is very specific to the usage.</p>
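The bivector-squaring claim is easy to check numerically by writing the interior product \(v \cdot (\b{x \^ y})\) as a matrix (a sketch, not canonical notation):

```python
import numpy as np

x, y = np.eye(3)[0], np.eye(3)[1]

# v . (x^y) = (v.x) y - (v.y) x, as a matrix acting on v:
R = np.outer(y, x) - np.outer(x, y)
assert np.allclose(R @ x, y) and np.allclose(R @ y, -x)

# Composing: R^2 = -(x (x) x + y (x) y), minus the projector onto the plane.
assert np.allclose(R @ R, -(np.outer(x, x) + np.outer(y, y)))

# So any vector in the x-y plane gets negated (two quarter-turns):
v = np.array([2.0, 3.0, 0.0])
assert np.allclose(R @ (R @ v), -v)
```

The matrix \(R^2\) is the (negated) result of the \(\cdot_1\) trace above, restricted to the \(\b{x}\)-\(\b{y}\) plane.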
<p><strong>Partial Star:</strong></p>
<p>One funny thing we can do with a partial trace is apply \(\star\) to one component of a multivector:</p>
\[\star_k \alpha = \alpha \cdot_k \omega\]
<p>Example in \(\bb{R}^3\):</p>
\[\begin{aligned}
\star_1 \b{x \^ y} &= (\star \b{x}) \o \b{y} - (\star \b{y}) \o \b{x} \\
&= (\b{y \^ z}) \o \b{y} - (\b{z \^ x}) \o \b{x}
\end{aligned}\]
<p>I would have thought this was overkill and would never be useful, but it turns out it has a usage in the next section.</p>
<p><strong>Coproduct slice:</strong></p>
<p>Prior to this section I haven’t really considered tensor powers of exterior algebras like \(\^ V \o \^ V\), except for wedge powers of matrices like \(\^^2 A\). But they do come up in the literature sometimes. Rota and co. had an operation they called the “coproduct slice” of a multivector, which splits a multivector in two by antisymmetrically replacing one of the \(\^\) positions with a \(\o\), like this:</p>
\[\p_{2,1} (\b{x \^ y \^ z}) = (\b{x \^ y}) \o \b{z} + (\b{y \^ z}) \o \b{x} + (\b{z \^ x}) \o \b{y}\]
<p>This gets at the idea that any wedge product (the free antisymmetric multilinear product) factors through the tensor product (the free multilinear product), and some concepts make more sense on the tensor product. For instance, it makes more sense to me to take the trace of two tensored terms than of two wedged terms. In general I am still trying to figure out for myself whether the “quotient algebra” or “antisymmetric tensor algebra” senses of \(\^\) are more important and fundamental, and the right way to think about the two.</p>
<p>Up to a sign, the coproduct slice can be implemented by tracing over the unit basis \(k\)-vectors:</p>
\[\p_{k, n-k} \beta = \sum_{\alpha \in \^^k V} \alpha \o (\alpha \cdot \beta )\]
<hr />
<h2 id="8-the-meet-vee">8. The Meet \(\vee\)</h2>
<p>\(\star\) maps every multivector to another one. Its action on the wedge product is to produce a dual operation \(\vee\), called the <em>meet</em> (recall that the wedge product is also aptly called the ‘join’).</p>
\[(\star \alpha) \vee (\star \beta) = \star(\alpha \^ \beta)\]
<p>The result is a complete exterior algebra because it’s isomorphic to one under \(\star\). So <em>both</em> of these are valid exterior algebras obeying the exact same rules:</p>
\[\^ V = (\^, V)\]
\[\vee V = (\vee, \star V)\]
<p>All operations work the same way if a \(\star\) is attached to every argument and we replace \(\^\) with \(\vee\):</p>
\[\star (\b{a} \^ \b{b}) = (\star \b{a}) \vee (\star \b{b})\]
<p>\(\vee \bb{R}^2\) is, for instance, spanned by \((\star 1, \star \b{x}, \star \b{y}, \star (\b{x} \^ \b{y})) = (\b{x \^ y}, \b{y}, - \b{x}, 1)\).</p>
<p>Sometimes \((\^, \v, V)\) is called a ‘double algebra’: a vector space with a choice of pseudoscalar and two dual exterior algebras. It’s also called the <a href="https://en.wikipedia.org/wiki/Grassmann%E2%80%93Cayley_algebra">Grassmann-Cayley algebra</a>. I like to write it as \(\^{ \v }V\).</p>
<p>The meet is kinda weird. It is sorta like computing the intersection of two linear subspaces:</p>
\[(\b{x \^ y}) \vee (\b{y \^ z}) = (\star\b{z}) \vee (\star\b{x}) = \star (\b{z \^ x}) = \b{y}\]
<p>But it only works if the degrees of the two arguments add up to \(\geq n\):</p>
\[\b{x} \vee \b{y} = \star(\b{y \^ z} \^ \b{z \^ x}) = 0\]
<p>A general definition is kinda awkward, but we can do it using the \(\star_k\) operation from the previous section. It looks like this:</p>
\[\alpha \vee \beta = (\star_{\| \beta \|} \alpha) \cdot \beta\]
<p>The \(\alpha\) will be inner-product’d with the \(\star\)‘d terms of \(\beta\). Recall that \(\star_k \beta\) becomes a sum of tensor products \(\beta_1 \o \beta_2\). We end up dotting \(\alpha\) with the first term:</p>
\[\alpha \vee \beta = [\sum_{\alpha_1 \^ \alpha_2 = \alpha} (\star \alpha_1) \o \alpha_2] \cdot \beta = \sum_{\alpha_1 \^ \alpha_2 = \alpha} (\star \alpha_1 \cdot \beta) \alpha_2\]
<p>(This is a sum over ‘coproduct slices’ of \(\alpha\), in one sense. This kind of sum is called ‘Sweedler Notation’ in the literature.) This is non-zero only if \(\beta\) contains all of the basis vectors <em>not</em> in \(\alpha\). It makes more sense on an example:</p>
\[\begin{aligned}
(\b{x \^ y}) \vee (\b{y} \^ \b{z}) &= \star_1 (\b{x \^ y}) \cdot (\b{y} \^ \b{z}) \\
&= ((\b{y \^ z}) \o \b{y} - (\b{z \^ x}) \o \b{x}) \cdot (\b{y \^ z}) \\
&= \b{y}
\end{aligned}\]
<p>In index notation:</p>
\[(\alpha \vee \beta)_K = \alpha_{IK} \, \e^{IJ} \beta_J\]
<p>Or we can directly translate \((\star \alpha) \vee (\star \beta) = \star(\alpha \^ \beta)\):</p>
\[(\star \alpha \vee \star \beta)_K = \e_{IJK} \alpha^I \beta^J\]
<p>Note: I got exhausted trying to verify the signs on this, so they might be wrong. At some point I’ll come back and fix them.</p>
<p>Note 2: remember that \(\star^{-1} = (-1)^{k(n-k)} \star \neq \star\) in some dimensions, so you need to be careful about applying the duality to compute \(\vee\): \(\alpha \vee \beta = \star(\star^{-1} \alpha \^ \star^{-1} \beta)\). Also note that, since \(\vee\) is defined in terms of \(\star\), it is explicitly dependent on the choice of \(\omega\).</p>
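Putting those two notes together, here’s a sketch of \(\alpha \vee \beta = \star(\star^{-1} \alpha \^ \star^{-1} \beta)\) on basis blades. The representation is a toy of my own (a multivector is a dict from sorted index tuples to coefficients), and all the helper names are made up:

```python
def sort_sign(seq):
    """Sign of the permutation that sorts seq (distinct integers)."""
    inv = sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
              if seq[i] > seq[j])
    return -1 if inv % 2 else 1

def wedge(a, b):
    """Join of multivectors {sorted index tuple: coeff}."""
    out = {}
    for I, ca in a.items():
        for J, cb in b.items():
            if set(I) & set(J):
                continue  # repeated basis vector: term dies
            key = tuple(sorted(I + J))
            out[key] = out.get(key, 0) + sort_sign(I + J) * ca * cb
    return {k: c for k, c in out.items() if c}

def star(a, n):
    """Hodge star: each blade goes to its signed complement."""
    out = {}
    for I, c in a.items():
        comp = tuple(i for i in range(n) if i not in I)
        out[comp] = out.get(comp, 0) + sort_sign(I + comp) * c
    return {k: c for k, c in out.items() if c}

def star_inv(a, n):
    """star^{-1} = (-1)^{k(n-k)} star on a grade-k blade."""
    return star({I: ((-1) ** (len(I) * (n - len(I)))) * c
                 for I, c in a.items()}, n)

def meet(a, b, n):
    return star(wedge(star_inv(a, n), star_inv(b, n)), n)

# R^3, axes (x, y, z) = (0, 1, 2): the planes x^y and y^z meet in the line y.
assert meet({(0, 1): 1}, {(1, 2): 1}, 3) == {(1,): 1}
# Two lines in R^3 don't meet (grades 1 + 1 < 3):
assert meet({(0,): 1}, {(1,): 1}, 3) == {}
```

Since \(k(n-k)\) is symmetric in \(k \leftrightarrow n-k\), it doesn’t matter whether the \(\star^{-1}\) sign is applied before or after the \(\star\).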
<p>As mentioned above, the symbols for join and meet are definitely <em>swapped</em> in a way that’s going to be really hard to fix now. It should be meet = \(\^\), join = \(\vee\), so it matches usages everywhere else, as well as usages of \(\cup\) and \(\cap\) from set / boolean algebras.</p>
<p>Since \(\vee V\) is part of another complete exterior algebra, it also has all of the other operations, including a ‘dual interior product’ \(\alpha \cdot_{\vee} \beta\). I have never actually seen it used, but it exists.</p>
<hr />
<h2 id="9-relative-vee_mu-_mu-and-star_mu">9. Relative \(\vee_\mu\), \(\^_\nu\), and \(\star_\mu\)</h2>
<p>We saw that \(\star\), and by extension \(\vee\), are defined relative to a choice of pseudoscalar \(\omega\). What if we choose differently? It turns out that this is actually occasionally useful – I saw it used in <em>Oriented Projective Geometry</em> by Jorge Stolfi, which develops basically all of exterior algebra under an entirely different set of names. We write \(\star_{\mu}\) and \(\vee_{\mu}\) for the star / meet operations relative to a ‘universe’ multivector \(\mu\):</p>
\[\star_{\mu} \alpha = \alpha \cdot \mu\]
\[(\star_\mu \alpha) \vee_{\mu} (\star_\mu \beta) = \star_{\mu} (\alpha \^ \beta)\]
<p>The regular definitions set \(\mu = \omega\). The resulting exterior algebra shows us that any subset of the basis vectors of a space spans an exterior algebra of its own. In case this seems like pointless abstraction, I’ll note that it does come up, particularly when dealing with projective geometry. If \(\b{w}\) is a projective coordinate, we can write the projective \(\star_{\b{wxyz}}\) in terms of \(\star_{\b{xyz}}\):</p>
\[\star_{\b{wxyz}}( w \b{w} + x \b{x} + y \b{y} + z \b{z}) = \b{w} \^ \star_{\b{xyz}}(x\b{x} + y\b{y} +z \b{z}) + w (\b{x \^ y \^ z})\]
<p>There is also a way to define \(\^\) relative to a ‘basis’ multivector, \(\^_{\nu}\). The behavior is to join two multivectors ignoring their component along \(\nu\):</p>
\[(\nu \^ \alpha) \^_{\nu} (\nu \^ \beta) = \nu \^ (\alpha \^ \beta)\]
<p>For unit \(\nu\), this can be implemented as:</p>
\[\alpha \^_{\nu} \beta = \nu \^ (\nu \cdot \alpha) \^ (\nu \cdot \beta)\]
<p>It’s neat that for choices of \(\nu, \mu\), we can produce another exterior double algebra embedded within \((\^, \v, V)\):</p>
\[(\^_{\nu}, \v_{\mu}, \nu, \mu, V)\]
<p>Our regular choice of exterior algebra on the whole space is then given by:</p>
\[(\^, \v, V) = (\^_1, \v_\omega, 1, \omega, V)\]
<hr />
<h2 id="10-the-geometric-product-alphabeta">10. The Geometric Product \(\alpha\beta\)</h2>
<p>There is much to say about <a href="https://en.wikipedia.org/wiki/Geometric_algebra">Geometric algebra</a> and the ‘geometric product’. (Other names: “Clifford Algebra”, “Clifford Product”.)</p>
<p>GA is how I got into this stuff in the first place, but I avoid using the name for the most part because there is some social and mathematical baggage that comes with it. But its proponents deserve credit for popularizing the ideas of multivectors in the first place – I’m pretty sure we all agree that multivectors, as a concept, should be used and taught everywhere.</p>
<p>The social baggage is: the field, while perfectly credible in theory, tends to attract an unusual rate of cranks (many of them ex-physics students who want to ‘figure it all out’ – like myself! I might be a crank. I’m not sure.). The mathematical baggage is the proliferation of notations that are hard to use and not very useful.</p>
<p>The geometric product is a generalization of complex- and quaternion-multiplication to multivectors of any grade. The inputs and outputs are linear combinations of multivectors of any grade. It’s generally defined as another quotient of the tensor algebra: instead of just \(x \o x \sim 0\), as in the exterior algebra, we use \(x \o y \sim - y \o x\) for orthogonal \(x, y\) (so we can still exchange positions of elements in a tensor), but \(x \o x \sim 1\). This means duplicate tensor terms are just replaced with \(1\) in tensor products, rather than annihilating the whole thing, like this:</p>
\[x \o x \o y \o x \o y \sim (x \o x) \o y \o (-y) \o x \sim -x\]
<p>The geometric product is the action of \(\o\) under this equivalence relation. In geometric algebra texts it is written with juxtaposition, since it generalizes scalar / complex multiplication that are written that way. I’ll do that for this section.</p>
\[(\b{xy})(\b{xyz}) = (\b{xy})(-\b{yxz}) = -(\b{x})(\b{xz}) = -\b{z}\]
<p>It’s associative, but not commutative or anticommutative in general.</p>
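On an orthonormal Euclidean basis, the geometric product of basis blades reduces to a sorting problem: swap adjacent distinct factors with a sign flip, and cancel equal neighbors to \(1\). A sketch (hypothetical <code>geo_product_blades</code> helper; positive metric only):

```python
def geo_product_blades(I, J):
    """e_I e_J for orthonormal basis blades (all squares +1): (sign, blade)."""
    seq, sign = list(I) + list(J), 1
    changed = True
    while changed:
        changed, i = False, 0
        while i < len(seq) - 1:
            if seq[i] == seq[i + 1]:
                del seq[i:i + 2]            # x x -> 1
                changed = True
            elif seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                sign = -sign                # x y -> -y x
                changed = True
            else:
                i += 1
    return sign, tuple(seq)

# R^3 with (x, y, z) = (0, 1, 2):
assert geo_product_blades((0, 1), (0, 1, 2)) == (-1, (2,))  # (xy)(xyz) = -z
assert geo_product_blades((0, 1), (0, 1)) == (-1, ())       # (xy)^2 = -1
assert geo_product_blades((0,), (1,)) == (1, (0, 1))        # xy = x ^ y
```

The \((\b{xy})^2 = -1\) line is the reason the even subalgebras reproduce complex numbers and quaternions, per the list below.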
<p>The primary reason to use this operation is that its implementations on \(\bb{R}^2\), \(\bb{R}^3\), and \(\bb{R}^{3,1}\) are already used:</p>
<ul>
<li>The geometric product on even-graded elements of \(\bb{R}^2\) implements complex multiplication.</li>
<li>The geometric product on even-graded elements of \(\bb{R}^3\) implements quaternion multiplication.</li>
<li>The geometric product on four elements \((\b{t, x, y, z})\) with the metric \(x^2 = y^2 = z^2 = -1\) is implemented by the <a href="https://en.wikipedia.org/wiki/Gamma_matrices">gamma matrices</a> \(\gamma^{\mu}\) which are used in quantum mechanics.
<ul>
<li>(I won’t discuss the alternate metric in this article, but it’s done by using \(x \o x \sim Q(x,x)\) in the quotient construction of the algebra, where \(Q\) is the symmetric bilinear form that’s providing a metric.)</li>
</ul>
</li>
</ul>
<p>Geometric algebra tends to treat the geometric product as fundamental, and then produce the operations from it. For vectors, the definitions are:</p>
\[\< \b{a}, \b{b} \> = \frac{1}{2}(\b{ab + ba})\]
\[\b{a} \^ \b{b} = \frac{1}{2}(\b{ab - ba})\]
<p>But we could also define things the other way:</p>
\[\b{ab} = \b{a \cdot b} + \b{a \^ b}\]
<p>Multivector basis elements are just written by juxtaposing the relevant basis vectors, since \(\b{xy} = \b{x \^ y}\). I like this notation and should start using it even if I avoid the geometric product; it would save a lot of \(\^\)s.</p>
<p>To define the geometric product in terms of the other operations on this page, we need to define the <strong>reversion</strong> operator, which inverts the order of the components in a geometric product (with \(k\) as the grade of the argument):</p>
\[(abcde)^{\dag} = edcba = (-1)^{k(k-1)/2} (abcde)\]
<p>This generalizes complex conjugation, since it takes \(\b{xy} \ra -\b{xy}\) in \(\bb{R}^2\) and \(\bb{R}^3\). It allows us to compute geometric products, which contract elements from inner to outer, using the operations already defined on this page, which I have defined as contracting left-to-right in every case. The general algorithm for producing geometric products out of previously-mentioned operations then is to try projecting the arguments onto <em>every</em> basis multivector:</p>
\[\alpha \beta = \sum_{\gamma \in \^^ V} (\gamma \cdot \alpha^\dag) \^ (\gamma \cdot \beta)\]
<p>This translates into index notation as:</p>
\[\alpha \beta = \sum_{\gamma \in \^^ V} (-1)^{\| \alpha \| ( \| \alpha \| -1)/2} \gamma_I \gamma_K \alpha^{I}_{[J}\beta^{K}_{L]}\]
<p>I think we can agree that’s pretty awkward. But it’s hard to be sure what to do with it. Clearly it’s <em>useful</em>, at least in the specific cases of complex and quaternion multiplication.</p>
<p>My overall opinion on the geometric product is this:</p>
<ul>
<li>I <em>tentatively</em> think that it is mis-defined to use inner-to-outer contraction, because of the awkward signs and conjugation operations that result.
<ul>
<li>I suspect the appeal of defining contraction this way was to make \((\b{xy})^2 = -1\), in order to produce something analogous to \(i^2 = -1\). But imo it’s really much more elegant if all basis elements have \(\alpha^2 = 1\).</li>
<li>If we want to preserve the existence of a multiplication operation with \(\alpha^2 = -1\), we can <em>define</em> the geometric product as \(\alpha \beta = \alpha^{\dag} \cdot \beta\) or something like that. Maybe.</li>
<li>Associativity is really nice, though. So maybe it’s my definition of the other products that’s wrong for doing away with it.</li>
</ul>
</li>
<li>However, it works suspiciously well for complex numbers, quaternions, and gamma matrices.</li>
<li>And it works suspiciously well for producing something that acts like a multiplicative inverse (see below).</li>
<li>But I know of almost zero cases where mixed-grade multivectors are useful, except for sums of “scalars plus one grade of multivector”.</li>
<li>I can’t find any general geometric intuition for the product in general.</li>
<li>So I’m mostly reserving judgment on the subject, until I figure out what’s going on more completely.</li>
</ul>
<hr />
<p><strong>Other operations of geometric algebra</strong></p>
<p>Unfortunately geometric algebra is afflicted by way too many other unintuitive operations. Here’s most of them:</p>
<ol>
<li><strong>Grade projection</strong>: \(\< \alpha \>_k = \sum_{\gamma \in \^^k V} (\gamma \cdot \alpha) \o \gamma\) extracts the \(k\)-graded terms of \(\alpha\).</li>
<li><strong>Reversion</strong>: \((abcde)^{\dag} = edcba = (-1)^{r(r-1)/2} (abcde)\). Generalizes complex conjugation.</li>
<li><strong>Exterior product</strong>: same operation as above, but now defined \(A \^ B = \sum_{r,s} \< \< A \>_r \< B \>_s \>_{r + s}\)</li>
<li><strong>Commutator product</strong>: \(A \times B = \frac{1}{2}(AB - BA)\). I don’t know what the point of this is.</li>
<li><strong>Meet</strong>: same as above, but now defined \(A \vee B = I((A I^{-1}) \^ (B I^{-1}))\). GA writes the pseudoscalar as \(I\) and \(AI^{-1} = \star^{-1} A\).</li>
<li><strong>Interior product</strong>: for some reason there are a bunch of ways of doing this.
<ul>
<li><strong>Left contraction</strong>: \(A ⌋ B = \sum_{r,s} \< \< A \>_r \< B \>_s \>_{r - s}\)</li>
<li><strong>Right contraction</strong>: \(A ⌊ B = \sum_{r,s} \< \< A \>_r \< B \>_s \>_{s - r}\)</li>
<li><strong>Scalar product</strong>: \(A * B = \sum_{r,s} \< \< A \>_r \< B \>_s \>_{0}\)</li>
<li><strong>Dot product</strong>: \(A \cdot B = \sum_{r,s} \< \< A \>_r \< B \>_s \>_{\| s - r \|}\)</li>
</ul>
</li>
<li>There are a few other weird ‘conjugation’ operations (see <a href="https://en.wikipedia.org/wiki/Paravector">here</a>) but I think they’re thankfully fading out of usage.</li>
</ol>
<hr />
<h2 id="11-multivector-division-alpha-1">11. Multivector division \(\alpha^{-1}\)</h2>
<p>Ideally division of multivectors would produce a multivector \(\alpha^{-1}\) that inverts \(\^\):</p>
\[\frac{\alpha \^ \beta}{\alpha} = \beta\]
<p>There are several problems with this, though. One is that \(\alpha \^ \beta\) may be \(0\). Another is that \(\^\) isn’t commutative, so presumably \(\alpha^{-1} (\alpha \^ \beta)\) and \((\alpha \^ \beta) \alpha^{-1}\) are different. Another is that \(\beta + K \alpha\) is also a solution for any \(K\):</p>
\[\alpha \^ (\beta + K \alpha) = \alpha \^ \beta\]
<p>Or for any multivector \(\gamma\) with \(\gamma \^ \alpha = 0\):</p>
\[\alpha \^ (\beta + \gamma) = \alpha \^ \beta\]
<p>So there are at least a few ways to define this.</p>
<p><strong>Multivector division 1</strong>: Use the interior product and divide out the magnitude:</p>
\[\alpha^{-1} \beta = \frac{\alpha}{\| \alpha \|^2} \cdot \beta\]
<p>This gives up on trying to find <em>all</em> inverses, and just identifies one of them. It sorta inverts the wedge product, except it extracts only the orthogonal component in the result:</p>
\[\b{a}^{-1} (\b{a} \^ \b{b}) = \frac{\b{a}}{\| \b{a} \|^2} \cdot (\b{a} \^ \b{b}) = \b{b} - \frac{\b{a} (\b{a} \cdot \b{b})}{\| \b{a} \|^2} = \b{b} - \b{b}_{\parallel \b{a}} = \b{b}_{\perp \b{a}}\]
<p>The result is the ‘rejection’ of \(\b{b}\) off of \(\b{a}\). It doesn’t quite ‘invert’ \(\^\), but it’s a pretty sensible result. It is commutative due to our definition of the two-sided interior product (both terms contract left-to-right either way). If \(\b{a \^ b} = 0\) in the first place, then this rightfully says that \(\b{b}_{\perp \b{a}} = 0\) as well, which is nice.</p>
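Here’s a quick numerical check of the rejection formula for vectors in \(\bb{R}^3\), with numpy (the function name is mine):

```python
import numpy as np

def wedge_divide(a, b):
    # a^{-1} (a ^ b) = b - a (a.b) / |a|^2: the rejection of b off of a
    return b - a * (a @ b) / (a @ a)

a = np.array([1.0, 0.0, 0.0])
b = np.array([2.0, 3.0, 0.0])
wedge_divide(a, b)   # [0., 3., 0.]: the component of b orthogonal to a
```

The result is always orthogonal to \(\b{a}\), as the formula promises.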
<p><strong>Multivector division 2</strong>: Allow the result to be some sort of general object, not a single-value:</p>
\[\alpha^{-1} \beta = \frac{\alpha}{\| \alpha \|^2} \cdot \beta + K\]
<p>where \(K\) is “the space of all multivectors \(\gamma\) with \(\alpha \^ \gamma = 0\)”. This operation produces the true preimage of multiplication via \(\^\), at the loss of an easy way to represent the result. But I suspect this definition is good and meaningful and is sometimes necessary to get the ‘correct’ answer.</p>
<p><strong>Multivector division 3</strong>: Use the geometric product.</p>
<p>The geometric product produces something that actually <em>is</em> division on GA’s versions of complex numbers and quaternions (even-graded elements of \(\^ \bb{R}^2\) and \(\^ \bb{R}^3\)):</p>
\[a^{-1} b = \frac{ab}{aa} = \frac{ab}{\| a \|^2}\]
<p>This is only defined for \(\| a \| \neq 0\) (remember, since GA has elements with \(\alpha^2 = -1\), you can have \(\| 1 + i \|^2 = 1^2 + i^2 = 0\)). You can read a lot about this inverse online, such as how to use it to reflect and rotate vectors.</p>
<hr />
<p>Cut for lack of time or knowledge:</p>
<ul>
<li>Exterior derivative and codifferential</li>
<li><a href="https://en.wikipedia.org/wiki/Cap_product">Cup and cap product</a> from algebraic topology. As far as I can tell these essentially implement \(\^\) and \(\vee\) on co-chains, which are more-or-less isomorphic to multivectors.</li>
</ul>
<hr />
<p>Other articles related to Exterior Algebra:</p>
<ol start="0">
<li><a href="/2018/08/06/oriented-area.html">Oriented Areas and the Shoelace Formula</a></li>
<li><a href="/2018/10/08/exterior-1.html">Matrices and Determinants</a></li>
<li><a href="/2018/10/09/exterior-2.html">The Inner product</a></li>
<li><a href="/2019/01/26/hodge-star.html">The Hodge Star</a></li>
<li><a href="/2019/01/27/interior-product.html">The Interior Product</a></li>
<li><a href="/2020/10/15/ea-operations.html">All the Exterior Algebra Operations</a></li>
</ol>
The essence of complex analysis2020-08-10T00:00:00+00:00http://alexkritchevsky.com/2020/08/10/complex-analysis<p>Rapid-fire intuitions for calculus on complex numbers, with little rigor.</p>
<p>Not an introduction to the subject.</p>
<!--more-->
<p>Contents:</p>
<ul id="markdown-toc">
<li><a href="#1-the-complex-plane" id="markdown-toc-1-the-complex-plane">1. The complex plane</a></li>
<li><a href="#2-holomorphic-functions" id="markdown-toc-2-holomorphic-functions">2. Holomorphic functions</a></li>
<li><a href="#3-residues" id="markdown-toc-3-residues">3. Residues</a></li>
<li><a href="#4-integral-tricks" id="markdown-toc-4-integral-tricks">4. Integral tricks</a></li>
<li><a href="#5-topological-concerns" id="markdown-toc-5-topological-concerns">5. Topological concerns</a></li>
<li><a href="#6-convergence-concerns" id="markdown-toc-6-convergence-concerns">6. Convergence concerns</a></li>
<li><a href="#7-global-laurent-series" id="markdown-toc-7-global-laurent-series">7. Global Laurent Series</a></li>
</ul>
<hr />
<h2 id="1-the-complex-plane">1. The complex plane</h2>
<p>Most of calculus on \(\bb{C}\) is actually just calculus on \(\bb{R}^2\), under the substitutions:</p>
\[\begin{aligned}
i &\lra R \\
x + iy & \lra (x + R y) \hat{x}
\end{aligned}\]
<p>Where \(R\) is a rotation operator. The identity \(\cos \theta + i \sin \theta = e^{i \theta}\) follows from applying the <a href="https://en.wikipedia.org/wiki/Exponential_map">exponential map</a> to \(R\) as the generator of rotations. If I had my way we would not use complex numbers ever and would just learn the subject as ‘calculus using rotation operators’ to avoid a proliferation of things that seem like magic.</p>
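To see the exponential-map claim concretely, here is a sympy sketch (my own check) exponentiating the \(2 \times 2\) rotation generator:

```python
import sympy as sp

theta = sp.symbols('theta', real=True)
R = sp.Matrix([[0, -1], [1, 0]])   # the generator: R^2 = -I, playing the role of i
rotation = (theta * R).exp()       # exp(theta R) = cos(theta) I + sin(theta) R
sp.simplify(rotation)              # [[cos(theta), -sin(theta)], [sin(theta), cos(theta)]]
```

So \(e^{\theta R}\) really is the rotation matrix, which is the coordinate-free version of \(\cos \theta + i \sin \theta = e^{i \theta}\).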
<p>\(\bb{R}^2\) is a two-dimensional space, though, while \(\bb{C}\) appears to have one ‘complex’ dimension. This is a bit strange, but for the most part you can just treat \(z, \bar{z}\) like any other two dimensional space. The tangent basis forms are:</p>
\[\begin{aligned}
dz &= dx + i dy \\
d\bar{z} &= dx - i dy
\end{aligned}\]
<p>The partial derivatives are for some reason given the name <a href="https://en.wikipedia.org/wiki/Wirtinger_derivatives">Wirtinger derivatives</a>:</p>
\[\begin{aligned}
\p_z &= \frac{1}{2}(\p_x - i \p_y) \\
\p_{\bar{z}} &= \frac{1}{2}(\p_x + i \p_y)
\end{aligned}\]
<p>The \(\frac{1}{2}\) factors and the swapping of signs is required such that \(\p_z (z) = \p_{\bar{z}} (\bar{z}) = 1\). In an alternate universe both sides might have had \(\frac{1}{\sqrt{2}}\) factors.</p>
<p>Note that any function that explicitly uses \(r\) or \(\theta\) has a \(\bar{z}\) dependency unless they cancel it out somehow (like \(z = re^{i \theta}\) does):</p>
\[\begin{aligned}
r &= \sqrt{z \bar{z}} \\
\theta &= - \frac{i}{2} \log \frac{z}{\bar{z}}
\end{aligned}\]
<hr />
<h2 id="2-holomorphic-functions">2. Holomorphic functions</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Riemann_equations">Cauchy-Riemann equations</a> tell you when a complex function \(f(z) = u(x+iy) + i v(x + iy)\) is complex-differentiable:</p>
\[\begin{aligned}
u_x = v_y\\
u_y = - v_x
\end{aligned}\]
<p>Being complex differentiable means that \(f(z)\) has a derivative that is itself a complex number: \((f_x, f_y) \in \bb{C}\) when regarded as part of \(\bb{R}^2\). In fact, the equations express the idea that \(f\) has no derivative with respect to \(\bar{z}\):</p>
\[\begin{aligned}
\p_{\bar{z}} f(z)
&= \frac{1}{2} (f_x + i f_y) \\
&\propto u_x + i v_x + i u_y - v_y \\
&= (u_x - v_y) + i (v_x + u_y) \\
&= 0 + i 0
\end{aligned}\]
<p>As long as \(f\) is continuous and this condition is true in a region \(D\), operations on \(f(z)\) essentially work like they would for one-variable functions in \(z\). For instance \(\p_z (z^n) = n z^{n-1}\).</p>
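The \(\p_{\bar{z}} f = 0\) condition is easy to verify mechanically. A sympy sketch applying the Wirtinger derivatives to \(f = z^3\):

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y
f = z**3                                  # holomorphic everywhere

d_zbar = (sp.diff(f, x) + sp.I * sp.diff(f, y)) / 2
d_z    = (sp.diff(f, x) - sp.I * sp.diff(f, y)) / 2

sp.simplify(d_zbar)            # 0: no zbar dependence
sp.simplify(d_z - 3 * z**2)    # 0: the z-derivative acts like 1d calculus
```

Running the same check on something like \(f = \| z \|^2 = z \bar{z}\) gives a nonzero \(\p_{\bar{z}}\), which is exactly why it fails to be holomorphic.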
<p>While \(z\) seems like a 2-dimensional variable, there’s only one ‘degree of freedom’ in the derivative of a complex function. \(f'(z)\) has to be a simple complex number, which rotates and scales tangent vectors uniformly (a <a href="https://en.wikipedia.org/wiki/Conformal_map">conformal map</a>):</p>
\[f(z + dz) \approx f(z) + f'(z) dz = f(z) + re^{i\theta} dz\]
<p>Functions which are complex-differentiable at every point within a region are called <a href="https://en.wikipedia.org/wiki/Holomorphic_function">holomorphic</a> in that region for some reason. A function \(f(z)\) that is holomorphic (or ‘regular’?) in a region \(D\) is <em>extremely</em> well-behaved:</p>
<ul>
<li>\(f\) is <em>infinitely</em> complex-differentiable</li>
<li>and \(f\) is ‘complex analytic’, ie equal to its Taylor series in \(z\) throughout \(D\). The series around any particular point converges within the largest circular disk that stays within \(D\).</li>
<li>and \(f\) is locally invertible wherever \(f'(z) \neq 0\), ie \(f^{-1}(w + dw) \approx z + dw / f'(z)\) exists and is holomorphic in the neighborhood of \(w = f(z)\).</li>
<li>its antiderivatives exist, and its integral along any closed contour that is contractible within \(D\) vanishes: \(\oint_C f(z) dz = 0\).</li>
<li>the data of \(f\) in \(D\) is fully determined by its values on the boundary of the region, or on any one-dimensional curve within \(D\), or on some nontrivial subregion of \(D\).</li>
</ul>
<p>The general theme is that holomorphic/analytic functions generally act like one-dimensional functions and all of the calculus is really easy on them; often even easier than actual 1d calculus.</p>
<p>If two analytic functions defined on different regions <em>agree</em> on an overlapping region, they are in a sense the ‘same function’. This lets you <a href="https://en.wikipedia.org/wiki/Analytic_continuation">analytically continue</a> a function by finding other functions which agree on a particular line or region. An easy case is to ‘glue together’ Taylor expansions around different points to go around a divergence.</p>
<p>Most 1d functions like \(e^x\) and \(\sin x\) have holomorphic complex versions like \(e^z\) and \(\sin z\) that are analytic everywhere. Discontinuous functions like \(\|z\|\) or \(\log z = \ln r + i \theta\), or functions that include an explicit or implicit \(\bar{z}\) dependency, fail to be analytic somewhere.</p>
<p>Complex differentiability fails at singularities. We categorize the types:</p>
<ul>
<li><em>poles</em> of order \(n\), around which \(f(z) \sim 1/z^n\), which are ‘well-behaved’ singularities. Around these there’s a region where \(1/f\) is analytic. ‘Zeros’ and ‘poles’ are dual in the sense that \(f \sim z^n\) at zeroes and \(f \sim 1/z^n\) at poles.</li>
<li><em>removable singularities</em>: singularities that can be removed by redefinition, probably because they’re an indeterminate form. The canonical example is \(\sin(z)/z\) which is repaired by defining \(\sin(0)/0 = 1\). In a sense these are not singularities at all.</li>
<li><em>essential singularities</em>: singularities which oscillate infinitely rapidly near a point, such that they are in a sense too complicated to handle. \(\sin(1/z)\) or \(e^{1/z}\) are the canonical examples. They all look like this, oscillating infinitely: <a href="https://en.wikipedia.org/wiki/Picard_theorem">Great Picard’s Theorem</a> (what a name) says that near an essential singularity the function takes every value infinitely many times, except possibly one.</li>
</ul>
<p>Poles are much more interesting than the other two.</p>
<hr />
<h2 id="3-residues">3. Residues</h2>
<p>No one would really care about complex analysis except for, well, <em>analysts</em>, were it not for one suspicious fact about the complex derivative:</p>
\[\p_{\bar{z}} \frac{1}{z} \neq 0\]
<p>Make sure you see that that’s a \(\bar{z}\)-derivative. For some reason, \(z^n\) for only \(n=-1\) has a certain kind of divergence at \(z=0\). It looks like a 2d <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">delta <strike>function</strike> distribution</a>:</p>
\[\p_{\bar{z}} \frac{1}{z} = 2 \pi i \delta (z, \bar{z})\]
<p>This is intrinsically related to the fact that we’re doing calculus in 2d. In 3d a similar property holds for \(1/r^2\), and in 1d it’s \(\p_x \log x = \frac{1}{x} + i \pi \delta(x)\) that has the delta term.</p>
<p>This is equivalent to saying that the contour integral (integral on a closed path) of \(1/z\) around the origin is non-zero:</p>
\[\begin{aligned}
\oint \frac{1}{z} dz &= \oint \frac{e^{i \theta} dr + ir e^{i\theta} d \theta }{r e^{i \theta}} \\
&= \oint \frac{dr}{r} + i d \theta \\
&= 2 \pi i
\end{aligned}\]
<p>It’s clear why this non-zero contour integral only occurs for \(z^{-1}\): for any other \(z^n\), the \(d \theta\) term is still a non-constant function of \(\theta\), so its values on each end cancel out. For \(n=-1\), though, the \(d \theta\) just counts the total change in angle.</p>
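You can watch the \(2 \pi i\) come out numerically. Here is a crude left-endpoint sum around the unit circle (a numpy sketch of my own):

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 100001)
z = np.exp(1j * theta)               # the unit circle
dz = np.diff(z)

integral = np.sum((1 / z[:-1]) * dz) # ~ 2 pi i for 1/z
other = np.sum(z[:-1]**2 * dz)       # ~ 0 for any other z^n
```

Only the \(n = -1\) power picks up the \(2 \pi i\); every other power sums to (numerically) zero around the loop.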
<p>The delta-function version follows from Stokes’ theorem. Since the contour integral gives the same value on any path as long as it circles \(z=0\), the divergence must be fully located at that point:</p>
\[\begin{aligned}
\oint_{\p D} \frac{1}{z} dz &= \iint_D d(\frac{dz}{z}) \\
2\pi i &= \iint_D \p_{\bar{z}} \frac{1}{z} d \bar{z} \^ dz \\
\p_{\bar{z}} \frac{1}{z} &\equiv 2 \pi i \delta(z, \bar{z})
\end{aligned}\]
<p>A function that is holomorphic except at a set of poles is called <em>meromorphic</em> (‘mero-‘ is <a href="https://www.etymonline.com/search?q=mero-">Greek</a>, meaning ‘part’ or ‘fraction’). If we integrate a meromorphic function around a region \(D\) the result only contains contributions from the \(\frac{1}{z}\) terms. Around each order-1 pole at \(z_k\), \(f(z) \sim \frac{f_{-1}}{z - z_k} + f^{*}(z)\) where \(f^{*}\) has no \((z-z_k)^{-1}\) term. The \(f_{-1}\) values at each pole are for some reason called <a href="https://en.wikipedia.org/wiki/Residue_theorem">residues</a>, and:</p>
\[\int_{\p D} f(z) dz = 2 \pi i \sum_{z_k} I(\p D, z_k) \text{Res} (f, z_k)\]
<p>Where \(I(\p D, z_k)\) gives the <a href="https://en.wikipedia.org/wiki/Winding_number">winding number</a> around the order-1 pole (+1 for single positive rotation, -1 for a single negative rotation, etc).</p>
<p>This makes integration of analytic functions around closed contours <em>really easy</em>; you can often just eyeball them:</p>
\[\oint_{\p D} \frac{1}{z-a} dz = 2\pi i 1_{a \in D}\]
<p>Multiplying and dividing powers of \((z-a)\) and then integrating around a curve containing \(a\) allows you to extract any term in the Taylor series of \(f(z)\) around \(a\):</p>
\[f_n = f^{(n)}(z) \vert_{z=a} = \frac{n!}{2 \pi i} \oint \frac{f(z)}{(z-a)^{n+1}} dz\]
<p>This is called <a href="https://en.wikipedia.org/wiki/Cauchy%27s_integral_formula">Cauchy’s Integral Formula</a>. When negative terms are present the Taylor series is instead called a <a href="https://en.wikipedia.org/wiki/Laurent_series">Laurent Series</a>.</p>
\[\begin{aligned}
f(z) &\approx \sum f_n \frac{(z-a)^n}{n!} \\
&= \ldots + \frac{f_{-1}}{z-a} + f_0 + f_{1} (z-a) + f_2 \frac{(z-a)^2}{2!} + \ldots
\end{aligned}\]
<p>In particular the value at \(z=a\) is fully determined by the contour integral with \((z-a)^{- 1}\):</p>
\[f(a) = f_0 = \frac{1}{2 \pi i} \oint \frac{f(z)}{z-a} dz\]
<p>You can, of course, formulate this whole thing in terms of \(f(\bar{z})\) and \(\frac{d\bar{z}}{\bar{z}}\) instead. If a function isn’t holomorphic in either \(z\) or \(\bar{z}\), you can still do regular \(\bb{R}^2\) calculus in two variables \(f(z, \bar{z})\), although I’m not sure how you would deal with poles.</p>
<p>There is a duality between zeroes and poles of meromorphic functions – in the region of a pole of a function \(f\), the function behaves like \(\frac{1}{g}\) where \(g\) is an analytic function. In general a meromorphic function can be written as \(f= \frac{h}{g}\) where \(g,h\) are analytic, with the zeroes of \(g\) corresponding to the poles of \(f\).</p>
<hr />
<h2 id="4-integral-tricks">4. Integral tricks</h2>
<p>Laurent series and the ‘calculus of Residues’ gives rise to a whole slew of integration tricks.</p>
<p>Closed integrals of a function with a Laurent series can be eyeballed using the Cauchy integral formula:</p>
\[\begin{aligned}
\oint_{r=1} \frac{1}{z(z-2)} dz &= \frac{1}{2} \oint_{r=1} \frac{1}{z-2} - \frac{1}{z} dz \\
&= 2 \pi i \frac{1}{2} (-1) \\
&= - \pi i
\end{aligned}\]
<p>Integrals along the real line \(\int_{-\infty}^{\infty}\) can often be computed by ‘closing the contour’ at \(r = \infty\). This is especially easy if the integrand vanishes as \(r \ra \infty\), but it also helps if it’s just easier to integrate there.</p>
\[\begin{aligned}
\int_{-\infty}^{\infty} \frac{1}{1 + x^2} dx &= \int_{r = -\infty}^{r = \infty} \frac{dz}{1 + z^2} + \int_{\theta=0, \, r=\infty}^{\theta=\pi, \, r=\infty} \frac{dz}{1 + z^2} \\
&= \oint \frac{1}{z - i} \frac{1}{z + i} dz \\
&= 2 \pi i \text{Res}(z=i, \frac{1}{z - i} \frac{1}{z + i}) \\
&= 2\pi i \frac{1}{2i} \\
&= \pi
\end{aligned}\]
<p>Here we closed the contour around the upper half-plane, on which the integrand vanishes since it falls off as \(1/r^2\). One pole is in the upper half-plane and one is in the lower. The winding number around the upper is \(+1\) and the residue is \(\frac{1}{z+i}\) evaluated at \(z=i\), or \(1/2i\). If we had used the lower half-plane the winding number would have been \(-1\) and the residue \(-1/2i\), so the result is independent of how we closed the contour. This method gives the answer very directly without having to remember that \(\int \frac{dx}{1 + x^2} = \tan^{-1} x\) or anything like that.</p>
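sympy will do this residue bookkeeping for you; its residue function computes exactly the quantity used above (a sketch of my own):

```python
import sympy as sp

z = sp.symbols('z')
f = 1 / (1 + z**2)

res = sp.residue(f, z, sp.I)        # residue at the pole z = i: 1/(2i) = -i/2
integral = 2 * sp.pi * sp.I * res   # close the contour in the upper half-plane
sp.simplify(integral)               # pi
```

This reproduces \(\int_{-\infty}^{\infty} \frac{dx}{1+x^2} = \pi\) without ever mentioning \(\tan^{-1}\).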
<p>Note that this wouldn’t work if the pole was <em>on</em> the path of integration, as in \(\int_{-\infty}^{+\infty} \frac{1}{x} dx\). This integral is the <a href="https://en.wikipedia.org/wiki/Cauchy_principal_value">Cauchy Principal Value</a> and is in a sense an indeterminate form like \(0/0\) whose value depends on the context. More on that another time.</p>
<p>Many other integrals are solvable by choosing contours that are amenable to integration. Often choices that keep \(r\) or \(\theta\) constant are easiest. See Wikipedia on <a href="https://en.wikipedia.org/wiki/Contour_integration">contour integration</a> for many examples.</p>
<hr />
<h2 id="5-topological-concerns">5. Topological concerns</h2>
<p>There are some tedious things you have to account for when considering functions of \(z\).</p>
<p>First, the \(\theta\) variable is discontinuous, since \(\theta = 0\) and \(\theta = 2\pi\) refer to the same point. This means that inverting a function of \(\theta\) will produce a <a href="https://en.wikipedia.org/wiki/Multivalued_function">multi-valued function</a>:</p>
\[\log e^{i \theta} = i \theta + 2 \pi i k_{\in \bb{Z}}\]
<p>Smoothly varying \(\theta = \int d \theta\) of course will just continue to tick up: \(2\pi, 4\pi, 6\pi\), etc. But the \(\log\) function itself appears to have a discontinuity of \(2\pi i\) at \(\theta = 0\).</p>
<p>When dealing with these multi-valued functions you can consider \(\theta = 0\) as a ‘branch point’ – a place where the function becomes multi-valued. But to be honest the whole theory of branch points isn’t very interesting if you aren’t a mathematician. I prefer to just think of all math being done modulo \(2 \pi\), or, if you need the discontinuity to count because you’re doing contour integrals, just get over the idea that functions can’t have multiple path-dependent values and don’t demand it have a unique inverse.</p>
<p>Another topological interest in \(\bb{C}\): if you ‘join together’ the points at infinity in every direction by defining a symbol \(\infty\) such that \(1/0 = \infty\), you get the “extended complex plane” or the <a href="https://en.wikipedia.org/wiki/Riemann_sphere">Riemann sphere</a>, since it is topologically shaped like a sphere. Most of the things that seem like they should be true involving \(\infty\) are true in this case. For example, the asymptotes of \(\frac{1}{z}\) on either side of \(\| z \| = 0\) really <em>do</em> connect at infinity and come back on the other side.</p>
<p>The Riemann sphere is topologically like a sphere, but acts like a <em>projective</em> plane, which is a bit unintuitive. (This corresponds rather to a half sphere where antipodal points are considered equivalent). Particularly, it kinda seems like \(+r\) and \(-r\) should be different points, rather than the ‘same’ infinity. There is probably a resolution to this using <a href="https://en.wikipedia.org/wiki/Oriented_projective_geometry">oriented projective geometry</a>, defining the back half of the sphere as a second copy of \(\bb{C}\) and conjoining the two at \(\infty e^{i \theta} \lra -\infty e^{i\theta}\), but that’s not worth pursuing further here.</p>
<p>Complex analytic functions map the Riemann sphere to itself in some way. For instance, \(z \mapsto \frac{1}{z}\) swaps \(0\) and \(\infty\) and the rest of the sphere comes along for a ride. Powers of \(z\) cause the mapping to be \(m:n\) – so \(z^2\) maps two copies of the sphere to one copy, while \(z^{1/2}\) maps one copy to two copies, hence becoming multi-valued. The <a href="https://en.wikipedia.org/wiki/M%C3%B6bius_transformation">möbius transformations</a>, functions of the form \(\frac{az + b}{cz + d}\) with \(ad-bc \neq 0\), are the invertible holomorphic transformations of the Riemann sphere. They comprise the translations, dilations, rotations, and inversions of \(\bb{C}\).</p>
<hr />
<h2 id="6-convergence-concerns">6. Convergence concerns</h2>
<p>Although Laurent series capture the properties of complex analytic functions well, they still only work within a definite radius of convergence, given by the distance to the closest pole. Sometimes we can expand around other points to work around this. A common choice is expanding around \(z=\infty\), by creating a series in \(1/z\) instead:</p>
\[\frac{1}{1-z} = \begin{cases}
1 + z + z^2 + \ldots & \| z \| < 1 \\
-\frac{1}{z} - \frac{1}{z^2} - \frac{1}{z^3} - \ldots & \| z \| > 1
\end{cases}\]
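The two expansions really do converge to the same function in their respective regions; a quick check in plain Python:

```python
def inside(z, terms=200):
    # 1/(1-z) for |z| < 1: 1 + z + z^2 + ...
    return sum(z**k for k in range(terms))

def outside(z, terms=200):
    # 1/(1-z) for |z| > 1: -1/z - 1/z^2 - ...
    return sum(-z**-k for k in range(1, terms))

inside(0.5)    # ~ 2.0  = 1/(1 - 0.5)
outside(3.0)   # ~ -0.5 = 1/(1 - 3.0)
```

Swap the arguments across \(\| z \| = 1\) and each sum blows up instead, which is the radius-of-convergence story in miniature.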
<p>Amusingly, you can keep changing the point you expand around to ‘go around’ a pole, producing an analytic continuation outside the radius of the initial Taylor series.</p>
<p>Complex Taylor series diverge for the same reasons that real ones do, but the choices of radius make a lot more sense in complex analysis than they do in real analysis: they are the distance to the closest singularity (for instance, \(\frac{1}{1 + x^2}\) around \(x=0\) has radius of convergence \(r=1\) since there are poles at \(\pm i\)).</p>
<p>The simplest way to show that a series converges is to show that it still converges when \(z\) is replaced with \(r = \|z\|\), since</p>
\[\| f(z) \| = \| a_0 + a_1 z + a_2 z^2 + \ldots \| \leq \| a_0 \| + \| a_1 \| r + \| a_2 \| r^2 + \ldots\]
<p>After all, the phases of the \(z\) terms can only serve to reduce the sums of the magnitudes.</p>
<p>We know that geometric series \(1 + x + x^2 + \ldots\) converge only if \(\| x \| < 1\). This means that \(\sum a_n r^n\) definitely converges if the terms look like \(\sqrt[n]{\| a_n r^n \|} = \sqrt[n]{\| a_n \| } r \lt 1\), which gives the <a href="https://en.wikipedia.org/wiki/Root_test">root test</a> for convergence:</p>
\[r < R = \frac{1}{\limsup_{n \ra \infty} \sqrt[n]{\| a_n \| }}\]
<p>If \(r = R\) the series still might converge (depending on what the phases of \(a_n\) do!); if \(r > R\) it definitely doesn’t. If \(R = \infty\) then the series converges for all finite \(\|z\|\) and is called an ‘entire’ function, which is a weird name.</p>
<p>The root test is the most powerful of the simple convergence tests, because it hits exactly on the property that \(\sum \| a_n \| r^n\) converges if it’s less than a geometric sum \(\sum x^n\). Other tests ‘undershoot’ this property; for instance the ratio test says that</p>
\[\| \frac{a_{n+1} r^{n+1} }{a_{n} r^{n}} \| = \| \frac{a_{n+1}}{a_{n}} \| r < 1\]
<p>This captures the idea that the series does converge if its successive ratios are less than that of a geometric series, but fails if the terms look like \(x + x + x^2 + x^2 + x^3 + x^3 + \ldots\) or something.</p>
<hr />
<h2 id="7-global-laurent-series">7. Global Laurent Series</h2>
<p>This is my own idea for making divergence of Laurent series more intuitive.</p>
<p>Laurent series coefficients are derivatives of the function evaluated at a particular point, like \(f^{(n)}(z=0)\), such that a whole Laurent series is</p>
\[f(z) = \ldots + f^{(-2)}(0) \frac{2! }{z^2} - f^{(-1)}(0) \frac{1!}{z} + f(0) + f^{(1)}(0) z + f^{(2)}(0) \frac{z^2}{2!} + \ldots\]
<p>Suppose that for some reason the Cauchy forms of computing derivatives and ‘inverse’ derivatives are the ‘correct’ way to compute these values:</p>
\[\begin{aligned}
f(0) &= \frac{1}{2\pi i} \oint_{C} \frac{f(z) dz}{z} \\
\frac{f^{(n)}(0)}{n!} &= \frac{1}{2\pi i} \oint_{C} \frac{f(z) dz}{z^{n+1}} \\
(-1)^n n! f^{(-n)}(0) &= \frac{1}{2\pi i}\oint_{C} z^{n-1} f(z) dz \\
\end{aligned}\]
<p>Where \(C\) is a circle of radius \(R\) around \(z=0\). Then some handwaving leads to an alternate characterization of divergent series. For most functions these values are independent of the choice of \(C\), but for a function with a pole away from the origin, they are not. Consider \(f(z) = \frac{1}{1-z}\), and let \(C\) be the positively oriented circle of fixed radius \(R\). Then:</p>
\[\begin{aligned}
f_R(0) &= \frac{1}{2\pi i}\oint_{C} \frac{1}{(z)(1-z)} dz \\
&= \frac{1}{2\pi i}\oint_{C} \frac{1}{z} + \frac{1}{1-z} dz \\
&=\text{Res}_C (z=0, \frac{1}{z} - \frac{1}{z-1}) + \text{Res}_C (z=1, \frac{1}{z} - \frac{1}{z-1}) \\
&= 1 - H(R-1) \\
\end{aligned}\]
<p>Where \(H\) is a <a href="https://en.wikipedia.org/wiki/Heaviside_step_function">step function</a> \(H(x) = 1_{x > 0}\). The value of \(f(0)\) changes depending on the radius we ‘measure’ it at. The derivative and integral terms show the same effect, after computing some partial fractions:</p>
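The radius dependence is easy to see numerically: integrating \(f(z)/z\) for \(f = \frac{1}{1-z}\) over circles on either side of the pole at \(z=1\) gives different answers. A numpy sketch (names mine):

```python
import numpy as np

def f_R(radius, n=200001):
    # (1 / 2 pi i) * contour integral of f(z)/z over the circle |z| = radius,
    # with f(z) = 1/(1-z)
    theta = np.linspace(0, 2 * np.pi, n)
    z = radius * np.exp(1j * theta)
    integrand = 1 / (z * (1 - z))
    return np.sum(integrand[:-1] * np.diff(z)) / (2j * np.pi)

f_R(0.5)   # ~ 1: only the pole at z=0 is enclosed
f_R(2.0)   # ~ 0: the pole at z=1 contributes -1 as well
```

Which is exactly the \(1 - H(R-1)\) behavior: the ‘measured’ value of \(f(0)\) jumps as the measuring circle crosses the pole.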
\[\begin{aligned}
f_R'(0) &= \frac{1}{2\pi i}\oint_{C} \frac{1}{(z^2)(1-z)} dz \\
&= \frac{1}{2\pi i}\oint_{C} \frac{1}{z} + \frac{1}{z^2} - \frac{1}{z-1} dz \\
&= 1 - H(R-1) \\
f^{(-1)}_R(0) &= \frac{1}{2\pi i}\oint_{C}\frac{- 1}{z-1} dz \\
&=-H(R-1)
\end{aligned}\]
<p>In total we get, using \(H(x) = 1 - H(-x)\):</p>
\[f^{(n)}_R(0) = \begin{cases}
H(1-R) & n \geq 0 \\
- H(R-1) & n < 0
\end{cases}\]
<p>Which gives the ‘real’ Laurent series as:</p>
\[\frac{1}{1-z} = - (\; \ldots + z^{-2} + z^{-1}) H(\|z\| - 1) + (1 + z + z^2 + \ldots) H(1 - \|z\|)\]
<p>The usual entirely-local calculations of \(f'(z)\), etc miss the ‘global’ property: that the derivative calculations fail to be valid beyond \(R=1\), and a whole different set of terms become non-zero, which correspond to expansion around \(z=\infty\).</p>
<p>Which if you ask me is very elegant, and very clearly shows why the radius of convergence of the conventional expansion around \(z=0\) is the distance to the closest pole. Of course it is a bit circular, because to get this we had to choose to use circles \(C\) to measure derivatives, but that’s ok.</p>
<hr />
<p>In summary: don’t use complex numbers. Please use \(\bb{R}^2\) if you can.</p>
The essence of quantum mechanics2020-07-24T00:00:00+00:00http://alexkritchevsky.com/2020/07/24/qm<p>Here’s what I know about QM. I’m trying to learn QFT and it helps to have the prerequisites compressed into the simplest possible representation. It also helps me to write everything down in a compressed form so I can reference it more easily.</p>
<p>This will make no sense if you don’t already have a good understanding of quantum mechanics.</p>
<p>Conventions: \(c = 1\), \(g_{\mu \nu} = diag(+, -, -, -)\). I like to write \(S_{\vec{x}}\) for \(\nabla S\).</p>
<!--more-->
<hr />
<h2 id="1-qm-but-starting-from-the-solutions">1. QM but starting from the solutions</h2>
<p>QM makes a lot more sense to me if you (a) handle everything relativistically from the start and (b) just assume the form of the wave function solutions instead of deriving them. If I had my way I’d start a quantum mechanics course with special relativity, followed by introducing the scalar wave function, like this:</p>
<p>Consider a function on spacetime with the form \(\psi(t, \vec{x}) = e^{ i S(t, \vec{x})/\hbar}\) which assigns a complex phase to every point. It is fully determined by the <strong>action</strong> \(S(\vec{x}, t)\), and in particular given an initial state \(\psi_0\), is determined by the action gradient \(dS = S_{\mu} dx^\mu\). This lets us compare quantum states by integrating over some path \(\Gamma\):</p>
\[\psi(t, \vec{x}) = e^{i/\hbar \int_{\Gamma} dS} \psi_0\]
<p>Later on when potentials are involved we will need to be specific about the path of integration, but for now we can think of \(S\) as a scalar function that determines \(\psi\) everywhere.</p>
<p>Relativistic invariance insists that \(\psi\) have the same value in any reference frame, and \(- i \hbar \p \psi = - i \hbar (i \p S / \hbar) \psi = (\p S) \psi = (S_t, S_{\vec{x}}) \psi\) must be a covariant 4-vector. Contraction with \(\bar{\psi}\) extracts the vector components: \(\< \psi \| {- i \hbar \p} \| \psi \> = \bar{\psi} (S_t, S_{\vec{x}}) \psi = (S_t, S_{\vec{x}})\). Finally, \(\| (S_t, S_{\vec{x}}) \| = \sqrt{S_{t}^2 - S_{\vec{x}}^2}\) must be a Lorentz-invariant scalar.</p>
<p>We call \(i \hbar \p_t = \hat{E}\) and \(- i \hbar \p_x = \hat{P}\) the <strong>energy</strong> and <strong>momentum</strong> operators. The quantum mechanical operators apparently extract properties of \(S\), but because \(S\) is packed inside an exponential, they extract them as eigenvalues: \(i \hbar \p_t \psi = - S_t \psi\). Our quantum-mechanical inner product and our operators are just <em>tools for extracting properties of \(S\)</em>, since \(\psi\) is the only thing we can directly operate on. When an equation like the Schrödinger equation contains a \(\hat{P} = - i \hbar \p_x\) operator, it’s just a skew way of writing the \(p_x\) value.</p>
<p>Since quantum mechanical measurements only happen through operators like these, the exact values of \(\psi\) up to a phase, and therefore \(S\) up to a constant, are not physically observable.</p>
<p>For a free massive spinless particle the action is \(S = - p_{\mu} x^{\mu} = \int -p_\mu dx^{\mu}\), where \(p\) is the four-momentum and \(\| p \| = m\), the rest mass. In the rest frame this is simply \(S = -m \tau = - \int m d\tau\). In the absence of a potential this gives the wave function:</p>
\[\psi(x) = e^{- i/\hbar \int_{0}^{x} p_\mu dx^\mu} \psi(0) = e^{- i/\hbar p_\mu x^\mu} \psi(0) = e^{i/\hbar ( \vec{p} \cdot \vec{x} - E t)} \psi(0)\]
<p>which is a Fourier component with momentum \(p_\mu\). Time evolution via exponentiation of the Hamiltonian amounts to translating in \(t\):</p>
\[\psi(t + \Delta t) = e^{-i/\hbar \hat{H} \Delta t} \psi(t) = e^{\Delta t \p_t} \psi(t) = e^{i/\hbar (\vec{p} \cdot \vec{x} - E(t + \Delta t))} \psi_0\]
<p>(This uses the idea that exponentiating a differential operator translates in that coordinate: \(e^{a \p_x} f(x) = f(x+a)\).)</p>
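<p>That identity is easy to check by truncating the series \(e^{a \p_x} = \sum_k \frac{(a \p_x)^k}{k!}\). A small sketch (the helper names are made up; <code>sin_derivs(k, x)</code> returns the \(k\)-th derivative of \(\sin\) at \(x\)):</p>

```python
import math

def translate(f_derivs, x, a, terms=30):
    """Apply the truncated series e^{a d/dx} f = sum_k a^k f^(k)(x) / k!."""
    return sum(a**k * f_derivs(k, x) / math.factorial(k) for k in range(terms))

# Derivatives of sin cycle through sin, cos, -sin, -cos:
def sin_derivs(k, x):
    return [math.sin, math.cos,
            lambda t: -math.sin(t), lambda t: -math.cos(t)][k % 4](x)

x, a = 0.3, 1.1
print(translate(sin_derivs, x, a), math.sin(x + a))  # the two agree
```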
<p>When an initial state is not a pure Fourier mode with a definite momentum, we expand it as a sum of modes. For instance, if at \(t=0\) we measure an electron at \(\vec{x} = 0\), then the initial state is</p>
\[\psi(0, \vec{x}) = \delta(\vec{x}) = \int e^{i \vec{p} \cdot \vec{x}} d \vec{p}\]
<p>When potentials are involved, \(dS\) is modified. The electromagnetic field, for instance, enters as \(p \mapsto p - qA\), so \(dS = (p_{\mu} - q A_{\mu}) dx^{\mu}\). Depending on the field configuration we may no longer be able to easily integrate \(\int dS\): if \(A\) includes a current, then it contains a ‘line’ of divergence, and the path integral’s result will depend on how many times \(\Gamma\) circles this divergence, giving <em>different</em> values based on the choice of path. Summing over these paths, with appropriate weighting, corresponds in QFT to summing over the number of photons that are exchanged (I think. Will work it out in detail when I get to QFT).</p>
<hr />
<h2 id="2-correspondences">2. Correspondences</h2>
<p>Many concepts in quantum mechanics follow naturally from this foundation:</p>
<p><strong>Mass</strong>: For a free particle \(S_t = E\) and \(S_{\vec{x}} = \vec{p}\), and \(m = \sqrt{E^2 - p^2}\) is the relativistic rest energy/momentum relation. The wave function looks like \(\psi = e^{i/\hbar \int \vec{p} \cdot d\vec{x} - E dt} \psi_0\). A high energy/momentum corresponds to a rapidly changing action, and thus to a wave function that is <em>quickly rotating</em> as you translate in time or space. Ultimately, the mass \(m\) corresponds to the speed of phase rotation in a particle’s rest frame, and its energy and momentum are the results of Lorentz-transforming \(dS = - m d\tau\) into other frames.</p>
<p><strong>Path integration</strong>: Relative changes in \(S\) can be found by integrating: \(S(f) - S(i) = \int_{\Gamma} dS\) along any curve \(\Gamma\) from \(i\) to \(f\), and \(\psi(f) = e^{i/\hbar (S(f) - S(i))} \psi(i)\). Thus \(e^{i/\hbar (S(f) - S(i))}\) is the ‘transition matrix’ between any two states, along a given path. The total transition amplitude is a sum over all possible paths between two states. This extends handily to QFT’s path integrals when creation/annihilation of particles is included.</p>
<p><strong>The roles of \(\hbar\) and \(i\)</strong>: \(S \mapsto e^{iS / \hbar }\) is the conversion from ‘action’ space to ‘phase’ space. \(\hbar\) changes units from action (energy \(\times\) time) to radians; if we set \(\hbar = 1\) we are declaring that we measure action in radians. The resulting space after mapping by \(e^{iS}\) is physically meaningful, because in some cases we’ll end up summing these phase factors from multiple starting states and seeing interference patterns. I suspect that the output space is the \(U(1)\) that is identified with the electromagnetic gauge field but am not sure. If so, I think it would be good to write \(R_{EM}\) instead of \(i\), in order to avoid accidentally conflating the \(i\) factors from rotations in different spaces. (Actually that’s my stance on \(i\) in general.)</p>
<p><strong>Angular momentum</strong>: The orbital angular momentum operator, \(\hat{L}_z = -i \hbar \p_{\phi}\), does the same thing as \(\hat{P} = - i \hbar \p_{\vec{x}}\) but for a wave function in spherical coordinates. The azimuthal angle term looks like \(\psi \sim e^{i/\hbar (l_z \phi - E t)}\), and \(\hat{L}_z \psi = l_z \psi\). The azimuthal quantum number \(l_z\) (often written \(m\)) measures how many times \(\psi\) oscillates in a full rotation of the azimuthal angle \(\phi\); it is quantized precisely because the \(\phi\) coordinate has a built-in periodicity. A \(z\)-angular momentum value of \(l_z\) labels the number of periods the wave makes as you rotate \(\phi\) about the \(z\)-axis.</p>
<p><strong>Spin-\(\frac{1}{2}\)</strong>: If \(l_z = 1/2\), then \(\psi_{\pm} \sim e^{i/\hbar (\pm \frac{1}{2} \phi - E t)}\) acts like a spinor (by modeling the spin as orbital angular momentum, and omitting the \(r\) and \(\theta\) components). This function appears trivially unphysical, since it has different values at \(\phi = 0\) vs \(\phi = 2 \pi\). The resolution is the fact that it’s only meaningful to use the wave function to <em>compare</em> states that are connected by a path – and for a spinor it’s correct that \(\< \psi(\phi = 2 \pi) \| \psi(\phi = 0) \> = -1\). (This is a useful mental model but isn’t the full story. My next post will be a truly exhaustive exploration of spinors.) (Much later edit: this next post got very difficult for me to finish. Hopefully I can get back to it someday.)</p>
<p><strong>Spin-\(1\)</strong>: A <em>vector</em>-valued wave function \(\vec{\psi} = (\psi_x, \psi_y, \psi_z)\), where the terms transform according to physical rotations, is a spin-1 object. To consider its \(z\)-angular momentum we change bases to a <a href="https://en.wikipedia.org/wiki/Spherical_basis">spherical basis</a> (not to be confused with spherical coordinates):</p>
\[(\hat{x}, \hat{y}, \hat{z}) \ra (\frac{\hat{x} + i \hat{y}}{\sqrt{2}}, \hat{z}, \frac{\hat{x} - i \hat{y}}{\sqrt{2}})\]
<p>Or in cylindrical coordinates, using \(\hat{x} = (\cos \phi )\hat{\rho} - (\sin \phi )\hat{\phi}\) and \(\hat{y} = (\sin \phi) \hat{\rho} + (\cos \phi) \hat{\phi}\) (with \(\hat{\rho}, \hat{\phi}\) the unit vectors):</p>
\[= (\frac{e^{i \phi} (\hat{\rho}+ i \hat{\phi})}{\sqrt{2}}, \hat{z}, \frac{e^{- i \phi} (\hat{\rho} - i \hat{\phi})}{\sqrt{2}})\]
<p>The coordinates of \(\vec{\psi}\) in this basis are:</p>
\[(\psi_{+1}, \psi_0, \psi_{-1}) = (\frac{\psi_x - i \psi_y}{\sqrt{2}}, \psi_z, \frac{\psi_x + i \psi_y}{\sqrt{2}})\]
<p>In the new basis, the coordinate vectors have an explicit \(\phi\)-dependence, which captures the idea that any vector-valued function has an <em>intrinsic</em> \(\phi\)-derivative, independent of reference frame, just by virtue of being a vector. (This is kinda obvious in hindsight, but it took me forever to understand.)</p>
<p>So the components of a vector wave function \(\vec{\psi}\) locally look like \(\psi_{s_z}(\phi, \rho, z, t) \sim e^{i (s_z + l_z) \phi } \psi_{s_z}(\rho, z, t)\), where \(l_z\) is its orbital angular momentum (in units of \(\hbar\)) and \(s_z \in (+1, 0, -1)\) is a frame-independent contribution just from its vectorial nature. The \(s_z = 0\) component corresponds to a vector-valued wave function pointing only in the \(z\) direction. The \(s_z = \pm 1\) components correspond to having \(x\) or \(y\) components, with the sign determined by their relative phase.</p>
<p>Note what it means to have spin \(1\): it’s not that it fixes the <em>value</em> of the angular momentum; rather, it specifies the different ways that the angular momentum can transform under rotation. The three choices determine whether \(\vec{\psi}\) is in the \(z\) direction \((s_z = 0)\) or whether it has a positive or negative ‘rotational’ component in the \(xy\) plane (\(s_z = \pm 1\)). Particularly, having angular momentum \(s_z = +1\) means that the \(y\) component is advanced in phase compared to the \(x\) component. This is why the ‘ladder’ operator \(\hat{S}_+ = \hat{S}_x + i \hat{S}_y\) serves to increase the angular momentum, because it includes a factor of \(e^{i \phi}\):</p>
\[\hat{L}_+ = (\hat{L}_x + i \hat{L}_y) \sim e^{i \phi}\]
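<p>All of this can be made concrete with the \(3 \times 3\) spin-1 matrices \((S_k)_{ij} = -i \epsilon_{kij}\), the generators of rotations acting on ordinary vectors. A numpy sketch (\(\hbar = 1\); the sign conventions here are mine):</p>

```python
import numpy as np

# Spin-1 matrices on ordinary 3-vectors: (S_k)_ij = -i * epsilon_kij.
Sx = -1j * np.array([[0, 0, 0], [0, 0, 1], [0, -1, 0]])
Sy = -1j * np.array([[0, 0, -1], [0, 0, 0], [1, 0, 0]])
Sz = -1j * np.array([[0, 1, 0], [-1, 0, 0], [0, 0, 0]])

# They satisfy the angular momentum algebra [Sx, Sy] = i Sz:
assert np.allclose(Sx @ Sy - Sy @ Sx, 1j * Sz)

# The spherical-basis vectors diagonalize Sz with eigenvalues +1, 0, -1:
e_p = np.array([1, 1j, 0]) / np.sqrt(2)   # (x + iy)/sqrt(2)
e_0 = np.array([0, 0, 1])                 # z
e_m = np.array([1, -1j, 0]) / np.sqrt(2)  # (x - iy)/sqrt(2)
assert np.allclose(Sz @ e_p, +1 * e_p)
assert np.allclose(Sz @ e_0, 0 * e_0)
assert np.allclose(Sz @ e_m, -1 * e_m)

# The ladder operator S+ = Sx + i Sy raises s_z by one step, with the
# expected coefficient sqrt(j(j+1) - m(m+1)) = sqrt(2):
Splus = Sx + 1j * Sy
assert np.allclose(Splus @ e_m, np.sqrt(2) * e_0)
assert np.allclose(Splus @ e_0, -np.sqrt(2) * e_p)
```

<p>(The minus sign in the last line is the kind of phase that the usual Condon–Shortley convention absorbs into the definition of the basis vectors.)</p>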
<aside class="toggleable" id="angular" placeholder="<b>Aside</b>: Angular momentum calculations <em>(click to expand)</em>">
<p>Here are some calculations I did to make sure I wasn’t lying through my teeth here:</p>
<p>The angular momentum operators are \(\vec{x} \^ \hat{P} = - i \hbar (y \p_z - z \p_y, z \p_x - x \p_z, x \p_y - y \p_x)\), giving:</p>
\[\begin{aligned}
\hat{L}_z \psi = l_z \psi &= - i \hbar (x \p_y - y \p_x) \psi = (x p_y - y p_x) \psi \\
\end{aligned}\]
<p>etc. Another form is \(\hat{L}_z = -i \hbar \p_{\phi}\):</p>
\[\begin{aligned}
\hat{L}_z \psi &= - i \hbar \p_{\phi} \psi \\
&= -i \hbar (x_{\phi} \p_x + y_{\phi} \p_y) \psi \\
&= -i \hbar (-y \p_x + x \p_y) \psi \\
&= (x \hat{P}_y - y \hat{P}_x) \psi
\end{aligned}\]
<p>Thus a function of the form \(\psi = e^{i l_z \phi /\hbar}\) has \(\hat{L}_z \psi = l_z \psi\).</p>
<p>The \(\hat{L}_x\) and \(\hat{L}_y\) operators have less-pleasant forms in spherical coordinates:</p>
\[\begin{aligned}
\hat{L}_x &= -i \hbar ({- \sin} (\phi) \p_{\theta} - \cot(\theta) \cos (\phi )\p_{\phi}) \\
\hat{L}_y &= -i \hbar (\cos (\phi) \p_{\theta} - \cot(\theta) \sin (\phi )\p_{\phi}) \\
\end{aligned}\]
<p>The failure of commutation, such as \([\hat{L}_x, \hat{L}_z] \neq 0\), comes from the fact that this adds \(\phi\)-dependencies that will affect the \(l_z\) value.</p>
<p>Now look at the raising operator \(L_+\):</p>
\[\begin{aligned}
L_{+} &= L_x + i L_y \\
&= -i \hbar ((-\sin \phi + i \cos \phi) \p_{\theta} - \cot(\theta) (\cos \phi + i \sin \phi) \p_{\phi})\\
&= -i \hbar (e^{i \phi} )(i \p_{\theta} - \cot(\theta) \p_{\phi})
\end{aligned}\]
<p>Ignoring the coefficient this produces (I’m told it’s \(\hbar \sqrt{j(j+1) - l_z (l_z+1)}\)), the reason that it raises the \(l_z\) value is the inclusion of an \(e^{i \phi}\) term, giving \(e^{i \phi} e^{i l_z \phi} = e^{i (l_z + 1) \phi}\).</p>
<p>A constant vector function is given by (in somewhat more pleasant cylindrical coordinates \((\rho, \phi, z)\)):</p>
\[\begin{aligned}
\vec{\psi} &= \psi_x \hat{x} + \psi_y \hat{y} + \psi_z \hat{z} \\
&= \frac{1}{2} (\psi_x - i \psi_y)(\hat{x} + i \hat{y}) + \psi_z \hat{z} + \frac{1}{2} (\psi_x + i \psi_y) (\hat{x} - i \hat{y}) \\
&= \frac{1}{2} \psi_{+1} e^{i \phi} (\hat{\rho}+ i \hat{\phi}) + \psi_0 \hat{z} + \frac{1}{2} \psi_{-1} e^{- i \phi} (\hat{\rho} - i \hat{\phi})
\end{aligned}\]
<p>Clearly \(\hat{L}_z (\psi_{+1}, \psi_0, \psi_{-1}) = (+1 \psi_{+1}, 0 \psi_{0}, -1 \psi_{-1})\).</p>
</aside>
<p>By the way, photons are spin-1 particles, but cannot have the \(s_z = 0\) state for what I currently understand as ‘complicated technical reasons’. Roughly, it goes: because photons have no rest frame, the \(s_z = 0\) value is forbidden, as that would imply that there is a choice of \(z\) around which a photon wave function is symmetric. The remaining \(s_z = \pm 1\) states correspond to photon polarizations.</p>
<p><strong>The Schrödinger Equation</strong>: We can write \(S_t^2 - S_{\vec{x}}^2 = m^2\) as \(S_t = \sqrt{m^2 + S_x^2} = m \sqrt{1 + \frac{S_x^2}{m^2}}\) and expand as a Taylor series (note that \(\| S_x/m \| = \| p / m \| \ll 1\)) to get:</p>
\[S_t = m (1 + \frac{1}{2} \frac{S_x^2}{m^2} + O((\frac{S_x^2}{m^2})^2)) \approx m + \frac{S_x^2}{2m}\]
<p>Using our operator forms we get the free-particle Schrödinger equation:</p>
\[\hat{E} \psi \approx (m + \hat{P}^2/2m) \psi\]
<p>Interpreting, this says that the time-derivative of the action is a constant (the mass) plus a term proportional to the kinetic energy, plus higher-order terms that vanish at low momenta.</p>
<p>The initial \(m\) term is normally ignored in non-relativistic QM. It corresponds to a constant change in phase along any path (and adds a constant term to the Lagrangian), but it drops out of any calculation if you (a) only integrate over time and (b) never create/annihilate particles.</p>
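<p>The size of the dropped terms is easy to check numerically, with made-up numbers \(m = 1\), \(p = 0.05\):</p>

```python
import math

m, p = 1.0, 0.05                  # made-up values in the regime |p| << m
exact = math.sqrt(m**2 + p**2)    # the relativistic S_t
approx = m + p**2 / (2 * m)       # rest mass + Newtonian kinetic energy
print(exact - approx)             # error is about -p^4/(8 m^3) ~ -8e-7
```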
<p><strong>Schrödinger with potential</strong>: The \(V\) term in the non-relativistic Schrödinger equation ends up next to the kinetic energy term: \(\hat{E} \psi \approx (m + \hat{P}^2/2m + V) \psi\). Working backwards through the derivation, we figure that the constraint on \(S\) must be: \(S_t - V = \sqrt{m^2 + S_x^2}\). But there is no particular reason this would have a clean relativistic form, since we treat our potential non-relativistically anyway.</p>
<p>Nevertheless we can add to our interpretation: the role of a classical scalar potential \(V\) is to modify the phase change as a wave function translates in time, such that the particle acts like it has energy \(E - V\) instead of \(E\). The role of a vector potential is to modify the momentum, \(\vec{p} \mapsto \vec{p} - \vec{A}\).</p>
<p>The electromagnetic field uses the 4-potential \(q A = q (\phi, \vec{A})\). The electromagnetic wave function is something like \(\psi = e^{i/\hbar \int [(\vec{p} - q \vec{A}) \cdot d\vec{x} - (E - q \phi) dt] } = e^{i/\hbar \int (p_{\mu} - q A_{\mu}) dx^{\mu}}\).</p>
<p><strong>Covariant Derivatives</strong>:</p>
<p>Given the electromagnetic wave function of the form \(\psi = e^{i/\hbar \int (p_{\mu} - q A_{\mu}) dx^{\mu}}\), we can extract the \(p_{\mu}\) term with a more involved derivative operator, the ‘covariant derivative’ \(D_{\mu} = \p_{\mu} + i q A_{\mu}\), or equivalently, modifying the momentum operator to be \(\hat{P}_{\mu} = \hat{p}_{\mu} + \hbar q A_{\mu}\):</p>
\[- i \hbar D_{\mu} \psi = - i \hbar (\p_{\mu} + i q A_{\mu}) \psi = p_{\mu} \psi\]
<p>This derivative manages to extract the \(p_{\mu}\) term by itself by subtracting off the \(qA\) contribution.</p>
<p><strong>Gauge transformations</strong>:</p>
<p>Since physics is determined by an action integral like \(\int( p_\mu - q A_\mu )dx^\mu\), any exact form \(\Lambda = d \lambda\) (which automatically satisfies \(d \Lambda = 0\)) can be added to the integrand and will only affect the action in a path-independent way:</p>
\[\int_i^f (p_\mu - q A_\mu + \Lambda_\mu) dx^\mu = \int_i^f p_\mu dx^\mu + \lambda \vert_i^f - q \int_i^f A_\mu dx^\mu\]
<p>The covariant derivative is so called because it produces a derivative operator, and thus a momentum operator, which respects this gauge-freedom by removing any explicit dependence on the value of \(A\). In my opinion, though, this is a very roundabout way to reach the conclusion: the explicit purpose of \(\hat{P}\) is to extract the value of \(p\), which is ultimately the thing that must obey \(p_{\mu} p^{\mu} = m^2\); the specific method of removing the gauge freedom is an implementation detail.</p>
<p>This performs a gauge transform that doesn’t affect the relative amplitudes of different paths between \(i \ra f\) – only the resulting phase. As such there is no way to observe this effect in a closed system, so the addition of \(d \Lambda\) is a free variable in the theory. However, it turns out to be important when considering interacting systems, in ways that I haven’t learned yet but will be essential in QFT.</p>
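<p>The path-independence of an exact term is easy to check numerically. A sketch with a made-up \(\lambda(x, y) = x^2 y\), integrating its gradient along two different paths between the same endpoints:</p>

```python
import numpy as np

# Gauge term Lambda = d(lambda) for the made-up scalar lambda(x,y) = x^2 y,
# so Lambda = (2xy, x^2).
def Lam(x, y):
    return np.array([2 * x * y, x**2])

def line_integral(path, n=200_000):
    """Integral of Lambda . dx along path(t), t in [0, 1], midpoint rule."""
    t = np.linspace(0, 1, n + 1)
    pts = path(t)                          # shape (2, n+1)
    mid = (pts[:, 1:] + pts[:, :-1]) / 2
    d = pts[:, 1:] - pts[:, :-1]
    L = Lam(mid[0], mid[1])
    return float(np.sum(L[0] * d[0] + L[1] * d[1]))

# Two different paths from (0,0) to (1,1): the integral only sees the
# endpoints, lambda(1,1) - lambda(0,0) = 1, so it can't affect the
# relative amplitudes of paths.
straight = lambda t: np.array([t, t])
curved = lambda t: np.array([t, t**3])
print(line_integral(straight), line_integral(curved))  # both ~ 1.0
```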
<p><strong>The Lagrangian</strong>: The integral \(\Delta S = \int dS\) can be parameterized by time as</p>
\[\Delta S = \int (S_{\vec{x}} \cdot d\vec{x}/dt - S_t) dt = \int L \, dt\]
<p>\(L = dS / dt\) is the source of the (single-particle) Lagrangian, and is where the elementary form \(L = T - V\) comes from. For a free particle, \(L dt = -m d\tau\), and \(\Delta S = - \int m d \tau\). In a classical scalar potential with \(S_t = E = T + V\):</p>
\[L = S_{\vec{x}} \cdot d\vec{x}/dt - S_t = \vec{p} \cdot \vec{v} - E\]
<p>In classical mechanics often \(E = T + V\) and \(\vec{p} \cdot \vec{v} = 2 T\), giving</p>
\[L = 2 T - (T + V) = T - V\]
<p>Regardless of how we parameterize \(S = \int dS\), applying the principle of stationary action gives the classical trajectory. Feynman’s classic explanation of this is that all but the ‘stationary’ path – the choice of \(\Gamma\) such that \(\delta S / \delta \Gamma \vert_{\Gamma} = 0\) – will exhibit destructive interference in the macroscopic limit, resulting in the laws of classical physics. Quantitatively, this means that in the classical limit as \(\hbar \ra 0\), the path integral is dominated by the minimal path:</p>
\[\begin{aligned}
\lim_{\hbar \ra 0} \int d\Gamma e^{i S[\Gamma] /\hbar}
&= \lim_{\hbar \ra 0} \int d (\Delta \Gamma) e^{i S[\Gamma_{\text{min}} + \Delta \Gamma] /\hbar} \\
&\sim \lim_{\hbar \ra 0} e^{i S[\Gamma_{\text{min}}] /\hbar }
\end{aligned}\]
<p>I don’t exactly know how to make that rigorous but it makes heuristic sense: as \(\hbar \ra 0\) the function oscillates infinitely rapidly, cancelling itself out in the integral over \(\Delta \Gamma\), but at least the minimal path, where \(\delta S / \delta \Gamma = 0\), oscillates less than the rest do. We could imagine expanding \(S\) as a Taylor series \(S = S[\Gamma_{\text{min}}] + (\delta S / \delta \Gamma) \delta \Gamma + \ldots\), but I really don’t know if that’s allowed.</p>
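<p>The heuristic is easy to see numerically in a toy one-dimensional ‘path space’ with action \(S(x) = x^2\) (a made-up stand-in for \(S[\Gamma]\); everything here is a sketch): as \(\hbar\) shrinks, the contribution from a window around the stationary point \(x = 0\) dwarfs an equal-sized window away from it.</p>

```python
import numpy as np

def phase_integral(a, b, hbar, n=400_000):
    """Integral of e^{i S(x)/hbar} dx over [a, b] for S(x) = x^2,
    by the midpoint rule."""
    x = np.linspace(a, b, n + 1)
    mid = (x[1:] + x[:-1]) / 2
    return np.sum(np.exp(1j * mid**2 / hbar)) * (b - a) / n

for hbar in (1.0, 0.1, 0.01):
    near = abs(phase_integral(-1, 1, hbar))  # window containing x = 0
    far = abs(phase_integral(2, 4, hbar))    # window with no stationary point
    print(hbar, round(near, 4), round(far, 4))
# The 'near' window shrinks like sqrt(hbar) while the 'far' window
# cancels itself much faster, like hbar.
```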
<p><strong>Noether’s Theorem</strong>: Suppose there is some dynamical variable \(q\) that \(S\) depends on. Then we can locally approximate \(S(q + \Delta q, \ldots) \approx S(q, \ldots) + S_q \Delta q\), adding a phase to the wave function \(\psi \ra e^{i S_q \Delta q /\hbar} \psi\). This leaves physics unchanged if and only if \(S_q\) is a constant, such that this is a uniform global phase transformation.</p>
<p>But if \(q\) is a physical symmetry of the system, then it <em>must</em> lead to the same physics; therefore \(S_q\) is a constant throughout the system’s evolution (gauge fields notwithstanding). \(S_q\) is called the ‘Noether charge’ corresponding to the \(q\) symmetry. \(E\) is the charge associated with \(t\); \(\vec{p}\) for \(\vec{x}\), \(\vec{L}\) for \(\vec{\theta}\), etc.</p>
<hr />
<h2 id="summary">Summary</h2>
<ol>
<li>QM is easier to follow if you start from the fact that the wave function has the form \(\psi = e^{i S/\hbar}\).</li>
<li>Operators and inner products are ways to extract properties of \(S\).</li>
<li>The Schrödinger equation for a free particle is a low-energy approximation of the statement that \(\| \p S \| = m\).</li>
<li>The only free physical quantity in a wave function is the 4-vector \(\p S\), which measures which part of the variation in \(S\) is in the spatial vs timelike direction.</li>
<li>Potentials enter by modifying \(\p S\), eg \(\p S \mapsto \p S - q A\). \(\int_i^f dS = S(f) - S(i)\) may no longer hold depending on the properties of \(A\).</li>
<li>Intrinsic angular momentum is a property of what kind of object the wave function’s value is (scalar, vector, spinor, etc).</li>
</ol>
<p>Normally you have to unlearn QM to learn relativistic QM, but the relativistic version makes much more sense in the first place so why not start there?</p>
<hr />
<p>Next up, spinors.</p>
<p>Much-later edit: spinors were harder than I thought :(</p>
A possible derivation of the Born Rule?2019-12-22T00:00:00+00:00http://alexkritchevsky.com/2019/12/22/many-worlds<p>I think that the Many-Worlds Interpretation (MWI) of quantum mechanics is probably ‘correct’. There is no reason to think that the rules of atomic phenomena would stop applying at larger scales when an experimenter becomes entangled with their experiment.</p>
<p>However, MWI has the problem (shared with all the other mainstream interpretations of QM) that it does not explain why quantum randomness leads to the probabilities that we observe. The so-called <a href="https://en.wikipedia.org/wiki/Born_rule">Born Rule</a> says that if a system is in a state \(\alpha \| 0 \> + \beta \| 1 \>\), upon ‘measurement’ (in which we entangle with one or the other outcome), we measure the eigenvalue associated with the state \(\| 0 \>\) with probability</p>
\[P[0] = \| \alpha \|^2\]
<p>The Born Rule is normally included as an additional postulate in MWI, and this is somewhat unsatisfying. Or at least, it is apparently difficult to justify, given that I’ve read a bunch of attempts, each of which talks about how there haven’t been any other satisfactory attempts. I think it would be unobjectionable to say that there is not a consensus on how to motivate the Born Rule from MWI without any other assumptions.</p>
<p>Anyway here’s an argument I found that I find somewhat compelling. It argues that the Born Rule can emerge from interference if you assume that every <em>measurement</em> of a probability that you’re exposed to (which I guess is a Many-Worlds-ish idea) is assigned a random, uncorrelated phase.</p>
<!--more-->
<hr />
<h2 id="1-classical-measurements-of-probability">1. Classical measurements of probability</h2>
<p>First let’s discuss a toy example of ‘measuring a probability’ in a non-quantum experiment. Suppose we’re flipping a biased coin that gets heads with probability \(P[H] = p\) and \(P[T] = q = 1 - p\). We’ll write it in a notation suggestive of quantum mechanics: let’s call its states \(\| H \>\) and \(\| T \>\), so the results of a coin flip are written as \(p \| H \> + q \| T \>\) with \(p + q = 1\). Upon \(n\) iterations of classical coin-flipping we end up in state</p>
\[(p \| H \> + q \| T \>)^n = \sum_k {n \choose k} p^k q^{n-k} \| H^k T^{n-k} \>\]
<p>Where \(\| H^k T^{n-k} \>\) means a state in which we have observed \(k\) heads and \(n-k\) tails in any order.</p>
<p>Now suppose this whole experiment is being performed by an experimenter who’s trapped in a box or something. The experimenter does the experiment, writes down what they think the probability of heads is, and then transmits <em>only that</em> to us, outside of the box. So the only value we end up seeing is the value of their <em>measurement</em> of \(P[H] = p\), which we’ll call \(\hat{P}[H]\). The best estimate that the experimenter can give, of course, is their observed frequency \(\frac{k}{n}\), so we might say that the resulting system’s states are now identified by the probability perceived by the experimenter:</p>
\[(p \| H \> + q \| T \>)^n = \sum_k {n \choose k} p^k q^{n-k} \| \hat{P}[H] = k/n\>\]
<p>If you let \(n\) get very large, the states near where \(\hat{P}[H] = p\) will end up having the highest-magnitude amplitude, and so we expect to end up in a ‘universe’ where the measurement of the probability \(p\) converges on the true value of \(p\). This is easily seen, because for large \(n\) the binomial distribution \(B(n, p, q)\) converges to a normal distribution \(\mathcal{N}(np, npq)\) with mean \(np\). So, asymptotically, the state \(\| \hat{P}[H] = \frac{np}{n} = p \>\) becomes increasingly high-amplitude relative to all of the others. This is a way of phrasing the law of large numbers.</p>
<p>I think this is as good an explanation as any as to what probability ‘is’. Instead of trying to figure out what it means for <em>us</em> to experience an infinite number of events and observe a probability, let’s just ask an experimenter who’s locked in a box to figure it out for us, and then just have them send us their results! Unsurprisingly, the experimenter does a good job of recovering classical probability.</p>
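<p>Here’s a quick simulation of the boxed experimenter (numpy, with made-up numbers):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 10_000   # made-up true P[H], and flips per experimenter

# 500 boxed experimenters each flip the coin n times and report
# their measured frequency k/n:
reports = rng.binomial(n, p, size=500) / n
print(reports.mean(), reports.std())
# mean ~ p, and the spread ~ sqrt(pq/n) shrinks as n grows,
# which is the law of large numbers.
```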
<hr />
<h2 id="2-the-quantum-version">2. The quantum version</h2>
<p>Now let’s try it for a qubit (a ‘quantum coin’). The individual experiment runs are now given by \(\alpha \| 0 \> + \beta \| 1 \>\) where \(\alpha, \beta\) are probability amplitudes with \(\| \alpha \|^2 + \| \beta \|^2 = 1\). Note that normalizing these to sum to 1 is just for convenience and doesn’t predetermine the probabilities – if you don’t normalize now, you just have to divide through by the normalization later instead.</p>
<p>As before we have our experimenter perform \(n\) individual measurements of the qubit and report the results to us:</p>
\[(\alpha \| 0 \> + \beta \| 1 \>)^n\]
<p>Where are things going to go differently? If we imagine our experimenter as a standalone quantum system, it seems like their measurements may pick up their own phases and possibly interfere with each other. That is, a single \(\| P = \frac{k}{n} \>\) macrostate, consisting of all the different ways they could have gotten \(k\) \(\| 1 \>\)s out of \(n\) measurements, will consist of many different ‘worlds’ that may end up with different phases themselves, and there is no reason to think that they will add up neatly. I’m not totally sure this is reasonable, but it leads to an interesting result, so let’s assume it is.</p>
<p>For an example, consider the \(n=2\) case. We’ll let each \(\| 0 \>\) state have a different phase \(\alpha_j = \| \alpha \| e^{i \theta_j}\). (We can ignore the \(\| 1 \>\) phase without loss of generality by treating it as an overall coefficient to the entire wave function.)</p>
<p>The state we generate will be:</p>
\[\begin{aligned}
&(\alpha_1 \| 0 \> + \beta \| 1 \>) (\alpha_2 \| 0 \> + \beta \| 1 \>) \\
&= \alpha_1 \alpha_2 \| 0 0 \> + \alpha_1 \beta \| 0 1 \> + \beta \alpha_2 \| 1 0 \> + \beta^2 \| 1 1 \> \\
\end{aligned}\]
<p>This is no longer a clean binomial distribution. Writing \(a = \| \alpha \|\) and \(b = \| \beta \|\) for clarity, the two-iteration wave function is:</p>
\[= e^{i (\theta_1 + \theta_2) } a^2 \| 0^2 \> + ab (e^{i \theta_1} + e^{i \theta_2}) \| 0^1 1^1 \> + b^2 \| 1^2 \>\]
<p>Note that \(ab (e^{i \theta_1} + e^{i \theta_2}) \| 0^1 1^1 \>\) only has the same magnitude as \(2ab \| 0^1 1^1 \>\), the classical value, when \(\theta_1 = \theta_2\).</p>
<p>This suggests that, if the experimenter’s different experiment outcomes can randomly interfere with each other as quantum states, then the probability of their reporting \(\| 0^1 1^1 \>\) will be suppressed compared to \(\| 0^2 \>\) or \(\| 1^2 \>\).</p>
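<p>As a sanity check on the \(n = 2\) case, we can average \(\| e^{i \theta_1} + e^{i \theta_2} \|^2\) over random uncorrelated phases (a numpy sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
theta1, theta2 = rng.uniform(0, 2 * np.pi, size=(2, 500_000))

# Squared magnitude of the mixed-state coefficient (e^{i theta1} + e^{i theta2}):
# averaged over random phases it is 2, not the classical 2^2 = 4.
mixed = np.abs(np.exp(1j * theta1) + np.exp(1j * theta2)) ** 2
print(mixed.mean())  # ~ 2
```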
<hr />
<h2 id="3-random-walks-in-state-space">3. Random Walks in State Space</h2>
<p>Now we consider what this looks like as \(n \ra \infty\).</p>
<p>For a state with \(k\) \(\alpha\| 0 \>\) terms, we end up with a sum of exponentials with \(k\) phases in them:</p>
\[E_{k, n} = \sum_{J \in S_{k,n}} e^{i \sum_{j \in J} \theta_j}\]
<p>Here \(S_{k,n}\) is the set of \(k\)-element subsets of \(n\) elements. For instance if \(k=2, n=3\):</p>
\[E_{2, 3} = e^{i(\theta_1 + \theta_2)} + e^{i(\theta_2 + \theta_3)} + e^{i(\theta_1 + \theta_3)}\]
<p>Our wave function for \(n\) iterations of the experiment is given by</p>
\[\psi = \sum_k a^k b^{n-k} E_{k, n} \| 0^k 1^{n-k} \> = \sum_k a^k b^{n-k} E_{k, n} \| \hat{P}[0] = \frac{k}{n} \>\]
<p>The classical version of this is a binomial distribution because \(E_{k, n}\) is replaced with \({n \choose k}\). The quantum version observes some cancellation. We want to know: as \(n \ra \infty\), what value of \(k\) dominates?</p>
<p>We don’t know anything about the phases themselves, so we’ll treat them as classical independent random variables (which turns out to be the key assumption here). This means that \(\bb{E}[e^{i \theta}] = 0\) and therefore \(\bb{E}[E_{k, n}] = 0\) for all \(k\). But the expected <em>magnitude</em> is not 0. The sum of all of these random vectors forms a random walk in the complex plane, and the expected squared magnitude of a random walk is <a href="http://mathworld.wolfram.com/RandomWalk2-Dimensional.html">given</a> by \(\bb{E}[ \| E_{1, n} \|^2 ] = n\).</p>
<p>Brief derivation: this comes from the fact that</p>
\[\begin{aligned}
\bb{E}[ \| E_{1, n} \|^2 ] &= \bb{E} \big[ \big( \sum_i e^{- i \theta_i} \big) \big( \sum_j e^{i \theta_j} \big) \big] \\
&= \bb{E} \sum_i \| e^{i \theta_i} \|^2 + \bb{E} \sum_{i \neq j} e^{- i \theta_i} e^{i \theta_j} \\
&= n \bb{E}[1] + \sum_{i \neq j} \bb{E}[e^{i (\theta_i - \theta_j)}] \\
&= n
\end{aligned}\]
<p>This means that the magnitude of the \(k=1\) term for our quantum coin is proportional to \(\sqrt{n}\), rather than the classical value of \(n\).</p>
<p>For \(k > 1\), the same argument applies (it’s still basically a random walk), except that there are \({ n \choose k }\) terms in the sum, so in every case we get an expected squared magnitude \(\bb{E} [ \| E_{k, n} \|^2 ] = { n \choose k }\). This makes the resulting experimenter wave function look like:</p>
\[\begin{aligned}
(e^{\hat{\theta} i} \alpha \| 0 \> + \beta \| 1 \>)^n
&\sim \sum_{k =0}^n \sqrt{ n \choose k } a^k b^{n-k} \| 0^k 1^{n-k} \text{ in some order }\> \\
&\sim \sum_{k =0}^n \sqrt{ n \choose k } a^k b^{n-k} \| \hat{P}[ 0 ] = k/n \>
\end{aligned}\]
<p>(This is not an equality because it still depends on a classical random variable \(\hat{\theta}\). But it produces the correct expected magnitudes for each term, which is what we care about.)</p>
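<p>The claim \(\bb{E}[\| E_{k,n} \|^2] = {n \choose k}\) can be checked by brute force for small \(n\) (a Monte Carlo sketch; the helper name is made up):</p>

```python
import cmath
import itertools
import math
import random

random.seed(3)

def E_kn(thetas, k):
    """Sum over all k-element subsets J of exp(i * sum of the thetas in J)."""
    return sum(cmath.exp(1j * sum(thetas[j] for j in J))
               for J in itertools.combinations(range(len(thetas)), k))

n, k, trials = 10, 3, 5000
total = 0.0
for _ in range(trials):
    thetas = [random.uniform(0, 2 * math.pi) for _ in range(n)]
    total += abs(E_kn(thetas, k)) ** 2
print(total / trials, math.comb(n, k))  # both ~ 120
```

<p>The cross terms between different subsets always contain at least one unmatched phase, so they average to zero, leaving one unit of squared magnitude per subset.</p>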
<hr />
<h2 id="4-the-born-rule">4. The Born Rule</h2>
<p>After running \(n\) experiments in their box, our experimenter tells us a number, their perceived value of \(P[\| 0 \>]\). As \(n \ra \infty\) the highest-amplitude state will dominate. For that, we only need to compute the value of \(k\) at the peak amplitude, and we can find that using \(\| \psi \|^2\), which is easy to work with:</p>
\[\| \psi \|^2 \sim \sum {n \choose k} (a^2)^k (b^2)^{n-k}\]
<p>This is a binomial distribution \(B(n, a^2, b^2) = B(n, \|\alpha\|^2, \| \beta \|^2)\), which asymptotically looks like a normal distribution \(\mathcal{N}(n \| \alpha \|^2, n \| \alpha \|^2 \| \beta \|^2)\) with maximum \(k = n \| \alpha \|^2\), which means that the highest-amplitude state is:</p>
\[\begin{aligned}
\| \hat{P}[0]= \frac{n \| \alpha \|^2}{n} \> = \| \hat{P}[0] = \| \alpha \|^2 \>
\end{aligned}\]
<p>Thus we conclude that the observed probability of measuring \(\| 0 \>\) when interacting with a system in state \(\alpha \| 0 \> + \beta \| 1 \>\) is centered around \(\| \alpha \|^2\), as reported by an experimenter in a box who runs the measurement many times and reports their measurement of the probability afterwards. And that’s the Born Rule.</p>
<p>Ultimately this follows from postulating that many different ways of seeing the same result interfere with each other, suppressing the amplitudes of seeing less uniform results by a factor of the square root of their multiplicity.</p>
<p>So that’s interesting. I find the argument to be suspiciously clean, and therefore compelling.</p>
<p>As far as I can tell it also works in generalizations of the same setup:</p>
<ul>
<li>to distributions with more than two possible values.</li>
<li>to ‘nested’ experiments, where you find out the value of a measurement from multiple measurers who each got it from multiple experimenters. In this case all of the measurers are able to interfere with each other, from your perspective, so it gets flattened out to a single level of interference.</li>
<li>if the amplitudes aren’t normalized to begin with. If \(\|\alpha \|^2 + \| \beta \|^2 \neq 1\) the resulting asymptotic normal distribution will just end up having mean \(\frac{n \| \alpha \|^2}{\| \alpha \|^2 + \| \beta \|^2}\).</li>
</ul>
<p>I’m not sure I’ve correctly identified what might actually lead to the random interference in this experiment. Is it the experimental apparatus interfering with itself? Is it hidden degrees of freedom in the experiment itself? Or maybe it’s all of reality, from the point of view of an observer trying to make sense of all historical evidence for the Born Rule. And it’s unclear to me how carefully isolated an experiment would have to be for different orderings of its results to interfere with each other. Presumably the answer is “a lot”, but what if it isn’t?</p>
<p>If this is actually how nature works, I wonder if it’s detectable somehow. What if you could isolate a particular experiment so much that you could suppress the interference of histories? Can you get the probabilities to become proportional to \(\| \alpha \|\)? Or maybe there is some measurable difference between the distribution of probabilities resulting from a random walk, compared to the normal distribution in classical probability? After all, a “squared normal distribution” seems like it would fall off faster than a regular one.</p>
<p>Suffice to say I would love to know a) what’s wrong with this argument, or b) if it exists or has been debunked in the literature somewhere, cause I haven’t found anything (although admittedly I didn’t look very hard).</p>
Fourier Transforms via magic2019-11-26T00:00:00+00:00http://alexkritchevsky.com/2019/11/26/magic-derivatives\[\gdef\F#1{\mathcal{F}[#1]}\]
<p>A while ago I found a series of papers which do some weird stuff with derivative operators:</p>
<ol>
<li><a href="https://arxiv.org/abs/1404.0747">New Dirac Delta function based methods with applications to perturbative expansions in quantum field theory</a> by Kempf/Jackson/Morales, 2014</li>
<li><a href="https://arxiv.org/abs/1507.04348">How to (Path-) Integrate by Differentiating</a> also by Kempf/Jackson/Morales, 2015</li>
<li><a href="https://arxiv.org/abs/1610.09702">Integration by differentiation: new proofs, methods and examples</a> by Jia/Tang/Kempf, 2016</li>
</ol>
<p>The general theme is: evaluating functions on derivative operators \(f(\p)\), and applying this to delta functions \(f(\p_x) \delta(x)\), is occasionally useful and can give weird alternate characterizations of the Fourier transform and can be used to efficiently solve integrals.</p>
<p>The authors are physicists, unsurprisingly, and I’m sure there are a bunch of reasons why these results are either not that surprising or surprising-yet-not-useful, but I found them remarkable. But the whole thing is confusing and hard to make sense of. Here’s a… totally different take, in which I rederive the main result by poking around.</p>
<p>tldr: the Fourier transform of \(f(x, \p_x)\) is \(f(i \p_k, -ik) 2 \pi \delta(k)\), whatever that means.</p>
<!--more-->
<hr />
<h2 id="1">1.</h2>
<p>First let’s fix a Fourier transform convention:</p>
\[\hat{f}(k) = \F{f(x)} = \int f(x) e^{- ik x} dx\]
\[f(x) = \mathcal{F}^{-1}[\hat{f}(k)] = \frac{1}{2 \pi} \int \hat{f}(k) e^{ ik x} dk\]
<p>I prefer not to use the ones that put \(2 \pi\) in the exponent because it’s distracting.</p>
<p>Here are a few common Fourier transform formulas in this convention, for reference:</p>
\[\begin{aligned}
\F{1} &= 2 \pi \delta(k)\\
\F{\delta(x)} &= 1 \\
\F{\p_x^n f(x)} &= (ik)^n \hat{f}(k) \\
\F{x^n f(x)} &= (i \p_k)^n \hat{f}(k)
\end{aligned}\]
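<p>The last two rules can be spot-checked numerically, say for \(f(x) = e^{-x^2}\), by approximating the transform with a plain Riemann sum (a sketch; the grid and the test value \(k = 1.3\) are arbitrary choices):</p>

```python
import numpy as np

x = np.linspace(-15, 15, 60001)
dx = x[1] - x[0]
f = np.exp(-x**2)

def ft(g, k):
    # Riemann-sum approximation of the transform integral; accurate
    # for smooth, rapidly-decaying integrands like this one
    return np.sum(g * np.exp(-1j * k * x)) * dx

k, dk = 1.3, 1e-5
# F[d/dx f](k) = ik fhat(k)
assert abs(ft(np.gradient(f, x), k) - 1j * k * ft(f, k)) < 1e-5
# F[x f](k) = i d/dk fhat(k), with the k-derivative as a central difference
dfhat_dk = (ft(f, k + dk) - ft(f, k - dk)) / (2 * dk)
assert abs(ft(x * f, k) - 1j * dfhat_dk) < 1e-5
```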
<hr />
<h2 id="2">2</h2>
<p>It is common to take Fourier transforms of operators acting on functions, like \(\F{\p_x f(x)} = ik \hat{f}(k)\), in order to solve differential equations. This can be computed using integration by parts inside the transform:</p>
\[\F{\p_x f(x)} = \int \p_x f(x) e^{-ikx} dx = - \int f(x) \p_x e^{-ikx} dx = (ik) \F{f(x)}\]
<p>It seems plausible to use the same argument to Fourier-transform a “freestanding” derivative operator, like \(\p_x\):</p>
\[\F{\p_x} = \int \p_x e^{-ikx} dx = (-ik) \F{1} = (-ik) 2 \pi \delta(k)\]
<p>I find this compelling, because it seems to work. Note that the minus sign is related to integration by parts. We might rewrite \(\p_x f(x)\) as an operator \(- f(x) \p_x\). These are different in general, but under an integral where the boundary vanishes they are the same, which is an assumption we’ll make liberally.</p>
<p>We can also find \(\F{x}\) by rewriting it as a derivative in \(k\):</p>
\[\F{x} = \int x e^{-ikx} dx = \int (i \p_k) e^{-ikx} dx = (i \p_k) \F{1} = 2 \pi i \delta'(k)\]
<p>Armed with these, we may transform any function in \(\p_x\) or \(x\) which has a Taylor series:</p>
\[\begin{aligned} \F{f(\p_x)} &= \int f(\p_x) e^{-ikx} dx = \int f(-ik) e^{-ikx} dx = f(-ik) (2 \pi \delta(k)) \\
\F{f(x)} &= \int f(x) e^{-ikx} dx = \int f(i \p_k) e^{-ikx} dx = f(i \p_k) (2 \pi \delta(k)) \end{aligned}\]
<p>Or even both at once, as long as we are careful with what all of the operators act on:</p>
\[\F{f(x, \p_x)} = f(i \p_k, -ik) 2 \pi \delta(k)\]
<p>In this expression, the \(\p_x\) and \(\p_k\)s are acting to the right, <em>not</em> on internal members of the expression. If they act on an internal member they pick up a minus sign, like we saw above:</p>
\[\F{\p_x f(x)} = \F{- f(x) \p_x} = (ik) f(i \p_k) \, 2 \pi \delta(k) = (ik) \hat{f}(k)\]
<hr />
<p>All of this mostly seems to work if we allow negative powers of \(x\) and \(\p_x\) also, but there is some funny business around integration bounds.</p>
\[\begin{aligned}
\F{1/x} &= \int \frac{1}{x} e^{-ikx} dx \\
&= \int (i \p_k)^{-1} e^{-ikx} dx \\
&= (i \p_k)^{-1} \int e^{-ikx} dx \\
&= (i \p_k)^{-1} 2 \pi \delta(k) \\
&= -2 \pi i (\theta(k) + c)
\end{aligned}\]
<p>What value of \(c\) should be used? To get the same value as Wikipedia’s table of Fourier transforms it should be \(c = -\frac{1}{2}\). This makes \(\theta(k) - \frac{1}{2} = \frac{1}{2} \sgn(k)\) an odd function, which amounts to the principal-value choice for \(\frac{1}{x}\) at the singularity, matching the oddness of \(\frac{1}{x}\) itself. This seems somewhat arbitrary, and I suspect that one could get away with just not choosing at all. If we do use \(c = -\frac{1}{2}\), we get:</p>
\[\F{1/x} = -2 \pi i (\theta(k) - \frac{1}{2}) = - i \pi \sgn(k)\]
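<p>The value \(- i \pi \sgn(k)\) can be sanity-checked at \(k = 1\): the cosine part of \(\int \frac{e^{-ix}}{x} dx\) vanishes by oddness (as a principal value), leaving \(-i \int \frac{\sin x}{x} dx = -i \pi\). The remaining real integral is easy to confirm numerically, truncating at \(\pm 200\):</p>

```python
import numpy as np

# np.sinc(u) = sin(pi u)/(pi u), so np.sinc(x/pi) = sin(x)/x
x = np.linspace(-200, 200, 400001)
dx = x[1] - x[0]
val = np.sum(np.sinc(x / np.pi)) * dx
# truncating the integral leaves an oscillatory tail of size O(1/200)
assert abs(val - np.pi) < 0.05
```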
<p>The other direction is simpler:</p>
\[\F{\p_x^{-1}} = \int \p_x^{-1} e^{-ikx} dx = \frac{1}{-ik} \F{1} = -\frac{1}{ik} 2 \pi \delta(k)\]
<p>In summary we have a rough hand-waving method for – well, maybe not for rigorously deriving Fourier transforms, but at least for guessing at them – for operators which can be written as Laurent series (Taylor series with negative powers). Just swap \(f(x, \p_x) \ra f(i \p_k, - ik)\).</p>
<p>In a sense this is a quarter rotation in the \((x, \p_x)\) plane, followed by multiplying by \(i\) and relabeling \(x \ra k\). That is:</p>
\[\begin{pmatrix} k \\ \p_k \end{pmatrix} = i R \begin{pmatrix} x \\ \p_x \end{pmatrix}\]
<p>I don’t know what it means to rotate in the \((x, \p_x)\) plane, but it turns out that you can do <a href="https://en.m.wikipedia.org/wiki/Linear_canonical_transformation">other</a> linear transformations in this plane as well – fractional rotations, or arbitrary matrices. Incidentally, the Laplace transform is \((t, \p_t) \ra (-\p_s, -s)\), although the two-sided transform is better behaved than the more common one-sided transform. The one-sided version produces a bunch of integration bounds \(f(0)\) and such in the process, which is useful because it’s used for signals that turn ‘on’ at \(t=0\), but not too helpful for understanding as a rotation.</p>
<hr />
<h2 id="3-an-integration-technique">3. An integration technique</h2>
<p>These are the main results of the papers mentioned above, I guess because papers have to justify their existence.</p>
<p>Recall that the integral of a function over the real line is equal to its Fourier transform evaluated at \(0\):</p>
\[\int_{-\infty}^{\infty} g(x) dx = \hat{g}(0)\]
<p>Using our form of \(\hat{g}\) this is:</p>
\[\hat{g}(0) = 2 \pi g(i \p_k) \delta(k) \vert_{k = 0}\]
<p>This is readily computable:</p>
\[\begin{aligned}
\int_{-\infty}^{\infty} \frac{\sin x}{x} dx &= 2 \pi \frac{e^{-\p_x} - e^{\p_x}}{2 i} \frac{1}{i \p_x} \delta(x) \vert_{x =0} \\
&= \pi [e^{\p_x} - e^{- \p_x}] \theta(x) \vert_{x = 0} \\
&= \pi [ \theta(x + 1) + c - \theta(x-1) - c]_{x = 0} \\
&= \pi [ \theta(x + 1) - \theta(x-1) ]_{x = 0} \\
&= \pi
\end{aligned}\]
<p>This is so much easier and more <em>algebraic</em> than doing a limit and taking the principal value or whatever you normally have to do. As a bonus you get to see the Fourier transform as an intermediate step (it’s \(\pi [ \theta(x + 1) - \theta(x-1)]\)).</p>
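<p>The final steps, where \(e^{\pm \p_x}\) act as shift operators on \(\theta(x)\), can be mimicked on a discrete grid (a sketch; the shift is implemented by resampling):</p>

```python
import numpy as np

x = np.linspace(-5, 5, 10001)
theta = (x > 0).astype(float)

def shift(g, a):
    # e^{a d/dx} g(x) = g(x + a), implemented by interpolating on the grid
    return np.interp(x + a, x, g)

result = np.pi * (shift(theta, 1) - shift(theta, -1))   # pi [theta(x+1) - theta(x-1)]
i0 = np.argmin(np.abs(x))                               # index of x = 0
assert abs(result[i0] - np.pi) < 1e-12
```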
<p>There are also versions for integrating over finite intervals, doing Laplace transforms, and a bunch of other stuff. I’ll probably write more about them later. There are more tricks – solving bounded integrals, for instance, amounts to evaluating \(\int (\theta(x - a) - \theta(x - b)) f(x) dx\), and using the fact that we know the Fourier transform of \(\theta(x)\). Although it is messy: \(\sgn(x)\) has a clean transform, \(\F{\sgn(x)} = \frac{2}{ik}\). Then you solve for \(\theta(x) = \frac{1}{2}(\sgn(x) + 1)\).</p>
<p>Anyway, I wanted to write this up so I don’t forget about it or how to understand it. Hope it’s useful to somebody else.</p>
A brief note about derivatives2019-11-09T00:00:00+00:00http://alexkritchevsky.com/2019/11/09/derivatives<p>A <a href="https://xorshammer.com">blog post</a> led me to a <a href="https://arxiv.org/pdf/1801.09553.pdf">paper</a>, “Extending the Algebraic Manipulability of Differentials”, which makes a useful point about the notation we use for derivatives. This is a brief summary so I don’t forget it.</p>
<p>Observation: the derivative operator \(\frac{d}{dx}\) can be decomposed into two steps: applying the differential operator \(d\) to the target, <em>then</em> dividing by \(dx\). It is useful to think of this as occurring in two separate steps because it removes ambiguity in certain notations and allows algebraic manipulations like \(\frac{dy}{dx} \frac{dx}{dt} = \frac{dy}{dt}\) to work on higher derivatives.</p>
<p>Being precise about what \(d\) acts on, we compute the expansion of \(\frac{d^2 y}{dx^2}\):</p>
\[\frac{d^2 y}{dx^2} = \frac{d( y_x dx )}{dx^2} = \frac{ y_{xx} dx^2 + y_x d^2 x}{dx^2} = y_{xx} + y_x \frac{d^2 x}{dx^2} \tag{1}\]
<!--more-->
<p>\(\frac{d^2 x}{dx^2}\) isn’t \(0\) here. It’s just the division of two objects, \(d^2 x = d(d(x))\) and \((dx)^2\). Keeping it around allows chain rule-like algebraic manipulations to actually work on higher derivatives:</p>
\[\frac{d^2 y}{dt^2} = \frac{d^2 y}{dx^2} \frac{dx^2}{dt^2} = y_{xx} \frac{dx^2}{dt^2} + y_x \frac{d^2 x }{dt^2}\]
<p>Seems useful. You can also just expand the numerator as \(d^2 y = d(dy) = d (y_x dx) = y_{xx} dx^2 + y_x d^2 x\) to get the same result.</p>
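<p>This manipulation can be spot-checked with sympy for a concrete, arbitrarily chosen pair \(y = x^3\), \(x = \sin t\):</p>

```python
import sympy as sp

# check d^2y/dt^2 = y_xx (dx/dt)^2 + y_x (d^2x/dt^2) for y = x^3, x = sin t
t, u = sp.symbols('t u')
x = sp.sin(t)
y_of_u = u**3
y = y_of_u.subs(u, x)
y_x = sp.diff(y_of_u, u).subs(u, x)
y_xx = sp.diff(y_of_u, u, 2).subs(u, x)
lhs = sp.diff(y, t, 2)
rhs = y_xx * sp.diff(x, t)**2 + y_x * sp.diff(x, t, 2)
assert sp.simplify(lhs - rhs) == 0
```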
<p>Compare (1) to \((\frac{d}{dx})^2 y\), which is the ‘complete’ second derivative, applying the derivative operator to the whole thing twice:</p>
\[\begin{aligned}
(\frac{d}{dx})^2 y &= \frac{d(\frac{dy}{dx})}{dx} \\
&= \frac{\frac{d^2y}{dx} - dy \frac{d^2 x}{dx^2}}{dx} \\
&= \frac{d^2 y}{dx^2} - y_x \frac{d^2 x}{dx^2} \\
&= [y_{xx} + y_x \frac{d^2 x}{dx^2}] - y_x \frac{d^2 x}{dx^2} \\
&= y_{xx}
\end{aligned}\]
<p>Distinguishing between \((\frac{d}{dx})^2 y\) and \(\frac{d^2 y}{dx^2}\) keeps you out of trouble here.</p>
<p>The paper also uses this notation to show a succinct derivation of a formula for inverting second derivatives (in slightly different notation):</p>
\[\frac{d^2 y}{dx^2} = - \frac{d^2 x}{dy^2} \big( \frac{dy}{dx} \big)^3\]
<p>The authors say that they and their reviewers initially thought this might have been a new discovery. In fact it can be found on <a href="https://en.wikipedia.org/wiki/Inverse_functions_and_differentiation#Higher_derivatives">Wikipedia</a>, but it’s definitely not that well-known! (Edit: they corrected themselves in a later version of the paper, I think, as it doesn’t seem to say that anymore.)</p>
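<p>The inversion formula is likewise easy to verify symbolically for a concrete invertible pair, here \(y = e^x\), \(x = \log y\):</p>

```python
import sympy as sp

# check d^2y/dx^2 = -(d^2x/dy^2) (dy/dx)^3 for y = exp(x)
x, y = sp.symbols('x y', positive=True)
f = sp.exp(x)    # y as a function of x
g = sp.log(y)    # the inverse: x as a function of y
lhs = sp.diff(f, x, 2)
rhs = -sp.diff(g, y, 2).subs(y, f) * sp.diff(f, x)**3
assert sp.simplify(lhs - rhs) == 0
```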
Aphantasia (and how not to do math in your head)2019-09-15T00:00:00+00:00http://alexkritchevsky.com/2019/09/15/mental-math<p>Most of our descriptions of how our brains work are fundamentally <em>vague</em>. We speak of our brains performing verbs like “think”, “realize”, “forget”, or “hope” but we aren’t talking about what’s going on mechanically to result in those qualities.</p>
<p>Sure, these can all be assigned truth values, in the sense that if everyone generally agrees that someone ‘realized’ something, we might define their brain to have performed the objective act of ‘realization’. But this gives no <em>technical</em> understanding of what the process of realization is – beyond, perhaps, some hand-wavey story about connections being bridged between neurons.</p>
<p>So, sometime in the last few years the English-speaking Internet <a href="https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=aphantasia">became aware</a> of the condition called <a href="https://en.wikipedia.org/wiki/Aphantasia">aphantasia</a>. Aphantasia is when a person is unable to picture images in their thoughts – they don’t have a “mind’s eye” at all.</p>
<p>This is interesting because, in contrast to the above, aphantasia is a concrete description of how the brain works. Some people see an image in their head when they draw or recall something; others don’t. Their brains work in materially different ways. I would have no idea how to figure out if two people “realize” something via different mechanisms, but I can be sure that two people’s brains operate differently, if one sees pictures and the others don’t.</p>
<!--more-->
<p>Knowing about aphantasia is fun because lots of people who can see images are surprised that anyone can’t, and lots of people who can’t are flabbergasted that anyone can – they thought that talking about “picturing a person” was always just a figure of speech. It’s amusing when it comes up in conversation because you tend to get a mix of surprised people on both sides.</p>
<p>I happen to be aphantasiac (if that’s the word for it). I guess I wish I wasn’t; it sounds fun and useful to be able to picture things, and it always feels like I’m missing out on some aspect of human experience by not getting to. (Although I do get glimmers of images when half asleep, I think, and I seem to remember it happening at other times also when I was younger.)</p>
<hr />
<h2 id="mental-math">Mental Math</h2>
<p>Here’s another variation in how human thought works that can be described in concrete terms: how we perform mental math.</p>
<p>First, an example. Try to multiply these two numbers in your head. Don’t look at them while you do it– read the problem, close your eyes, and multiply in your head:</p>
\[45 \times 27\]
<p>This paragraph is going to be random filler text so that the answer isn’t right below the question. If you’re reading it, start thinking about <em>how</em> you do the multiplication in your head. Some large percentage of people will just not do it because they don’t care what a blog post asks them to do. Another large percentage won’t do it because they have a strong nervous aversion to doing math of any sort (these are usually the people who calculate tips on their phone). Another category will do it and not have any idea if the answer they got is correct… – okay, that’s probably enough filler. Moving on.</p>
<p>What’d you get? Here’s the answer, written out in English words so you don’t scan to it automatically: one-thousand, two-hundred and fifteen. Did you get it right? Now try to describe <em>how</em> you computed it. If you can see images in your head, do you imagine doing it on paper, by the grade-school algorithm of stacking the numbers up and multiplying each term? You know, <a href="https://en.wikipedia.org/wiki/Multiplication_algorithm#Long_multiplication">this one</a>. If you can’t, what do you do instead?</p>
<p>I have asked a lot of people to do things like this in person (probably… 40 or so? okay, not that many). It’s anything but a rigorous study, but I’ll tell you what I’ve seen. See if it matches your experience.</p>
<p>First, how you do math in your head has a strong dependence on whether you are aphantasiac. People who can see images (‘phantasiacs’?) overwhelmingly tend to picture performing the grade-school algorithm. I have only encountered a couple of people who can easily see images in their heads but don’t do math that way.</p>
<p>People who can’t see images obviously can’t picture doing the equation on paper. So how do they do it? There’s a bit more variation here, and it tends to be harder to explain.</p>
<p>I can tell you how I do it: I use my ‘verbal’ brain as something like short-term memory. Verbal memory seems to serve as a small amount of ‘scratch space’, so if I say something in my mind it bounces around for a bit and I can summon it back after thinking about something else. It’s similar to repeating something someone said while you weren’t paying attention; it floats in the background for a few seconds and you can summon it back during that time. There is also some amount of pattern-matching going on, where certain numbers look familiar and comfortable, like the \(2 \times 45\) in the above equation, which immediately feels familiar.</p>
<p>So the process of doing the multiplication above goes something like this in my internal dialogue<sup id="fnref:dialogue" role="doc-noteref"><a href="#fn:dialogue" class="footnote" rel="footnote">1</a></sup>:</p>
<blockquote>
<p>45 times 27… let’s see… <br />
[subconscious realization that it’s going to be easier to multiply \(45(20 + 7)\) than \((40 + 5)(27)\) because \(2 \times 45\) looks ‘pleasant’]<br />
45 times 2 is 90, moved over one, so that’s 900<br />
…7 times 45 is… let’s see..<br />
7 times 4 is 28, so that’s 280<br />
7 and 5 is 35, so those give<br />
[verbally remembering the 280, and recognizing 28 + 3 = 31] … 315<br />
[now I can still hear the 900 from a second ago too, so grab that back]<br />
and so 900 plus 315 is …<br />
[9 and 3 summons 12, and the rest feels like it can be copied over]<br />
…1215</p>
</blockquote>
<p>The text here is slightly subvocalized, and I can feel how it invokes the muscles that would do the speaking if I said it out loud. I guess I compute math by talking to myself. The realizations in []s are things that happen automatically, seemingly through instant pattern recognition – I’ve done a lot of math in my life, and a lot of math in my head, and I guess my brain comfortably knows that 2 times 45 is a comfortable calculation that won’t take a lot of effort. I could do 4 times 27 if I needed to, but my brain seems to prefer smaller numbers.</p>
<p>I’m curious to hear if there are any other methods, besides ‘verbal’ and ‘visual’, and if they work well. Or if there are any other notable counterexamples, of visual-math people who are very accurate or quick. I’m not even sure what I would search to find this, because it isn’t talked about that much!</p>
<hr />
<p>Why does this matter?</p>
<p>Well, the other thing I’ve noticed while asking around about this:</p>
<p>There is a <em>huge</em> correlation between doing math in your head via the visual algorithm and considering yourself ‘not that good at math’.</p>
<p>In fact there is a correlation (in my entirely unscientific survey over the years) between doing math visually and:</p>
<ul>
<li>resisting doing the problem at all</li>
<li>being visibly anxious that you were asked to</li>
<li>not liking math, and especially not liking math in college</li>
<li>getting the wrong answer</li>
<li>not being confident that you got the right answer</li>
</ul>
<p>Although there are notable exceptions. I’ve met math majors who excel at some types of math, but are terrible at numeric calculations, and do math via the visual algorithm. This makes sense; the skills involved in logical proofs are quite different from those involved in raw calculation. And I’ve found one person, if I recall correctly, who does math visually and considers themself very good at it. But it’s <em>rare</em>.</p>
<p>I don’t think any of this is because “visual people are more likely to be bad at math”. I suspect it’s because doing it visually is just a crappy method, and so if you learned it early on in your math education, mental math was always hard, and everything else followed from there. I got lucky by learning to do math a different way early on and so math was always easy for me. Although, it’s hard to be sure whether math was easy for me because I learned to do it in a good way, or if I do it in a good way because math was always going to be easy for me.</p>
<p>And I’m not <em>that</em> good at mental math, not compared to people who are into that kind of thing. But I’m decent, and clearly have enjoyed math enough in my life to, like, blog about it. I’d be curious to hear from someone who has done competition-level mental math about what methods they use, and if they do it visually. I bet I know what the answer will be.</p>
<hr />
<p>These are all my unfounded hypotheses, of course. But: let’s speculate. Why would doing math visually be less effective than doing it through some other system?</p>
<p>First, I wonder (of course I haven’t experienced this!) if maybe the visual brain is less good at error-correction. Perhaps when you see images you capture the ‘gestalt’ of them, rather than careful details, and so if you look away and look back at a pile of numbers you see… another pile of numbers, but not the same one necessarily. It would be easy to confuse digits without realizing it, so you would make mistakes and maybe be unconfident in your result.</p>
<p>For comparison my step of ‘resummoning a term from a few seconds ago’ feels <em>accurate</em>. I am quite sure I remembered the right value, like how I would be sure that I had correctly heard something you said.</p>
<p>Second, I suspect that using the visual algorithm restricts you to doing the steps in a certain order. When I start a problem there’s a planning phase where I figure out where to start: which of these numbers will distribute better? Which makes fewer terms? If you only know how to do right-to-left, second number under the first, then there’s no planning, so there’s no chance to make the problem easier.</p>
<p>More problematically, that means that you might not have a step of invoking <em>intuition</em> at all, since the algorithm is rote and formulaic. I suspect that what it means to be ‘good at math’ in general is to have a lot of intuition about it – and if your algorithm since a young age has been formulaic, there would be no opportunities to produce intuition, so you might not have trained that ‘muscle’ at all.</p>
<p>Suffice to say, I suspect that the American trend of deferring menial computation to a calculator is a <em>really terrible</em> practice, and I suspect that a way of teaching people to do mental math via a non-visual algorithm would be far more valuable in math education than just teaching mental math in general. Maybe by learning to multiply in their heads <em>before</em> learning the ‘long multiplication’ algorithm, if that’s possible.</p>
<hr />
<h2 id="so-what">So what?</h2>
<p>It is strange that all it takes to realize that aphantasia is a real thing is to ask people about it – and yet there are all of these people who recently learned about it, myself included, who had just… never asked. What other qualities of our thinking are concretely describable, but haven’t been concretely described, just because we never think to ask?</p>
<p>This feels important. For instance, I happen to think that knowing about aphantasia is important to understanding why some people are good at mental math. But if you didn’t know about aphantasia you would never guess that a skill like mental math might be governed by a variable like that. You might instead chalk the variation up to … lack of discipline, or general unintelligence, or bad education, or something.</p>
<p>Since aphantasia is easily missed, this leads to the question: what other variations in our brain’s workings are actually due to some discrete difference in underlying machinery? If someone is, say, depressive, is that because of a complicated combination of variables, or because they have a single binary switch that’s in a different state?</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:dialogue" role="doc-endnote">
<p>which is not literally a dialogue, but feels like using the same part of my brain as speech. Probably there are people who don’t have this also, and there’s probably an obscure term for it. <a href="#fnref:dialogue" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Exterior Algebra Notes #4: The Interior Product2019-01-27T00:00:00+00:00http://alexkritchevsky.com/2019/01/27/interior-product<p><em>Vector spaces are assumed to be finite-dimensional and over \(\bb{R}\). The grade of a multivector \(\alpha\) will be written \(\| \alpha \|\), while its magnitude will be written \(\Vert \alpha \Vert\). Bold letters like \(\b{u}\) will refer to (grade-1) vectors, while Greek letters like \(\alpha\) refer to arbitrary multivectors with grade \(\| \alpha \|\).</em></p>
<p>More notes on exterior algebra. This time, the interior product \(\alpha \cdot \beta\), with a lot more concrete intuition than you’ll see anywhere else, but still not enough.</p>
<p>I am not the only person who has had <a href="https://mathoverflow.net/questions/102917/urge-reason-for-inventing-interior-product-grassmann-algebra">trouble</a> figuring out what the interior product is for. This is what I have so far…</p>
<!--more-->
<hr />
<h2 id="1-the-interior-product">1. The Interior Product</h2>
<p>The last main tool of exterior algebra is the <em>interior product</em>, written \(\alpha \cdot \beta\) or \(\iota_{\alpha} \beta\). It subtracts grades (\(\| \alpha \cdot \beta \| = \| \beta \| - \| \alpha \|\)) and, conceptually, does something akin to ‘dividing \(\alpha\) out of \(\beta\)’. It’s also called the ‘contraction’ or ‘insertion’ operator. We use the same symbol as the inner product because we think of it as a generalization of the inner product: when \(\| \alpha \| = \| \beta \|\), then \(\alpha \cdot \beta = \beta \cdot \alpha = \< \alpha, \beta \>\).</p>
<p>Its abstract definition is that it is adjoint to the wedge product with respect to the inner product:</p>
\[\boxed{\< \gamma \^ \alpha, \beta \> = \< \alpha, \gamma \cdot \beta \>}
\tag{1}\]
<p>In practice this means that it sort of ‘undoes’ wedge products, as we will see.</p>
<p>When we looked at the inner product we had a procedure for computing \(\< \b{a \^ b}, \b{c \^ d} \>\). We switched from the \(\^\) inner product to the \(\o\) inner product, by writing both sides as tensor products, with the right side antisymmetrized using \(\text{Alt}\):<sup id="fnref:alt" role="doc-noteref"><a href="#fn:alt" class="footnote" rel="footnote">1</a></sup></p>
\[\begin{aligned}
\< \b{a \^ b}, \b{c \^ d} \> &= \< \b{a \o b}, \text{Alt}(\b{c \o d}) \> \\
&= \< \b{a \o b}, \b{c \o d - d \o c} \>_{\o} \\
&= \< \b{a, c} \>_V \< \b{b, d} \>_V - \<\b{a, d} \>_V \< \b{b, c} \>_V \\
&= (\b{a \cdot c}) (\b{b \cdot d}) - (\b{a \cdot d})( \b{b \cdot c})
\end{aligned}\]
<p>Interior products directly generalize inner products to cases where the left side has a lower grade<sup id="fnref:rhs" role="doc-noteref"><a href="#fn:rhs" class="footnote" rel="footnote">2</a></sup>, (which is why we use \(\cdot\) for both), and can be computed with the exact same procedure:</p>
\[\begin{aligned}
\b{a} \cdot (\b{b \^ c}) &= \b{a} \cdot (\b{b \o c - c \o b}) \\
&= (\b{a} \cdot \b{b}) \b{c} - (\b{a} \cdot \b{c}) \b{b}
\end{aligned}\]
<p>A general formula for the interior product of a vector with a multivector, which can be deduced from the above, is</p>
\[\b{a} \cdot (\b{b} \^ \beta) = (\b{a} \cdot \b{b}) \^ \beta - \b{b} \^ (\b{a} \cdot \beta)\]
<hr />
<h3 id="11-projection">1.1 Projection</h3>
<p>The intuitive meaning of the interior product is related to <em>projection</em>. We can construct the projection and rejection operators of a vector onto a multivector with:</p>
\[\beta_{\b{a}} = \text{proj}_{\b{a}} \beta = \frac{1}{\Vert \b{a} \Vert^2} \b{a} \^ (\b{a} \cdot \beta) \\
\beta_{\perp \b{a}} = \text{proj}_{\perp \b{a}} \beta = \frac{1}{\Vert \b{a} \Vert^2} \b{a} \cdot (\b{a} \^ \beta)\]
\[\beta = \beta_{\b{a}} + \beta_{\perp \b{a}} = \frac{1}{\Vert \b{a} \Vert^2} [\b{a} \^ (\b{a} \cdot \beta) + \b{a} \cdot (\b{a} \^ \beta)]\]
<p>To understand this, recall that the classic formula for projecting onto a <em>unit</em> vector is:</p>
\[\b{b}_{\b{a}} = \b{a} (\b{a} \cdot \b{b})\]
<p>That is, we find the scalar coordinate along \(\b{a}\), then multiply by \(\b{a}\) once again. With multivectors, \(\b{a} \cdot \beta\) is not a scalar, so we can’t just use scalar multiplication – so it makes some sense that it would be replaced with \(\^\).<sup id="fnref:tensor" role="doc-noteref"><a href="#fn:tensor" class="footnote" rel="footnote">3</a></sup></p>
\[\b{b}_{\b{a}} = \b{a} \^ (\b{a} \cdot \b{b})\]
<p>The classic vector rejection formula is</p>
\[\b{b}_{\perp \b{a}} = \b{b} - \b{b}_{\b{a}} = \b{b} - \b{a} (\b{a} \cdot \b{b})\]
<p>Using the interior product we can write this as</p>
\[\b{b}_{\perp \b{a}} = \b{a} \cdot (\b{a} \^ \b{b}) = (\b{a \cdot a}) \b{b} - (\b{a} \cdot \b{b}) \b{a} = \b{b} - \b{b}_{\b{a}}\]
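<p>For plain vectors in \(\bb{R}^3\), both forms of the rejection are easy to check numerically (a sketch with random vectors and a normalized \(\b{a}\)):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(3)
a /= np.linalg.norm(a)   # unit vector, so |a|^2 = 1
b = rng.standard_normal(3)

# a . (a ^ b) expanded as (a.a) b - (a.b) a
rejection = np.dot(a, a) * b - np.dot(a, b) * a
assert np.allclose(rejection, b - np.dot(a, b) * a)   # equals b - b_a
assert abs(np.dot(a, rejection)) < 1e-12              # orthogonal to a
```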
<p>The multivector version \(\b{a} \^ \beta\) is only non-zero if \(\beta\) has a component which does not contain \(\b{a}\) – all \(\b{a}\)-ness is removed by the wedge product, leaving something like \(\b{a} \^ \beta_{\perp \b{a}}\). Then \(\b{a} \cdot (\b{a} \^ \beta_{\perp \b{a}}) = \beta_{\perp \b{a}}\), for unit \(\b{a}\).</p>
<p>The correct interpretation of \(\b{a} \cdot \beta\), then, is a lot like what it means when \(\beta = \b{b}\): it’s finding the ‘\(\b{a}\)-component’ of \(\beta\). It’s just that, when \(\beta\) is a multivector, the ‘\(\b{a}\)-coordinate’ is no longer a <em>scalar</em>.</p>
<p>For example this is the ‘\(\b{x}\)‘-component of a bivector \(\b{b \^ c}\):</p>
\[\begin{aligned}
\b{x} \cdot (\b{b \^ c}) &= b_x \b{c} - c_x \b{b} \\
&= b_x (\b{c}_x + \b{c}_{\perp x}) - c_x (\b{b}_x + \b{b}_{\perp x}) \\
&= b_x \b{c}_{\perp x} - c_x \b{b}_{\perp x}
\end{aligned}\]
<p>Note that the result doesn’t have any \(\b{x}\) factors in it.</p>
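<p>This is true in general, not just for \(\b{a} = \b{x}\): expanding \(\b{a} \cdot (\b{b \^ c}) = (\b{a} \cdot \b{b}) \b{c} - (\b{a} \cdot \b{c}) \b{b}\), the result is always orthogonal to \(\b{a}\), which a quick numpy sketch confirms:</p>

```python
import numpy as np

def interior(a, b, c):
    # a . (b ^ c) = (a.b) c - (a.c) b
    return np.dot(a, b) * c - np.dot(a, c) * b

rng = np.random.default_rng(0)
a, b, c = rng.standard_normal((3, 4))
v = interior(a, b, c)
# the a-component vanishes: a . [(a.b) c - (a.c) b] = (a.b)(a.c) - (a.c)(a.b) = 0
assert abs(np.dot(a, v)) < 1e-12
```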
<p>Once more, to be extra explicit about this intuition:</p>
<p><strong>For a unit multivector \(\alpha\), the meaning of \(\alpha \cdot \beta\) is to find \(\beta_{\alpha}\), the ‘\(\alpha\) component’ of \(\beta\).</strong></p>
<p>We can remove the stipulation that \(\alpha\) be a unit multivector, but it requires being a bit careful. To illustrate why, consider an example with just vectors. What should be the value of \(v_{5x}\)? Probably \(\frac{1}{5} v_x\), so that \(v_x \b{x} = v_{5x} (5 \b{x})\). Likewise, we need to divide through by the magnitude of \(\alpha\), so it’s actually \(\frac{1}{\Vert \alpha \Vert^2} \iota_{\alpha} \beta = \beta_{\alpha}\).</p>
<p>Unfortunately the rejection formula doesn’t work if \(\alpha\) is a multivector. It’s still true that \(\alpha \cdot \beta\) gives the ‘\(\alpha\)-coordinate’ of \(\beta\), if there is one. But we can only use \(\beta_{\perp \alpha} = \beta - \frac{1}{\Vert \alpha \Vert^2} \alpha \^ (\alpha \cdot \beta)\). The problem is that there are cases where both \(\alpha \^ \beta = \alpha \cdot \beta = 0\), such as for \(\b{x \^ y}\) and \(\b{y \^ z}\).<sup id="fnref:overlap" role="doc-noteref"><a href="#fn:overlap" class="footnote" rel="footnote">4</a></sup></p>
<hr />
<h3 id="12-derivations">1.2 Derivations</h3>
<p>If we consider our projection/rejection operations as operators, writing \(L_{\b{a}} \beta = \b{a} \^ \beta\) and \(\iota_{\b{a}} \beta = \b{a} \cdot \beta\), then:</p>
\[\iota_{\b{a}} \circ L_{\b{a}} + L_{\b{a}} \circ \iota_{\b{a}} = \vert \b{a} \vert^2 (\text{proj}_{\b{a}} + \text{proj}_{\perp \b{a}}) = \vert \b{a} \vert^2\]
<p>Since \(\iota^2 = L^2 = 0\), this could also be written as</p>
\[(\iota_{\b{a}} + L_{\b{a}})^2 = \vert \b{a} \vert^2\]
<p>And in fact this works (although the interpretation is trickier) with different vectors for each term:</p>
\[(\iota_{\b{a}} + L_{\b{b}})^2 = \b{a \cdot b}\]
<p>This turns out to be connected to a lot of other mathematics. The short version is that \(\iota\) is, technically, a <a href="https://en.wikipedia.org/wiki/Derivation_(differential_algebra)">“graded derivation”</a> on the exterior algebra, which is also codified in the ‘Leibniz’ property:</p>
\[\b{v} \cdot (\alpha \^ \beta) = (\b{v} \cdot \alpha) \^ \beta + (-1)^{\| \alpha \| } \alpha \^ (\b{v} \cdot \beta)\]
<p>This is the exterior algebra version of \(\p(uv) = (\p u) v + u (\p v)\), and the property that \(\iota L + L \iota = I\) is the equivalent of the <a href="https://en.wikipedia.org/wiki/Weyl_algebra">fact</a> that \(\p_x x - x \p_x = 1\).</p>
<p>I don’t know much about derivations and abstract algebra yet so that’ll have to wait.</p>
<hr />
<h3 id="13-duality">1.3 Duality</h3>
<p>If we are keeping track of vector space duality, the left side of an interior product \(\alpha \cdot \beta\) should transform like a dual multivector. (It certainly seems like it should because the left side of an inner product \(\< \alpha , \beta \>\) should.) I have been sloppy about this. I’m hoping to collect all of the duality and metric-tensor related stuff in a later article.</p>
<p>The discussion about projection above seems to me to strongly suggest that we define \(\frac{\iota_{\b{a}}}{\vert \b{a} \vert^2} = \b{a}^{-1}\) as a sort of ‘multiplicative inverse’ of \(\b{a}\) for the \(\^\) operation. It’s not a true inverse, because \(\b{a} \^ \b{a}^{-1} \^ \beta = \beta_{\b{a}}\). Instead of being invertible, dividing and then multiplying pulls out the projection on \(\b{a}\). There is a certain elegance to it anyway.</p>
<p>I sometimes suspect that interior products \(\iota_{\alpha}\), and dual vectors in general, should be considered as <em>negative</em>-grade multivectors, so \(\iota_{\alpha} \in \^^{- \| \alpha \|} V\). Then we could write that \(\alpha \cdot \beta \in \^^{\| \beta \| - \| \alpha \|} V\) even if \(\alpha\) has the higher grade. This is also compelling because it explains why dual vectors transform according to the inverse of a transformation: if \(\alpha \ra A^{\^k}(\alpha)\), of course \(\iota_{\alpha} \ra A^{-\^ k} (\iota_{\alpha})\). I hope to look into this in a later article.</p>
<hr />
<h2 id="2-more-identities">2. More identities</h2>
<p>We can use \(\iota\) to prove a few more vector identities. First, note that \(\star\) is just a special case of \(\iota\).<sup id="fnref:equalgrades" role="doc-noteref"><a href="#fn:equalgrades" class="footnote" rel="footnote">5</a></sup></p>
\[\begin{aligned}
\< \alpha \cdot \omega, \star \beta \> &= \< \omega, \alpha \^ \star \beta \> \\
&= \< \omega, \< \alpha, \beta \> \omega \> \\
&= \< \alpha , \beta \> \\
&= \< \star \alpha, \star \beta \>
\end{aligned}\]
<p>Since this holds for all \(\alpha, \beta\):</p>
\[\star \alpha = \alpha \cdot \omega\]
<p>(1) implies that \(\iota\) obeys many of the same rules as \(\^\):</p>
\[\begin{aligned}
(\alpha \^ \beta) \cdot \gamma &= \beta \cdot (\alpha \cdot \gamma) \\
&= (-1)^{\| \alpha \| \| \beta \|} \alpha \cdot (\beta \cdot \gamma) \\
\end{aligned}\]
<p>Combining these, we have a way to transform applications of \(\star\):</p>
\[\begin{aligned}
\star (\alpha \^ \beta) &= (\alpha \^ \beta) \cdot \omega \\
&= \beta \cdot (\alpha \cdot \omega) \\
&= \beta \cdot (\star \alpha) \\
&= (-1)^{\| \alpha \| \| \beta \|} \alpha \cdot (\star \beta)
\end{aligned}\]
<p>Since the cross product is \(\b{a} \times \b{b} = \star (\b{a} \^ \b{b})\):</p>
\[\b{a} \times \b{b} = \b{b} \cdot (\star \b{a}) = - \b{a} \cdot (\star \b{b})\]
<p>This lets us unpack cross product identities. Note that \(\star^2 = (-1)^{k(n-k)} = (-1)^{(1)(2)} = 1\) in \(\bb{R}^3\).</p>
<p>Here’s the <a href="https://en.wikipedia.org/wiki/Triple_product#Vector_triple_product">vector triple product</a>:</p>
\[\begin{aligned}
\b{a \times (b \times c)} &= \star(\b{a} \^ \star(\b{b \^ c})) \\
&= - \b{a} \cdot \star^2 (\b{b \^ c}) \\
&= - \b{a} \cdot (\b{b \^ c}) \\
&= (\b{a} \cdot \b{c}) \b{b} - (\b{a} \cdot \b{b}) \b{c}
\end{aligned}\]
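<p>It’s easy to check this numerically with NumPy (random vectors, nothing special assumed):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = rng.standard_normal((3, 3))

# 'BAC-CAB': a x (b x c) = (a.c) b - (a.b) c
bac_cab = np.dot(a, c) * b - np.dot(a, b) * c
assert np.allclose(np.cross(a, np.cross(b, c)), bac_cab)
```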
<p>The <a href="https://en.wikipedia.org/wiki/Quadruple_product">quadruple product</a>:</p>
\[\begin{aligned}
(\b{a \times b}) \times (\b{c \times d}) &= ((\b{a \times b}) \cdot \b{d}) \b{c} - ((\b{a \times b}) \cdot \b{c}) \b{d} \\
&= \star(\b{a \^ b \^ d}) \b{c} - \star(\b{a \^ b \^ c}) \b{d}
\end{aligned}\]
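<p>Numerically, the \(\star(\b{a \^ b \^ d})\) coefficients are just \(3 \times 3\) determinants, so the identity can be checked like this (a NumPy sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, c, d = rng.standard_normal((4, 3))

lhs = np.cross(np.cross(a, b), np.cross(c, d))
# star(a ^ b ^ d) = det of the matrix with rows a, b, d (likewise for c)
rhs = np.linalg.det([a, b, d]) * c - np.linalg.det([a, b, c]) * d
assert np.allclose(lhs, rhs)
```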
<p>The <a href="https://en.wikipedia.org/wiki/Jacobi_identity">Jacobi Identity</a>:</p>
\[\begin{aligned}
0 &= \b{a \times (b \times c)} + \b{b \times (c \times a)} + \b{c \times (a \times b)} \\
&= -(\b{a} \cdot (\b{b \^ c}) + \b{b} \cdot (\b{c \^ a}) + \b{c} \cdot (\b{a \^ b})) \\
&= -((\b{a \cdot b} - \b{b \cdot a}) \b{c} + (\b{b \cdot c - c \cdot b}) \b{a} + (\b{c \cdot a - a \cdot c}) \b{b}) \\
&= 0
\end{aligned}\]
<p>The Jacobi Identity can also be rearranged into the following intriguing form:</p>
\[\begin{aligned}
\b{a} \cdot (\b{b \^ c}) &= \b{b} \cdot (\b{a \^ c}) - \b{c} \cdot (\b{a \^ b}) \\
\end{aligned}\]
<p>This is equivalent to \((\b{b \cdot a}) \b{c} - (\b{c \cdot a}) \b{b}\), but it hints at greater structure (which is related to \(\cdot\) being a derivation, above). I know it’s involved in Lie Algebras, but I haven’t been able to find a good purely geometric intuition for what it could mean.</p>
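<p>Here’s a quick numerical check of the rearranged form, using the expansion \(\b{v} \cdot (\b{p \^ q}) = (\b{v \cdot p}) \b{q} - (\b{v \cdot q}) \b{p}\):</p>

```python
import numpy as np

def interior(v, p, q):
    # v . (p ^ q) = (v.p) q - (v.q) p, for vectors in R^3
    return np.dot(v, p) * q - np.dot(v, q) * p

rng = np.random.default_rng(3)
a, b, c = rng.standard_normal((3, 3))

# a . (b ^ c) = b . (a ^ c) - c . (a ^ b)
assert np.allclose(interior(a, b, c), interior(b, a, c) - interior(c, a, b))
```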
<p>(An alternative proof of the Jacobi identity: the exterior algebra element \(\b{a \^ b \^ c}\) corresponds to the tensor algebra element \(\b{a(bc-cb) + b(ca-ac) + c(ab-bc)}\). The identity follows from contracting any two indexes of this tensor together, since it is antisymmetric in all positions.)</p>
<hr />
<h3 id="21-matrix-multiplication">2.1 Matrix Multiplication</h3>
<p>One case where the interior product is already being used in mathematics is when multiplying by an antisymmetric matrix. A bivector \(\b{b \^ c}\) can be represented as a tensor product \(\b{b \o c - c \o b}\), which can be treated as an antisymmetric matrix \(M\). The interior product \(\b{a} \cdot (\b{b \^ c})\) is then equivalent to matrix multiplication, with \(\b{a}\) contracted against the <em>first</em> index of the tensor – that is, \(\b{a}^T M\), written below as \(M^T \b{a}\):</p>
\[\begin{aligned}
\b{a} \cdot (\b{b \^ c}) &= (\b{a \cdot b}) \b{c} - (\b{a \cdot c}) \b{b} \\
&= \begin{pmatrix} 0 & b_y c_x - b_x c_y & b_z c_x - b_x c_z \\
b_x c_y - b_y c_x & 0 & b_z c_y - b_y c_z \\
b_x c_z - b_z c_x & b_y c_z - b_z c_y & 0 \end{pmatrix} \begin{pmatrix} a_x \\ a_y \\ a_z \end{pmatrix}
\end{aligned}\]
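<p>In NumPy this correspondence is a one-liner (note the transpose, since \(\b{a}\) contracts with the first index of the tensor):</p>

```python
import numpy as np

rng = np.random.default_rng(4)
a, b, c = rng.standard_normal((3, 3))

# The bivector b ^ c as the antisymmetric matrix b (x) c - c (x) b
M = np.outer(b, c) - np.outer(c, b)

# a . (b ^ c) = (a.b) c - (a.c) b, as matrix multiplication a^T M = M^T a
assert np.allclose(M.T @ a, np.dot(a, b) * c - np.dot(a, c) * b)
```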
<p>For instance, this is one way of writing a rotation operator which rotates vectors by \(\frac{\pi}{2}\) in the \(\b{bc}\) plane (if \(\b{b}, \b{c}\) are orthogonal unit vectors; any component orthogonal to that plane is annihilated):</p>
\[R_{\b{bc}} (\b{a}) = \b{a} \cdot (\b{b \^ c})\]
<hr />
<p>Other articles related to Exterior Algebra:</p>
<ol start="0">
<li><a href="/2018/08/06/oriented-area.html">Oriented Areas and the Shoelace Formula</a></li>
<li><a href="/2018/10/08/exterior-1.html">Matrices and Determinants</a></li>
<li><a href="/2018/10/09/exterior-2.html">The Inner product</a></li>
<li><a href="/2019/01/26/hodge-star.html">The Hodge Star</a></li>
<li><a href="/2019/01/27/interior-product.html">The Interior Product</a></li>
<li><a href="/2020/10/15/ea-operations.html">All the Exterior Algebra Operations</a></li>
</ol>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:alt" role="doc-endnote">
<p>Recall that we basically elect to antisymmetrize <em>one</em> side because if we did both we would need an extra factor of \(1/n!\) for the same result. It might be that there are abstractions of this where you do need to do both sides (for instance if \(a \cdot b \neq b \cdot a\)?) <a href="#fnref:alt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:rhs" role="doc-endnote">
<p>It is probably possible to generalize to either side having the lower grade, but it’s not normally done that way. I want to investigate it sometime. <a href="#fnref:rhs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:tensor" role="doc-endnote">
<p>the other candidate would be \(\o\), but we’d like the result to also be a multivector so it makes sense to only consider \(\^\). <a href="#fnref:tensor" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:overlap" role="doc-endnote">
<p>I think there’s a way to make it work. It looks something like: for each basis multivector of lower grade, remove it from both sides, like \((\b{x} \cdot \alpha) \cdot (\b{x} \cdot \beta)\). But that’s complicated and will have to be saved for the future. <a href="#fnref:overlap" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:equalgrades" role="doc-endnote">
<p>It is easier to use the \(\cdot\) notation for inner products, since after all they are a special case of interior products. But sometimes I use \(\<, \>\) anyway when it makes things clearer. <a href="#fnref:equalgrades" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Exterior Algebra Notes #3: The Hodge Star2019-01-26T00:00:00+00:00http://alexkritchevsky.com/2019/01/26/hodge-star<p><em>Previously: <a href="/2018/10/08/exterior-1.html">matrices</a> and <a href="/2018/10/09/exterior-2.html">inner products</a> on exterior algebras.</em></p>
<p><em>Vector spaces are assumed to be finite-dimensional and over \(\bb{R}\). The grade of a multivector \(\alpha\) will be written \(\| \alpha \|\), while its magnitude will be written \(\Vert \alpha \Vert\). Bold letters like \(\b{u}\) will refer to (grade-1) vectors, while Greek letters like \(\alpha\) refer to arbitrary multivectors with grade \(\| \alpha \|\).</em></p>
<!--more-->
<p>More notes on exterior algebra. This time, the Hodge Star operator \(\star \alpha\).</p>
<p>This is where things start to get confusing: there’s a bunch of operations that do… stuff… to multivectors, but they start to stray from geometric intuition and there are tons of formulas capturing all their relationships. It’s pretty unsatisfying. I’m gonna write it out anyway, but in a few posts I hope to figure out a way to know all of this stuff without having to remember so many equations.</p>
<hr />
<h2 id="1-the-hodge-star">1. The Hodge Star</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Hodge_star_operator">Hodge Star</a> operator takes multivectors to their ‘complements’: \(\star \b{x} = \b{y \^ z}\). Given a choice of pseudoscalar \(\omega\), \(\star \alpha\) is a multivector such that \(\alpha \^ \star \alpha = \omega\) (for unit \(\alpha\)). Notice that the operation is entirely defined <em>relative</em> to a choice of pseudoscalar. This means that \(\star \b{x}\) means different things depending on whether you’re operating in \(\bb{R}^2\), \(\bb{R}^3\), etc.</p>
<p>Other notations for \(\star \alpha\) include \(\ast \alpha\) or \(\vert \alpha\) or even ~\(\alpha\). Weird.</p>
<p>The choice of unit pseudoscalar \(\omega = \b{x \^ y \^ z}\) amounts to defining a global orientation and picking that you want to use the ‘right hand rule’ for cross products. It’s otherwise kinda arbitrary. Usually we just use alphabetical order for the basis vectors, and that’s what is meant if we don’t specify otherwise.</p>
<p>\(\alpha \^ \star \alpha = \omega\) doesn’t totally specify \(\star \alpha\) (we can freely add any component that wedges to zero with \(\alpha\)). So instead we define \(\star\) so that this holds for <em>any</em> choice of \(\beta\):</p>
\[\boxed{\beta \^ (\star \alpha) = \< \beta, \alpha \> \omega} \tag{1}\]
<p>Although sometimes people define the inner product in terms of the star instead:</p>
\[\< \beta, \alpha \> = \star (\beta \^ (\star \alpha))\]
<p>This definition means: to compute \(\star \alpha\), figure out what value of \(\star \alpha\) satisfies \(\alpha \^ (\star \alpha) = \< \alpha, \alpha \> \omega\). Then ensure that the coefficient of anything orthogonal to \(\alpha\) is \(0\).</p>
<p>Example in \(\bb{R}^3\): \(\star (2 \b{x}) = 2 \b{ y \^ z}\), because \((2 \b{x}) \^ (2 \b{y \^ z}) = (2 \b{x} \cdot 2 \b{x}) \omega = 4 \omega\). Of course, \(\star (2 \b{x}) \stackrel{?}{=} 2 \b{y \^ z} + 2 \b{x \^ z}\) also satisfies \(\alpha \^ \star \alpha = 4 \omega\), but it violates (1), because \(\b{y} \^ \star (2 \b{x}) = (\b{y} \cdot 2 \b{x}) \omega = 0\), but \(\b{y} \^ (2 \b{x \^ z}) = -2 \omega \neq 0\). So we have to set the coefficient of \(\b{x \^ z}\) (and \(\b{x \^ y}\) for the same reason) to \(0\).</p>
<p>We will use \(\star^2\) a lot, so let’s write that down.</p>
\[\boxed{\star^2 \alpha = (-1)^{\| \alpha \| \| \star \alpha \|} \alpha} \tag{2}\]
<p>The sign comes from having to move all the terms of \(\alpha\) and \(\star \alpha\) past each other to get back to the original ordering (work through an example to see why). \(\star^2 = I\) is always true for \(n\) odd, and for \(n\) even is true if \(k\) is also even. So \(\star^2 = - I\) if and only if (\(n\) even, \(k\) odd). This is the case when \((n=2, k=1)\), ie when rotating a vector in a plane, so \(\star^2_{\bb{R}^2} \b{x} = - \b{x}\).</p>
<p>We will often use the fact that \(\star\) preserves inner products:</p>
\[\begin{aligned} \< \star \alpha, \star \beta \> \omega &= \star \alpha \^ \star^2 \beta \\
&= (-1)^{\| \star \alpha \| \| \beta \|} (-1)^{\| \beta \| \| \star \beta \|} {\beta \^ \star \alpha} \\
&= (-1)^{2(n-k)k} \< \beta, \alpha \>\omega \\
&= \< \alpha, \beta \> \omega \\
\< \star \alpha, \star \beta \> &= \boxed{\< \alpha, \beta \>}
\tag{3}
\end{aligned}\]
<p>\(\star\) is a linear transformation from \(\^ V \ra \^ V\), and when restricted to elements of a particular grade, from spaces \(\^^{k} V \ra \^^{n-k} V\). We can write this linear transformation down as a matrix. In \(\bb{R}^2\) it takes vectors to vectors, so has the particularly nice form of the standard rotation matrix:</p>
\[\star_{\bb{R}^2} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}\]
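<p>We can verify that this matrix really behaves like \(\star\) in \(\bb{R}^2\): \(\star \b{x} = \b{y}\), \(\star \b{y} = -\b{x}\), and \(\star^2 = -1\):</p>

```python
import numpy as np

star = np.array([[0, -1],
                 [1,  0]])  # star on R^2 as a rotation matrix

x, y = np.array([1, 0]), np.array([0, 1])
assert np.array_equal(star @ x, y)    # star(x) = y, since x ^ y = omega
assert np.array_equal(star @ y, -x)   # star(y) = -x
assert np.array_equal(star @ star, -np.eye(2, dtype=int))  # star^2 = -1
```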
<p>Wikipedia will tell you that the main purpose of \(\star\) is to define, like, the de Rham cohomology on differential manifolds, or something like that. Wikipedia is silly and that was clearly written by a specialist. The main purpose of \(\star\) is to be able to do geometry and physics for 150 years while pretending that multivectors don’t exist. As we will see, anything that uses a cross product (areas, magnetic fields, angular momenta) is, in fact, a bivector, and we get away with ignoring that fact because \(\star\) lets us treat them like vectors.</p>
<hr />
<h2 id="2-note-on-duality">2. Note on Duality</h2>
<p>The Hodge star is also called the ‘Hodge dual’. This is not the same thing as vector space duality. But it’s related, so it can get kinda confusing.</p>
<p>When you are keeping track of duality, I have found references which define \(\star\) as both \(\^^k V \ra \^^{n-k}V\) and as \(\^^k V \ra \^^{n-k} V^*\). I suppose it depends on whether you are assuming the existence of an inner product as well, which lets you identify \(\^^{n-k} V \simeq \^^{n-k} V^*\). As mentioned above, some sources define \(\star\) in terms of the inner product and some define the inner product in terms of \(\star\).</p>
<p>Even without an inner product or \(\star\), the wedge product provides a construction of the dual space. Since the space \(\^^n V\) is one-dimensional, it is isomorphic to \(\bb{R}\), although there is no canonical choice of isomorphism (which amounts to picking a pseudoscalar). This means the map \(\un{k}{\alpha} \ra \alpha \^ \un{n-k}{\beta} \in \^^n V\) is a map \(\^^k V \ra \bb{R}\), so \(\^^{n-k} V \simeq (\^^k V)^*\) and \(\^^k V \simeq (\^^{n-k}V)^*\). This means the spaces are dual, in that each can be treated as a map from the other to \(\bb{R}\), but it doesn’t select a particular definition of this map.</p>
<p>A choice of \(\star\) or inner product (each can be defined in terms of the other) additionally maps \(\^^k V \simeq \^^{n-k} V\), and thus provides the isomorphism \(\^^{n-k} V \simeq \^^{n-k} V^*\). Something like that. See <a href="https://math.stackexchange.com/questions/872/what-is-the-relationship-between-the-hodge-dual-of-p-vectors-and-the-dual-space">this</a> explanation for more.</p>
<p>In physics, at least, we invoke vector space duality when we are concerned with creating coordinate-independent quantities. In a coordinate-independent inner product like \(a \cdot b\), one side must be contravariant and the other covariant, so that the result is invariant. Since \(\star (a \cdot b) = a \^ \star b\), \(b\) and \(\star b\) must have opposite variance, so \(\star\) has to apply the metric to perform this operation. The result is unwieldy (you can see one version on <a href="https://en.wikipedia.org/wiki/Hodge_star_operator#Computation_in_index_notation">wikipedia</a>), but I think I am going to wait on figuring it out for myself until I contend with everything else in a covariant way also in some future article.</p>
<p>It is worth noting that \(\star\) itself is not coordinate-invariant, at all. After all it depends on a particular choice of pseudoscalar, which is only preserved by orientation-preserving orthonormal transformations.</p>
<hr />
<h2 id="3-the-cross-product">3. The Cross Product</h2>
<p>My goal with these articles has so far been to demonstrate the usefulness of exterior algebra by cleanly showing how it leads to lots of vector identities. Thus, we should go ahead and address the cross product.</p>
<p>In \(\bb{R}^3\), the vector cross product takes two vectors and produces a third vector, orthogonal to them. This is better understood as taking their wedge product, then using \(\star\) to map that to a vector:</p>
\[\b{a \times b} = \star(\b{a \^ b})\]
<p>I say ‘better understood’ because this understanding elucidates properties such as how a cross product <a href="https://en.wikipedia.org/wiki/Pseudovector">transforms</a> under coordinate transformations. And if you just stick with the bivector, you don’t have to worry about the right hand rule either.</p>
<p>Some people will tell you that there’s also a 7-dimensional cross product. They are basically wrong. Well, there is one, sort of, but it’s the wrong generalization, and it’s useless for geometry. \(a \times b = \star (a \^ b)\) is the definition that extends to other dimensions the way you want – it’s just that it only maps vectors to vectors in \(\bb{R}^3\). Alternatively, you could define it to take \(n-1\) vectors to another vector in any \(\bb{R}^n\), but at that point why aren’t you just using \(\^\)?</p>
<p>In these identities we’ll assume the standard unit pseudoscalar with basis vectors in alphabetical order.</p>
<p>We can quickly compute the inner product of two cross products using (3):</p>
\[\begin{aligned}
\< \b{a \times b}, \b{c \times d} \> &= \< \star (\b{a \^ b}), \star (\b{c \^ d}) \> \\
&= \< \b{a \^ b}, \b{c \^ d} \> \\
&= (\b{a} \cdot \b{c}) (\b{b \cdot d}) - (\b{a \cdot d}) (\b{b \cdot c})
\end{aligned}\]
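<p>A quick NumPy check of this identity:</p>

```python
import numpy as np

rng = np.random.default_rng(5)
a, b, c, d = rng.standard_normal((4, 3))

# (a x b) . (c x d) = (a.c)(b.d) - (a.d)(b.c)
lhs = np.dot(np.cross(a, b), np.cross(c, d))
rhs = np.dot(a, c) * np.dot(b, d) - np.dot(a, d) * np.dot(b, c)
assert np.isclose(lhs, rhs)
```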
<p>And here’s the <a href="https://en.wikipedia.org/wiki/Triple_product#Scalar_triple_product">scalar triple product</a>, using (1):</p>
\[\begin{aligned}
\b{a \cdot (b \times c)} &= \b{a} \cdot \star (\b{b \^ c}) \\
&= \star(\b{b \^ c \^ a}) \\
&= \star(\b{a \^ b \^ c})
\end{aligned}\]
<p>The \(\star\) on the result reflects the fact that \(\b{a \^ b \^ c} \in \^^{3} \bb{R}^3\), so we apply \(\star\) to get a scalar value.</p>
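<p>Concretely, \(\star(\b{a \^ b \^ c})\) is the determinant of the matrix with rows \(\b{a}, \b{b}, \b{c}\), which makes the check a numerical one-liner:</p>

```python
import numpy as np

rng = np.random.default_rng(6)
a, b, c = rng.standard_normal((3, 3))

# a . (b x c) = star(a ^ b ^ c) = det of the matrix with rows a, b, c
assert np.isclose(np.dot(a, np.cross(b, c)), np.linalg.det([a, b, c]))
```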
<p>Just like we can lift a linear transformation \(A\) to its action on a wedge product, such as \(A^{\^ 2}(\b{x \^ y}) = A(\b{x}) \^ A(\b{y})\), we can (presumably) lift it to its action on a Hodge star or a cross product. That is, this transformation exists for \(A: V \ra V\):</p>
\[A^{\star}(\star \b{v}) \equiv \star A(\b{v}) \\
A^{\star} = \star A \star^{-1}\]
<p>This is important because if you have a value which is the result of a cross product, such as a \(\b{z} = \b{x \times y}\), it does not transform like the unit vector \(\b{z}\). We can write down a variant of \(A\) which acts on cross products:</p>
\[A^{\times}(\b{u \times v}) \equiv A(\b{u}) \times A(\b{v}) \\
A^{\times} = \star A^{\^ 2} \star^{-1} = \star A^{\^ 2} \star\]
<p>It turns out that \(A^{\times} = \det(A) (A^{-1})^T\), the cofactor matrix of \(A\) (which reduces to \((A^{-1})^T\) when \(\det A = 1\)) – but we still aren’t ready to go into detail on this because we need to discuss the matrix inverse.</p>
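<p>Here’s a numerical check of that transformation law (assuming \(A\) is invertible, which a random matrix almost surely is); note the determinant factor:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((3, 3))   # almost surely invertible
u, v = rng.standard_normal((2, 3))

# A(u) x A(v) = det(A) (A^{-1})^T (u x v): the cofactor matrix of A
lhs = np.cross(A @ u, A @ v)
rhs = np.linalg.det(A) * np.linalg.inv(A).T @ np.cross(u, v)
assert np.allclose(lhs, rhs)
```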
<hr />
<p>(This article used to also include the interior product but it seemed very long and unwieldy, so I’ve split them up.)</p>
<hr />
<p>Other articles related to Exterior Algebra:</p>
<ol start="0">
<li><a href="/2018/08/06/oriented-area.html">Oriented Areas and the Shoelace Formula</a></li>
<li><a href="/2018/10/08/exterior-1.html">Matrices and Determinants</a></li>
<li><a href="/2018/10/09/exterior-2.html">The Inner product</a></li>
<li><a href="/2019/01/26/hodge-star.html">The Hodge Star</a></li>
<li><a href="/2019/01/27/interior-product.html">The Interior Product</a></li>
<li><a href="/2020/10/15/ea-operations.html">All the Exterior Algebra Operations</a></li>
</ol>
All About Taylor Series2018-12-28T00:00:00+00:00http://alexkritchevsky.com/2018/12/28/taylor-series<p>Here is a survey of understandings on each of the main types of Taylor series:</p>
<ol>
<li>single-variable</li>
<li>multivariable \(\bb{R}^n \ra \bb{R}\)</li>
<li>multivariable \(\bb{R}^n \ra \bb{R}^m\)</li>
<li>complex \(\bb{C} \ra \bb{C}\)</li>
</ol>
<p>I thought it would be useful to have everything I know about these written down in one place.</p>
<p>These notes are not pedagogical; they’re for crystallizing everything when you already have a partial understanding of what’s going on. Particularly, I don’t want to have to remember the difference between all the different flavors of Taylor series, so I find it helpful to just cast them all into the same form, which is possible because they’re all the same thing (seriously why aren’t they taught this way?).</p>
<p>In these notes I am going to ignore discussions of convergence so that more ground can be covered. Generally it’s important to address convergence in order to, well, not be wrong. And I’m certain that I’ve made statements which are wrong below. But I am just trying to make sure I understand what happens when everything works, because in my interests (physics) it usually does.</p>
<!--more-->
<hr />
<h2 id="1-single-variable">1. Single Variable</h2>
<p>A Taylor series for a function in \(\bb{R}\) looks like this:</p>
\[\begin{aligned} f(x + \e) &= f(x) + f'(x) \e + f''(x) \frac{\e^2}{2} + \ldots \\
&= \sum_{n=0}^\infty f^{(n)}(x) \frac{\e^n}{n!}
\end{aligned}\]
<p>It’s useful to write this as one big operator acting on \(f(x)\):</p>
\[\boxed{f(x + \e) = \big[ \sum_{n=0}^\infty \frac{\p^n_x \e^n}{n!} \big] f(x)} \tag{Single-Variable}\]
<p>Or even as a single exponentiation of the derivative operator, which is commonly done in physics, but you probably shouldn’t think too hard about what it <a href="https://en.wikipedia.org/wiki/Exponential_map_(Lie_theory)">means</a>:</p>
\[f(x + \e) = e^{\e \p_x} f(x)\]
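<p>Here’s a small numeric illustration: truncating the operator sum at a dozen terms already reproduces \(\sin(x + \e)\) to machine precision (using the cyclic derivatives of \(\sin\)):</p>

```python
import math

def taylor_sin(x, eps, terms=12):
    # Truncated Taylor series of sin around x; the derivatives of sin
    # cycle through sin, cos, -sin, -cos.
    derivs = [math.sin, math.cos,
              lambda t: -math.sin(t), lambda t: -math.cos(t)]
    return sum(derivs[n % 4](x) * eps**n / math.factorial(n)
               for n in range(terms))

x, eps = 0.7, 0.3
assert abs(taylor_sin(x, eps) - math.sin(x + eps)) < 1e-12
```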
<p>I also think it’s useful to interpret the Taylor series equation as resulting from repeated integration:</p>
\[\begin{aligned}
f(x) &= f(0) + \int_0^x dx_1 f'(x_1) \\
&= f(0) + \int_0^x dx_1 [ f'(0) + \int_0^{x_1} dx_2 f''(x_2) ] + \ldots\\
&= f(0) + \int dx_1 f'(0) + \iint dx_1 dx_2 f''(0) + \iiint dx_1 dx_2 dx_3 f'''(0) + \ldots \\
&= f(0) + x f'(0) + \frac{x^2}{2} f''(0) + \frac{x^3}{3!} f'''(0) + \ldots
\end{aligned}\]
<p>This basically makes sense as soon as you understand integration, plus it makes it obvious that the series only works when all of the integrals are actually equal to the values of the previous function (so you can’t take a series of \(\frac{1}{1-x}\) which passes \(x=1\), because you can’t exactly integrate past it (though there are tricks)).</p>
<p>… plus it makes sense in pretty much any space you can integrate over.</p>
<p>… <em>plus</em> it makes it obvious how to truncate the series, how to create the remainder term, and it even shows you how you could – if you were so inclined – have each derivative be evaluated at a different point, such as \(f(x) = f(0) + \int_1^x f'(x_1) dx_1 =f(0) + (x-1) f'(1) + \frac{(x-1)(x-2)}{2} f''(2) + \ldots\), which I’ve never even seen done before (except for <a href="https://en.wikipedia.org/wiki/Finite_difference#Newton's_series">here?</a>), though good luck with figuring out convergence if you do that.</p>
<hr />
<p><a href="https://en.wikipedia.org/wiki/L%27H%C3%B4pital%27s_rule">L’Hôpital’s rule</a> about evaluating limits which give indeterminate forms follows naturally if the functions are both expressible as Taylor series. If \(f(x) = g(x) = 0\), then:</p>
\[\begin{aligned}
\lim_{\e \ra 0} \frac{f(x + \e)}{g(x + \e)} &= \lim_{\e \ra 0} \frac{ f(x) + \e f'(x + \e) + O(\e^2)} {g(x) + \e g'(x + \e) + O(\e^2)} \\
&= \lim_{\e \ra 0}\frac{f'(x+\e) + O(\e) }{g'(x+\e) + O(\e)} \\
&= \lim_{\e \ra 0} \frac{f'(x+\e)}{g'(x + \e)}
\end{aligned}\]
<p>Which equals \(\frac{f'(x)}{g'(x)}\) if the limit exists, and otherwise might be solvable by applying the rule recursively. None of this works of course if the limit doesn’t exist. If \(f(x) = g(x) = \infty\), evaluate \(\lim \frac{1/g(x)}{1/f(x)}\) instead. If the indeterminate form is \(\infty - \infty\), evaluate \(\lim f(x) - g(x)\) instead.</p>
<hr />
<h2 id="2-multivariable---scalar">2. Multivariable -> Scalar</h2>
<p>The multivariable Taylor series looks messier at first, so let’s start with only two variables, writing \(f_x \equiv \p_x f(\b{x})\) and \(\b{v} = (v_x, v_y)\), and we’ll work it into a more usable form.</p>
\[\begin{aligned}
f(\b x + \b v) &= f(\b x) + [f_x v_x + f_y v_y] + \frac{1}{2!} [f_{xx} v_x^2 + 2 f_{xy} v_x v_y + f_{yy} v_y^2] \\
&+ \frac{1}{3!} [f_{xxx} v_x^3 + 3 f_{xxy} v_x^2 v_y + 3 f_{xyy} v_x v_y^2 + f_{yyy} v_y^3] + \ldots
\end{aligned}\]
<p>(The strangeness of the terms like \(2 f_{xy} v_x v_y\) and \(3 f_{xxy} v_x^2 v_y\) is because these are really sums of multiple terms; because of the <a href="https://en.wikipedia.org/wiki/Symmetry_of_second_derivatives">commutativity of partial derivatives</a> on analytic functions, \(f_{xy} = f_{yx}\), we can write \(f_{xy} v_x v_y + f_{yx} v_y v_x = 2 f_{xy} v_x v_y\).)</p>
<p>The first few terms are often arranged like this:</p>
\[f(\b x + \b v) = f(\b x) + \b{v} \cdot \nabla f(\b{x}) + \frac{1}{2} \b{v}^T \begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix} \b{v} + O(v^3)\]
<p>\(\nabla f(\b{x})\) is the gradient of \(f\) (the vector of partial derivatives \((f_x, f_y)\)). The matrix \(H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{pmatrix}\) is the “Hessian matrix” for \(f\), and represents its second derivative.</p>
<p>… But we can do better. In fact, every order of derivative of \(f\) in the total series has the same form, as powers of \(\b{v} \cdot \vec{\nabla}\), which I prefer to write as \(\b{v} \cdot \vec{\p}\), because it represents a ‘vector of partial derivatives’ \(\vec{\p} = (\p_x, \p_y)\):</p>
\[\begin{aligned}
f(\b x + \b v) &= f(\b x) + (v_x \p_x + v_y \p_y) f(\b x) + \frac{(v_x \p_x + v_y \p_y)^2}{2!} f(\b x) + \ldots \\
&= \big[ \sum_n \frac{(v_x \p_x + v_y \p_y)^n}{n!} \big] f(\b x) \\
&= \boxed{ \big[ \sum_{n=0}^\infty \frac{(\b{v} \cdot \vec{\p})^n}{n!} \big] f(\b x) } \end{aligned}
\tag{Scalar Field}\]
<p>(This can also be written as a sum over every individual term using <a href="https://en.wikipedia.org/wiki/Multi-index_notation">multi-index notation</a>.)</p>
<p>So that looks pretty good. And it can still be written as \(e^{ \b{v} \cdot \vec{\p}} f(\b{x})\). The same formula – now that we’ve hidden all the actual indexes – happily continues to work for dimension \(> 2\), as well.</p>
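<p>A numeric sanity check of the second-order truncation, with the gradient and Hessian of \(f(x,y) = \sin(x)\cos(y)\) written out by hand – the error should be \(O(\Vert \b{v} \Vert^3)\):</p>

```python
import numpy as np

def f(p):
    x, y = p
    return np.sin(x) * np.cos(y)

def grad(p):
    x, y = p
    return np.array([np.cos(x) * np.cos(y), -np.sin(x) * np.sin(y)])

def hessian(p):
    x, y = p
    return np.array([[-np.sin(x) * np.cos(y), -np.cos(x) * np.sin(y)],
                     [-np.cos(x) * np.sin(y), -np.sin(x) * np.cos(y)]])

p = np.array([0.4, 1.1])
v = np.array([1e-3, 2e-3])

# f(p + v) ~ f(p) + grad.v + (1/2) v^T H v, with O(|v|^3) error
approx = f(p) + grad(p) @ v + 0.5 * v @ hessian(p) @ v
assert abs(f(p + v) - approx) < 1e-8
```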
<p>… Actually, this is not as surprising a formula as it might look. The multivariate Taylor series of \(f(\b{x})\) is <em>really</em> just a bunch of single-variable series multiplied together:</p>
\[\begin{aligned}
f(x+ v_x, y + v_y) &= e^{v_x \p_x} f(x, y + v_y) \\
&= e^{v_x \p_x}e^{v_y \p_y} f(x,y) \\
&= e^{v_x \p_x + v_y \p_y} f(x,y) \\
&= e^{\b{v} \cdot \vec{\p}} f(\b{x}) \end{aligned}\]
<p>I mention all this because it’s useful to have a solid idea of what a scalar function is before we move to <em>vector</em> functions.</p>
<hr />
<p>L’Hôpital’s rule is more subtle for multivariable functions. In general the limit of a function may be different depending on what direction you approach from, so an expression like \(\lim_{\b{x} \ra 0} \frac{f(\b{x})}{g(\b{x})}\) is not necessarily defined, even if both \(f\) and \(g\) have Taylor expansions. On the other hand, if we choose a path for \(\b{x} \ra 0\), such as \(\b{x}(t) = (x(t), y(t))\) then this just becomes a one-dimensional limit, and the regular rule applies again. So, for instance, while \(\lim_{\b x \ra 0} \frac{f(\b{x})}{g(\b x)}\) may not be defined, \(\lim_{t \ra 0} \frac{f(t \b{v})}{g(t \b{v})}\) is.</p>
<p>And the path we take to approach \(0\) doesn’t even matter – only the gradients when we’re infinitesimally close to \(0\). For example, suppose \(f(0,0) = g(0,0) = 0\) and we’re taking the limit on the path given by \(y = x^2\):</p>
\[\lim_{\e \ra 0} \frac{f(\e,\e^2)}{g(\e,\e^2)} = \lim_{ \e \ra 0 } \frac{ f_x(0,0) \e + O(\e^2) }{ g_x(0,0) \e + O(\e^2)} = \lim_{\e \ra 0} \frac{f(\e,0)}{g(\e,0)}\]
<p>The \(f_y\) and \(g_y\) terms are of order \(\e^2\) and so drop out, leaving a limit taken only on the \(x\)-axis – corresponding to the fact that the tangent to \((x,x^2)\) at 0 is \((1,0)\).</p>
<p>In fact, this problem basically exists in 1D also, except that limits can only come from two directions: \(x^+\) and \(x^-\), so lots of functions get away without a problem (but you can also <a href="https://en.wikipedia.org/wiki/Cauchy_principal_value">abuse this</a>). L’Hôpital’s rule only needs that the functions be expandable as a Taylor series on the side the limit comes from.</p>
<p>I think that the concept of a limit that <em>doesn’t</em> specify a direction of approach is more common than it should be, because it’s really quite problematic in practice. I’m not quite sure I fully understand the complexity of solving it in \(N > 1\) dimensions – but clearly if you just reduce to a 1-dimensional limit, you sweep the difficulties under the rug anyway. But see, perhaps, <a href="https://arxiv.org/pdf/1209.0363.pdf">this</a> pre-print for a lot more information.</p>
<hr />
<h2 id="3-vector-fields">3. Vector Fields</h2>
<p>There are several types of vector-valued functions – curves like \(\gamma: \bb{R} \ra \bb{R}^n\), or maps between manifolds like \(\b{f}: \bb{R}^m \ra \bb{R}^n\) (including from a space to itself). In each case there is something like a Taylor series that can be defined. It’s not commonly written out, but I think it <em>should be</em>, so let’s try.</p>
<p>Let’s imagine our function maps spaces \(X \ra Y\), where \(X\) has \(m\) coordinates and \(Y\) has \(n\) coordinates, and \(m\) might be 1 in the case of a curve. Then along any <em>particular</em> coordinate in \(Y\) out of the \(n\) – call it \(y_i\) – the Taylor series expression from above holds, because \(f_i = \b{f} \cdot y_i\) is just a scalar function.</p>
\[f(\b{x} + \b{v})_i = e^{\b{v} \cdot \vec{\p}} [f(\b{x})_i]\]
<p>But of course this holds in every \(i\) at once, so it holds for the whole function:</p>
\[\b{f}(\b{x} + \b{v}) = e^{\b{v} \cdot \vec{\p}} \b{f}(\b{x})\]
<p>The subtlety here is that the partial derivatives \(\p\) are now being taken <em>termwise</em> – once for each component of \(\b{f}\). For example, consider the first few terms when \(X\) and \(Y\) are 2D:</p>
\[\begin{aligned}
\b{f}(\b{x} + \b{v}) &= \b{f}(\b{x}) + (v_{x_1} \p_{x_1} + v_{x_2} \p_{x_2}) \b{f} + \frac{(v_{x_1} \p_{x_1} + v_{x_2} \p_{x_2})^2}{2!} \b{f} + \ldots\\
&= \b{f} + \begin{pmatrix} \p_{x_1} \b{f}_{y_1} & \p_{x_2} \b{f}_{y_1} \\ \p_{x_1} \b{f}_{y_2} & \p_{x_2} \b{f}_{y_2} \end{pmatrix} \begin{pmatrix} v_{x_1} \\ v_{x_2} \end{pmatrix} + \ldots \\
&= \b{f} +(\p_{x_1}, \p_{x_2}) \o \begin{pmatrix}\b{f}_{y_1} \\ \b{f}_{y_2} \end{pmatrix} \cdot \begin{pmatrix} v_{x_1} \\ v_{x_2} \end{pmatrix} + \ldots
\end{aligned}\]
<p>That matrix term, the \(n=1\) term in the series, is the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian Matrix</a> of \(f\), sometimes written \(J_f\), and is much more succinctly written as \(\vec{\p}_{x_i} \b{f}_{y_j}\), or just \(\vec{\p}_i \b{f}_j\) or even just \(\p_i \b{f}_j\).</p>
\[J_f = \p_i f_j\]
<p>The Jacobian matrix is the ‘first derivative’ of a vector field, and it includes every term which can possibly matter to compute how the function changes to first-order. In the same way that a single-variable function is locally linear (\(f(x + \e) \approx f(x) + \e f'(x)\)), a multi-variable function is locally a linear transformation: \(\b{f}(\b{x + v}) \approx \b{f}(\b{x}) + J_f \b{v}\).</p>
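<p>Here’s a quick numerical illustration of that local-linearity statement, as a sketch with an arbitrary example map of my own choosing (nothing from the text):</p>

```python
import numpy as np

# An arbitrary example map f : R^2 -> R^2.
def f(p):
    x, y = p
    return np.array([x**2 * y, np.sin(x) + y])

# Its Jacobian J_f, with entries (J_f)_{ji} = df_j/dx_i, computed by hand.
def jacobian(p):
    x, y = p
    return np.array([[2 * x * y, x**2],
                     [np.cos(x), 1.0]])

x0 = np.array([0.7, -0.3])
v = np.array([1e-4, 2e-4])  # a small displacement

exact = f(x0 + v)
linear = f(x0) + jacobian(x0) @ v  # first order: f(x+v) ~ f(x) + J_f v

# the leftover error is O(|v|^2), vastly smaller than |v| itself
error = np.linalg.norm(exact - linear)
```

<p>The error of the linear approximation shrinks quadratically as \(\b{v} \ra 0\), which is exactly the claim that \(J_f\) captures everything that matters to first order.</p>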
<hr />
<p>Higher-order terms in the vector field Taylor series generalize ‘second’ and ‘third’ derivatives, etc, but they are generally <em>tensors</em> rather than matrices. They look like \((\p \o \p) \b{f}\), \((\p \o \p \o \p) \b{f}\), or \(\p^{\o n} \b{f}\) in general, and they act on \(n\) copies of \(\b{v}\), ie, \(\b{v}^{\o n}\).</p>
<p>The full expansion (for \(X,Y\) of any number of coordinates) is written like this:</p>
\[\begin{aligned} \b{f}(\b{x} + \b{v}) &= \b{f} + \p_i \b{f} \cdot v_i + \frac{1}{2!}(\p_i \p_j \b{f}) \cdot v_i v_j + \frac{1}{3!} (\p_i \p_j \p_k \b{f}) \cdot v_i v_j v_k + \ldots \\
&= \b{f} + \p_i \b{f} \cdot v_i + \frac{1}{2!}(\p_i \p_j) \b{f} \cdot (v_i v_j) + \ldots \\
&= \b{f} +(\b{v} \cdot \vec{\p}) \b{f} + \frac{(\b{v} \cdot \vec{\p})^2}{2!} \b{f} + \ldots \\
\b{f}(\b{x} + \b{v}) &= \boxed{ \big[ \sum_{n=0}^\infty \frac{(\b{v} \cdot \vec{\p})^n}{n!} \big] \b{f}(\b{x}) }
\tag{Vector Field}
\end{aligned}\]
<p>We write the numerator in the summation as \((\b{v} \cdot \vec{\p})^{n}\), which expands to \((\sum_i v_i \p_i) (\sum_j v_j \p_j) \ldots\), and then we can still group things into exponentials, only now we have to understand that all of these terms have derivative operators on them that need to be applied to \(\b{f}\) to be meaningful:</p>
\[\b{f}(\b{x + v}) = e^{\b{v} \cdot \vec{\p}} \b{f}(\b{x})\]
<p>We could have included indexes on \(\b{f}\) also:</p>
\[\begin{aligned}
f_k(\b{x} + \b{v}) &= f_k + \p_i f_k \cdot v_i + \frac{1}{2!}(\p_i \p_j) f_k \cdot (v_i v_j) + \ldots \\
&= \big[ \sum_{n} \frac{(\b{v} \cdot \vec{\p})^n}{n!} \big] f_k(\b{x}) \end{aligned}\]
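<p>For a polynomial map the series terminates, so the boxed expansion can be checked exactly with a computer algebra system. A sketch in sympy (the example map is mine, not from the text):</p>

```python
import sympy as sp

x1, x2, v1, v2 = sp.symbols('x1 x2 v1 v2')

# An example polynomial map R^2 -> R^2; its Taylor series terminates at n = 3.
f = sp.Matrix([x1**2 * x2, x1 + x2**3])

def directional(expr):
    # one application of the operator (v . d), i.e. v1*d/dx1 + v2*d/dx2
    return v1 * sp.diff(expr, x1) + v2 * sp.diff(expr, x2)

# sum_{n=0}^{3} (v . d)^n / n!, applied termwise to each component of f
series = sp.zeros(2, 1)
term = f
for n in range(4):
    series += term / sp.factorial(n)
    term = term.applyfunc(directional)

shifted = f.subs([(x1, x1 + v1), (x2, x2 + v2)], simultaneous=True)
difference = sp.simplify(series - shifted)  # the zero matrix
```

<p>Since both components are cubic polynomials, four terms of the series reproduce \(\b{f}(\b{x} + \b{v})\) exactly.</p>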
<p>It seems evident that this should work for any other sort of differentiable object as well. What about matrices?</p>
\[M_{ij}(\b{x} + \b{v})= \big[ \sum_{n} \frac{(\b{v} \cdot \vec{\p})^n}{n!} \big] M_{ij}(\b{x})\]
<p>I don’t want to talk about curl and divergence here, because it brings in a lot more concepts and I don’t know the best understanding of it, but it’s worth noting that both are formed from components of \(J_f\), appropriately arranged.</p>
<hr />
<h2 id="4-complex-analytic">4. Complex Analytic</h2>
<p>The complex plane \(\bb{C}\) is a sort of change-of-basis of \(\bb{R}^2\), via \((z,\bar{z}) = (x + iy, x - iy)\):</p>
\[z \lra x\b{x} + y\b{y}\]
\[\bar{z} \lra x\b{x} - y\b{y}\]
<p>Therefore we can write it as a Taylor series in these two variables:</p>
\[f(z + \D z, \bar{z} + \D \bar{z}) = \big[ \sum_{n=0}^\infty \frac{(\D z \p_z + \D \bar{z} \p_{\bar{z}})^n}{n!} \big] f(z, \bar{z})\]
<p>One subtlety: it should always be true that \(\p_{x_i} \b{x}^j = 1_{i = j}\) when changing variables. Because \(z\) and \(\bar{z}\), when considered as vectors in \(\bb{R}^2\), are not <em>unit</em> vectors, there is a normalization factor required on the partial derivatives. Also, for \(\bb{C}\) the factors of \(i\) cause the signs to swap:</p>
\[\begin{aligned}
\p_z &\underset{\bb{C}}{=} \frac{1}{2}(\p_x - i \p_y) \underset{\bb{R}^2}{=} \frac{1}{2}(\p_{\b{x}} + \p_{\b{y}}) \\
\p_{\bar{z}} &\underset{\bb{C}}{=} \frac{1}{2}(\p_x + i \p_y) \underset{\bb{R}^2}{=} \frac{1}{2}(\p_{\b{x}} - \p_{\b{y}})
\end{aligned}\]
<p>In complex analysis, for some reason, \(\bar{z}\) is not treated as a true variable, and we only consider a function as ‘complex differentiable’ when it has derivatives with respect to \(z\) alone. Notably, we would say that the derivative \(\p_z \bar{z}\) does not exist – the value of \(\lim_{(x,y) \ra (0,0)} \frac{x - iy}{x + i y}\) is different depending on the path you take towards the origin. These statements turn out to be <em>almost</em> equivalent:</p>
<ul>
<li>\(f(z)\) is a function of only \(z\) in a region</li>
<li>\(\p_{\bar{z}} f(z) = 0\) in a region</li>
<li>\(f(z)\) is complex-analytic in a region</li>
<li>\(f(z)\) has a Taylor series as a function of \(z\) in a region</li>
</ul>
<p>So when we discuss Taylor series of functions \(\bb{C} \ra \bb{C}\), we usually mean this:</p>
\[\boxed{f(z + \D z) = \big[ \sum_{n=0}^\infty \frac{(\D z \p_z)^n}{n!} \big] f(z)} \tag{Complex-Analytic}\]
<p>If we write \(f(z(x,y)) = u(x,y) + i v(x,y)\), the requirement that \(\p_{\bar{z}} f(z) = \frac{1}{2}(\p_x + i \p_y) f(z) = 0\) becomes the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Riemann_equations">Cauchy-Riemann Equations</a> by matching real and complex parts:</p>
\[\begin{aligned}
u_x &= v_y \\
u_y &= - v_x
\end{aligned}\]
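<p>These conditions are easy to verify symbolically. A sketch with example functions of my own choosing:</p>

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
z = x + sp.I * y

def d_zbar(f):
    # the Wirtinger derivative: (d/dx + i d/dy) / 2
    return sp.simplify((sp.diff(f, x) + sp.I * sp.diff(f, y)) / 2)

analytic = z**3       # a function of z alone
zbar = x - sp.I * y   # z-bar, the canonically non-analytic function

a = d_zbar(analytic)  # 0, since z^3 is analytic
b = d_zbar(zbar)      # 1, so z-bar fails the test

# Cauchy-Riemann for u + iv = z^3
u, v = sp.re(analytic), sp.im(analytic)
cr1 = sp.simplify(sp.diff(u, x) - sp.diff(v, y))  # u_x - v_y = 0
cr2 = sp.simplify(sp.diff(u, y) + sp.diff(v, x))  # u_y + v_x = 0
```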
<hr />
<p>There is one important case where a function \(f(z, \bar{z})\) is a function of only \(z\), yet it is <em>not</em> analytic and \(\p_{\bar{z}} f(z) \neq 0\), and it is solely responsible for almost all of the interesting parts of complex analysis. It’s the fact that:</p>
\[\p_{\bar{z}} \frac{1}{z} = 2 \pi i \delta(z, \bar{z})\]
<p>Where \(\delta(z, \bar{z})\) is the two-dimensional Dirac Delta function. I find this to be quite surprising. Here’s an aside on why it’s true:</p>
<aside class="toggleable" id="complex" placeholder="<b>Aside</b>: Conjugate derivatives <em>(click to expand)</em>">
<p>The explanation I’ve seen comes in a few forms, and is usually done in 3D regarding the divergence of \(\frac{1}{r^2}\). One version for \(\bb{C}\) goes like this:</p>
<p>Let \(C\) be a circle of radius \(R\) around the origin, and integrate \(\frac{1}{z}\) in polar coordinates, using the fact that \(dz = d(re^{i \theta}) = dr e^{i \theta} + ire^{i \theta} d\theta\).</p>
\[\begin{aligned}
\oint_C \frac{1}{z} dz &= \oint_C \frac{e^{i \theta} dr + ir e^{i \theta} d\theta}{r e^{i \theta}} \\
&= \oint_C \frac{dr}{r} + \oint_C i d\theta \\
&= 0 + 2 \pi i
\end{aligned}\]
<p>Now apply Stokes’ theorem to the integral (using the notations of <a href="https://en.wikipedia.org/wiki/Differential_form">differential forms</a>):</p>
\[\begin{aligned}
2 \pi i &= \iint_D d(\frac{1}{z} dz) \\
&= \iint_D (\p_{\bar{z}} \frac{1}{z}) d\bar{z} \^ dz \end{aligned}\]
<p>Because we only really know how to deal with delta-functions in \((x,y)\) coordinates, change variables, using \(d\bar{z} \^ dz = (dx - i dy) \^ (dx + i dy) = 2i dx \^ dy\):</p>
\[\begin{aligned}
2 \pi i &= \iint_D (\p_{\bar{z}} \frac{1}{z}) d\bar{z} \^ dz \\
2 \pi i&= 2i \iint_D (\p_{\bar{z}} \frac{1}{z}) dx \^ dy \\
\pi &= \iint_D (\p_{\bar{z}} \frac{1}{z}) dx \^ dy
\end{aligned}\]
<p>Because this is true on <em>any</em> circle around the origin, the term in the integral is behaving like a delta distribution:</p>
\[\p_{\bar{z}}\frac{1}{z} \equiv \pi \delta(x,y)\]
<p>If we want to express this as a delta function on \(\bb{C}\), we need \(2 \pi i = \iint_D \p_{\bar{z}} (\frac{1}{z}) d \bar{z} \^ dz\) to be true, so this should hold when considered as an equality of distributions:</p>
\[\p_{\bar{z}} \frac{1}{z} \equiv 2 \pi i \delta(z, \bar{z})\]
<p>If nothing else this argument convinces me that delta functions should be dealt with in introductory multivariable calculus, and that complex analysis is pointlessly confusing.</p>
<p>One final comment: what does it <em>mean</em> for \(\p_{\bar{z}} \frac{1}{z} = 2 \pi i \delta(z, \bar{z})\) to be true? Well, it turns out that this effect exists even in one dimension, except that it’s exhibited by \(\ln x\) instead of \(\frac{1}{x}\) (and it’s by \(\frac{1}{x^2}\) in 3 dimensions, etc).</p>
<p>Any negative real number has a logarithm like \(\ln (-1) = i \pi\), due to the fact that \(e^{i \pi} = -1\). So the imaginary part of \(\ln x\) is \(\pi\) for \(x < 0\) but \(0\) for \(x > 0\): it jumps by \(\pi\) at the origin. This means that it should be true that \(\frac{d}{dx} \ln x = \frac{1}{x} + i \pi \delta(x)\), at least when it appears under an integral. I suppose that the delta-function derivative of \(\frac{1}{z}\) amounts to the same effect in 2D, but I’m not sure of a good way to intuit the details.</p>
</aside>
<p>Importantly, \(\p_{\bar{z}} z^n \neq 0\) is <em>only</em> true for \(n = -1\). This property gives rise to the entire method of <a href="https://en.wikipedia.org/wiki/Residue_theorem">residues</a>, because if \(f(z) = \frac{f_{-1}(0) }{z} + f^*(z)\), where \(f^*(z)\) has no terms of order \(\frac{1}{z}\), then integrating a contour \(C\) around a region \(D\) which contains \(0\) gives, via Stokes’ theorem:</p>
\[\begin{aligned}
\oint_C f(z) dz &= \iint_D \p_{\bar{z}} \big[ \frac{f_{-1}(0) }{z} + f^*(z) \big] \; d\bar{z} \^ dz \\
&= 2 \pi i \iint_D \delta(z, \bar{z}) f_{-1}(0) \; d\bar{z} \^ dz \\
&= 2 \pi i f_{-1}(0)
\end{aligned}\]
<p>(If the \(\bar{z}\) derivative isn’t \(0\), you get the <a href="https://en.wikipedia.org/wiki/Cauchy%27s_integral_formula#Smooth_functions">Cauchy-Pompeiu formula</a> for contour integrals immediately.)</p>
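<p>The residue computation is easy to check numerically: integrate around the unit circle with a simple Riemann sum, which for a smooth periodic integrand is extremely accurate. The example integrand is my own choosing:</p>

```python
import numpy as np

# f(z) = 5/z + z^2 has residue 5 at the origin; the analytic part z^2
# should contribute nothing to the contour integral.
def f(z):
    return 5.0 / z + z**2

N = 4096
theta = 2.0 * np.pi * np.arange(N) / N
z = np.exp(1j * theta)  # unit-circle contour
dz = 1j * z             # dz/d(theta)

# rectangle rule over the full circle
integral = np.sum(f(z) * dz) * (2.0 * np.pi / N)
residue = integral / (2j * np.pi)  # recovers 5
```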
<p>By the way: Fourier series are closely related to contour integrals, and thus to complex Taylor series. On the unit circle you can change variables to write \(\frac{1}{2 \pi i} \oint_C \frac{F(z)}{z^{k+1}} dz\) as \(\frac{1}{2 \pi} \oint_C F(e^{i \theta})e^{-ik\theta} d\theta\), which is clearly a Fourier transform for suitable \(F\).</p>
<aside class="toggleable" id="fourier" placeholder="<b>Aside</b>: Contour Integrals as Fourier Transforms <em>(click to expand)</em>">
<p>Here’s a heuristic argument, skipping all the analytical details because I don’t know or care about them.</p>
<p>If a function \(f(x)\) on the real axis has a Fourier series in the finite interval \((0,2 \pi )\), then it can be written as a series of oscillations at different frequencies:</p>
\[f(x) = \sum F(k) e^{i k x}\]
<p>The Fourier transform extracts these \(F(k)\) coefficients:</p>
\[F(k) = \frac{1}{2 \pi} \int_0^{2 \pi} e^{-ikx} f(x) dx\]
<p>Now we change variables: \(z = e^{ix}\), \(x = \frac{1}{i} \ln z\) and \(dx = \frac{dz}{ i z}\). This turns the integral on a line segment \((0, 2\pi)\) into a <em>contour integral</em> around the origin (obviously this is why I used the range \((0, 2\pi)\) in the first place).</p>
\[f(\frac{1}{i} \ln z) = \sum_k F(k) z^k\]
\[\begin{aligned}
F(k) &= \frac{1}{2 \pi} \oint_C z^{-k} f(x) \frac{dz}{iz} \\
&= \frac{1}{2 \pi i} \oint_C \frac{1}{ z^{k+1}} \sum_{k'} F(k') z^{k'} dz \\
&= \frac{1}{2 \pi i} \oint_C \sum_{k'} \frac{F(k')}{ z^{k - k' + 1}} dz \\
&= \sum_{k'} \delta(k - k') F(k') \\
&= F(k)
\end{aligned}\]
<p>This generally works for functions defined on any finite range; we can modify the variables appropriately to move the contour bounds to a single loop.</p>
<p>This logic shows that the ‘orthogonalization’ property of Fourier transforms is basically the same thing as the ability of contour integrals to pull out \(f_{-1}\) coefficients. Actually, if anything, I think the Fourier transform is more fundamental. A contour integral of a function \(\oint_C \frac{f(z) dz}{z^{k+1}}\), when written in polar coordinates, is clearly related to a Fourier transform:</p>
\[\begin{aligned}
\oint_C \frac{f(z) dz}{z^{k+1}} &=
\oint_C \frac{f(re^{i \theta}) e^{i\theta} dr}{r^{k+1} e^{(k+1)i \theta}} +
\oint_C \frac{f(re^{i \theta}) ire^{i\theta} d\theta}{r^{k+1} e^{(k+1)i \theta}} \\
&=
i \oint_C f(re^{i \theta}) r^{-k} e^{-ik\theta} d\theta
\end{aligned}\]
<p>The first integral term vanishes because \(r\) is the same on both ends of the contour. The second integral is, or is massageable into (if the contour takes a funny path), a Fourier transform.</p>
<p>I like this realization. Of the two concepts, I feel like Fourier transforms deal with the more tangible concept – of breaking a function down into its frequency components. The contour integral turns out to measure which components of \(F\) are rotating in such a way as to exactly cancel out the rotating path around the pole at \(\frac{1}{z}\). Though I’m not sure how to reconcile the fact that contour integrals can count paths around multiple poles at the same time – that seems equivalent to taking Fourier transforms in multiple frequencies in a single integral, and I’m not sure why you would want to do that.</p>
</aside>
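<p>Here’s a numerical version of that claim: for a Laurent polynomial, the contour integral \(\frac{1}{2\pi i} \oint \frac{F(z)}{z^{k+1}} dz\), discretized on the unit circle, is literally a discrete Fourier transform and pulls out the coefficient of \(z^k\). (The example coefficients are arbitrary choices of mine.)</p>

```python
import numpy as np

# F(z) = 2/z + 3 + 7 z^2; the contour integral should extract each coefficient.
def F(z):
    return 2.0 / z + 3.0 + 7.0 * z**2

N = 4096
theta = 2.0 * np.pi * np.arange(N) / N
z = np.exp(1j * theta)  # unit circle, so the r^{-k} factor is 1

def coefficient(k):
    # (1/2 pi i) contour integral of F(z) z^{-(k+1)} dz,
    # i.e. (1/2 pi) integral of F(e^{i theta}) e^{-i k theta} d(theta)
    return np.sum(F(z) * np.exp(-1j * k * theta)) / N

coeffs = [coefficient(k) for k in (-1, 0, 1, 2)]  # ~ [2, 3, 0, 7]
```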
Infinite Summations and You2018-11-01T14:00:00+00:00http://alexkritchevsky.com/2018/11/01/summations<p>You may have seen that <a href="https://www.youtube.com/watch?v=w-I6XTVZXww">Youtube video</a> by Numberphile that circulated the social media world a few years ago. It showed an ‘astounding’ mathematical result:</p>
\[1+2+3+4+5+\ldots = -\frac{1}{12}\]
<p>(quote: “the answer to this sum is, remarkably, minus a twelfth”)</p>
<p>Then they tell you that this result is used in many areas of physics, and show you a page of a string theory textbook (<em>oooo</em>) that states it as a theorem.</p>
<p>The video caused a bit of an uproar at the time, since it was many people’s first introduction to the (rather outrageous) idea and they had all sorts of (very reasonable) objections.</p>
<p>I’m interested in talking about this because: I think it’s important to think about how to deal with experts telling you something that seems insane, and this is a nice microcosm for that problem.</p>
<p>Because, well, the world of mathematics seems to have been irresponsible here. It’s fine to get excited about strange mathematical results. But it’s not fine to present something that requires a lot of asterisks and disclaimers as simply “true”. The equation is <em>true</em> only in the sense that if you subtly change the meanings of lots of symbols, it can be shown to become true. But that’s not the same thing as quotidian, useful, everyday truth. And now that this is ‘out’, as it were, we have to figure out how to cope with it. Is it true? False? Something else? Let’s discuss.</p>
<!--more-->
<hr />
<h2 id="the-proof">The Proof</h2>
<p>First, here’s the ‘proof’ from the video.</p>
<p>Start with the simpler sum:</p>
\[P = 1 - 1 + 1 - 1 + 1\ldots\]
<p>Clearly the value of P oscillates between 1 and 0 depending on how many terms you include. Numberphile decides that it equals \(\frac{1}{2}\), because that’s halfway in the middle. Alternatively, consider P+P with the terms interleaved, and then let’s do some quirky arithmetic:</p>
\[\begin{aligned}
P+P = 1&-1+1-1\ldots \\
&+ 1-1+1\ldots \\
= 1 &+ (-1+1) + (1-1) \ldots \\
= 1 &+ 0 + 0 + \ldots \\
= 1 &\, \end{aligned}\]
<p>So \(2P = 1\), so \(P = \frac{1}{2}\) we guess?</p>
<p>Now consider \(Q = 1-2+3-4+5\ldots\), and write out \(Q+Q\) this way:</p>
\[\begin{aligned}
Q+Q = 1&-2+3-4+5\ldots \\
&+ 1 -2+3-4\ldots \\
= 1&-1+1-1+1 \\
\Ra 2Q = \frac{1}{2} &\, \end{aligned}\]
<p>So \(Q = \frac{1}{4}\).</p>
<p>Now consider \(S = 1+2+3+4+5\ldots\), and write \(S-4S\) as</p>
\[\begin{aligned} S - 4S &= (1+2+3+4+5\ldots) \\
&- (0 + 4 + 0 + 8 + \ldots) \\
&=1-2+3-4+5\ldots \\
-3S &= Q=\frac{1}{4} \\
\Ra S &= \boxed{-\frac{1}{12}}\end{aligned}\]
<p class="indent">How do you feel about that? Probably amused but otherwise not very good, regardless of your mathematical background. But in another way it’s really convincing – I mean, by god! – string theorists use it! And, to quote the video, “these kinds of sums appear all over physics” (which I think isn’t really true, but they do appear occasionally).</p>
<p class="indent">So the question is this: when you see a video or hear a proof like this, do you <em>believe them</em>? Even if it’s not your field, and not in your area of expertise, do you believe someone who tells you “even though you thought mathematics worked this way, it actually doesn’t; it’s still totally mystical and insane results are lurking just around the corner if you know where to look”? What if they tell you string theorists use it, and it appears all over physics?</p>
<p class="indent">I imagine this as a sort of rationality litmus test. See how you react to the video or the proof (or remember how you reacted when you initially heard this argument). Is it the ‘rational response’? How do you weigh your own intuitions vs a convincing argument from authority plus math that seems to somehow work, if you turn your head a bit sideways?</p>
<p class="indent">If you don’t believe them, what does that feel like? How confident are you?</p>
<hr />
<h2 id="the-problem">The Problem</h2>
<p class="indent">It’s totally true that, as an everyday thinker (or even as a professional scientist or mathematician), there will always be computational conclusions that are out of your reach to verify. You pretty much have to believe theoretical physicists who tell you “the Standard Model of particle physics accurately models reality and predicts basically everything we see at the subatomic scale with unerring accuracy”; you’re likely in no position to argue, and if you are there’s something else you’re not equipped to argue about.</p>
<p class="indent">But - and this is the point - it’s <strong>highly unlikely that all of your cognitive tools are lies</strong>, even if ‘experts’ say so, and you ought to require extraordinary evidence to be convinced that they are. It’s not enough that someone out there can contrive a plausible-sounding argument that you don’t know how to refute, if your tools are logically sound and their claims don’t fit into that logic.<sup id="fnref:lies" role="doc-noteref"><a href="#fn:lies" class="footnote" rel="footnote">1</a></sup></p>
<p class="indent">On the other hand, if you believe something because you heard it was a good idea from one expert, and then another expert tells you a contradictory thing, then, all other things being equal, you have essentially no way of choosing. Take your pick; there’s no way to tell. If one is more convincing than the other, that’s <em>evidence</em>, but still not useful if you can’t understand the argument enough to validate it. But \(1+2+3+\ldots = -\frac{1}{12}\) isn’t like that. It’s the personal experience – our ability to also <em>naively</em> reason on the problem – that makes this example important.</p>
<p class="indent">I think that the correct response to this argument is to say “no, I don’t believe you”, and hold your ground. Because the claim made in the video is so absurd that, even if you believe the video is correct and made by experts and the string theory textbook actually says that, you should consider a wide range of other explanations as to “how it could have come to be that people are claiming this” before accepting that addition might work in such an unlikely way.</p>
<p class="indent">Not because you know about how infinite sums work better than a physicist or mathematician does, but because you know how mundane addition works just as well as they do, and if a conclusion this shattering to your model comes around – even to a layperson’s model of how addition works, that adding positive numbers to positive numbers results in bigger numbers –, then either “everything is broken” or “I’m going insane” or “they and I are somehow talking about different things”. If they’re definitely reputable and you’re definitely thinking clearly, Occam’s Razor should dictate that the third option be the most appealing.</p>
<p class="indent">That is, the unreasonable mathematical result is because the mathematician or physicist is talking about one “sense” of addition or equality, but it’s not the same one that you’re using when you do everyday sums or when you apply your intuitions about intuition to everyday life. This is by far the simplest explanation: addition works just how you thought it does, even in your inexpertise; you and the mathematician are just talking past each other somehow, and you don’t have to know what way that is to be pretty sure that it’s happening. Anyway, there’s no reason expert mathematicians can’t be amateur communicators, and even that is a much more palatable result than what they’re claiming.</p>
<p class="indent">When we’re trying to figure out what’s true in the world, we take our existing model and incorporate evidence. That can be arguments, or the credentials of the arguer, or our confidence in our own ability to think rationally. And it’s important, in the process of trying to hang on through this mess of evidence-evaluation, that we can be <em>confident about our basic understanding</em> – or we get into situations like this one, where we’re told something absurd but have been worn down to the point of not being sure that we are epistemologically ‘allowed’ to reject it.</p>
<p>In short: yes, you know how arithmetic works, and it doesn’t work like this. You are right to say “no, you’re messing with me”, no matter how many Youtube videos there are about it. As it happens, my view is that any trained mathematician who claims that \(1+2+3+4+5\ldots = -\frac{1}{12}\) without ample qualification is so incredibly confused or poor at communicating or actually just evil that they ought to be sent back to school.</p>
<hr />
<h2 id="postscript-the-explanation">Postscript: The Explanation</h2>
<p>There’s no shortage of explanations of this online, and a mountain of them emerged after this video became popular.</p>
<p>I’ll write out a simple heuristic version anyway for if you’re curious.</p>
<p class="indent">It turns out that, yes, there is a <em>sense</em> in which those summations are valid, but it’s not the sense you’re using when you perform ordinary addition. It’s also true that the summations emerge in physics. It’s also true that the validity of these summations is <em>in spite of</em> the rules of “you can’t add, subtract, or otherwise deal with infinities, and yes all these sums diverge” that you learn in introductory calculus; it turns out that those rules are also elementary and there are ways around them but you have to be very rigorous to get them right.</p>
<p class="indent">An elementary explanation of what happened in the proof is that, in all three infinite sum cases, it is possible to interpret the infinite sum in a more accurate form (but STILL not precise enough to use for regular arithmetic, because infinities are very much not valid, still, we’re serious), which is by understanding that the result is “actually” something like this:</p>
\[S(\infty) = 1+2+3+4+5\ldots \approx -\frac{1}{12} + O(\infty)\]
<p>where \(S(n)\) is a function giving the n’th partial sum of the series, and \(S(\infty)\) is what happens when you formally extend \(S(n)\) to take a limit at \(n \ra \infty\). The \(O(\infty)\) part means “something on the order of infinity”.</p>
<p class="indent">That \(O(\infty)\) bit is in there, but doesn’t necessarily disrupt arithmetic on the finite part, which is why algebraic manipulations still seem to work. And it’s true that this series is always found to have the finite part \(-\frac{1}{12}\), if you stick to a certain type of ‘valid’ manipulations. (Well, there may be other kinds of summation techniques that get different results. But this value is not just randomly one among many associated with this summation; you can get this same answer in different ways.)</p>
<p class="indent">In fact, if you just graph the partial sums \(S(n)\), the smooth curve which approximates them <a href="https://en.wikipedia.org/wiki/1_%2B_2_%2B_3_%2B_4_%2B_%E2%8B%AF">apparently</a> intercepts the \(y\) axis at \((0, -\frac{1}{12})\).</p>
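<p>One concrete way to see that finite part, sketched in code: replace the divergent sum with the smoothed sum \(\sum_n n e^{-n/N}\) (a standard regularization trick, not the video’s argument). It evaluates to \(N^2 - \frac{1}{12} + O(1/N^2)\), so after discarding the divergent \(N^2\) piece, the constant left over is \(-\frac{1}{12}\):</p>

```python
import math

def smoothed_sum(N):
    # sum of n * e^{-n/N}; terms beyond n = 50N are negligibly small
    return sum(n * math.exp(-n / N) for n in range(1, 50 * N))

N = 100
finite_part = smoothed_sum(N) - N**2  # ~ -1/12 ~ -0.0833
```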
<p class="indent">And the fact that the series emerges in physics is complicated but amounts to the fact that, in the particular way we’ve glued math onto physical reality, we’ve constructed a framework that also doesn’t care about the infinity term (it’s rejected as “nonphysical!” and it’s a nightmare), and so we get the right answer despite dubious math. But physicists are fine with that, because it seems to be working and they don’t know a better way to do it yet.</p>
<p class="indent">So there you go. The sense in which it is true is: when you go mucking about with infinity, you make new rules and new definitions, and weird facts emerge. Until we’ve got a new theory of infinity, it remains “kind of true” that \(1+2+3+\ldots = - \frac{1}{12}\), but please don’t go telling anyone that to blow their mind, because you’d probably be more wrong than right.</p>
<hr />
<p>(This is adapted from something I posted on <a href="http://lesswrong.com/r/discussion/lw/oht/infinite_summations_a_rationality_litmus_test/">LessWrong</a> a few years ago. Copied here for posterity and improvement.)</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:lies" role="doc-endnote">
<p>Incidentally, if all your cognitive tools <em>are</em> lies, then you also don’t have much to worry about, because you’ve lost the ability to rationally process reality, so it’s not going to make a big difference who you believe about abstruse things because you’re going to be wrong about everything anyway. <a href="#fnref:lies" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Exterior Algebra Notes #2: the Inner Product2018-10-09T00:00:00+00:00http://alexkritchevsky.com/2018/10/09/exterior-2<p><em>(See this <a href="/2018/10/08/exterior-1.html">previous post</a> for some of the notations used here.)</em></p>
<p><em>(Not intended for any particular audience. Mostly I just wanted to write down these derivations in a presentable way because I haven’t seen them from this direction before.)</em></p>
<p><em>(Vector spaces are assumed to be finite-dimensional and over \(\bb{R}\))</em></p>
<p>Exterior algebra is obviously useful any time you’re anywhere near a cross product or determinant. I want to show how it also comes with an inner product which can make certain formulas in the world of vectors and matrices vastly easier to prove.</p>
<!--more-->
<hr />
<h2 id="1-the-inner-product">1. The Inner Product</h2>
<p>Euclidean vectors have an inner product that we use all the time. Multivectors are just vectors, so what’s their inner product? Since they are just vectors, we should be able to sum over the components of two multivectors the same way we do for ordinary vectors with \(\< \b{u} , \b{v} \> \equiv \sum_{i \in V} u_i v_i\), like this:</p>
\[\boxed{\< \b{u} , \b{v} \> \equiv \sum_{\^ I \in \^^k V} u_{\^ I} v_{\^ I}} \tag{1}\]
<p>Which lets us compute the magnitudes of areas in the expected way:</p>
\[\| a \b{x \^ y} + b \b{y \^ z} + c \b{z \^ x} \|^2 = a^2 + b^2 + c^2\]
<p>This turns out to work, although the usual presentation is pretty confusing. The <a href="https://en.wikipedia.org/wiki/Exterior_algebra#Inner_product">standard way</a> to define inner products on the exterior algebra \(\^^k V\), extending the inner product defined on the underlying vector space \(V\), looks like this:</p>
\[\< \bigwedge_{i=1}^k \b{a}_i , \bigwedge_{i=1}^k \b{b}_i \> = \det \< \b{a}_i, \b{b}_j \>\]
<p>This is then extended linearly if either argument is a sum of multivectors. This expression is pretty confusing. It turns out to be the same as (1), but it takes a while to see why.</p>
<p>The left side of the standard definition is the inner product of two \(k\)-vectors (each are the wedge product of \(k\) factors together); the right side is the determinant of a \(k \times k\) matrix. For instance:</p>
\[\< \b{a}_1 \^ \b{a}_2 , \b{b}_{1} \^ \b{b}_{2} \> =
\begin{vmatrix}
\< \b{a}_1 , \b{b}_1 \> & \< \b{a}_1 , \b{b}_2 \> \\
\< \b{a}_2 , \b{b}_1 \> & \< \b{a}_2 , \b{b}_2 \> \end{vmatrix}\]
<p>Simple examples:</p>
\[\begin{aligned} \< \b{x\^ y} , \b{x \^ y} \> &= 1 \\
\< \b{x\^ y} , \b{y \^ x} \> &= -1 \\
\< \b{x\^ y} , \b{y \^ z} \> &= 0 \\
\end{aligned}\]
<p>If we label the basis \(k\)-vectors using multi-indices \(I = (i_1, i_2, \ldots i_k)\), where no two \(I\) contain the same set of elements up to permutation, then this amounts to saying that basis multivectors are orthonormal:<sup id="fnref:id" role="doc-noteref"><a href="#fn:id" class="footnote" rel="footnote">1</a></sup></p>
\[\boxed{\< \b{x}_{\^ I} , \b{x}_{\^ J} \> = 1_{IJ}}\]
<p>And then extending this linearly to all elements of \(\^^k V\), <sup id="fnref:sign" role="doc-noteref"><a href="#fn:sign" class="footnote" rel="footnote">2</a></sup> which gives (1).</p>
<p>This gives an orthonormal basis on \(\^^k V\), and the first thing we’ll do is define the ‘\(k\)-lengths’ of multivectors, in the same way that we compute the squared length of a vector via \(\| \b{v} \|^2 = \< \b{v} , \b{v} \>\):</p>
\[\| \bigwedge_i \b{a}_i \|^2 = \< \bigwedge_i \b{a}_i , \bigwedge_i \b{a}_i \> = \det \< \b{a}_i , \b{a}_j \>\]
<p>This is called the <a href="https://en.wikipedia.org/wiki/Gramian_matrix">Gram determinant</a> of the ‘Gramian’ matrix formed by the vectors of \(\b{a}\). It’s non-zero if the vectors are linearly independent, which clearly corresponds to the wedge product \(\bigwedge_i \b{a}_i\) not being \(=0\) in the first place.</p>
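<p>Definition (1) and the Gram-determinant definition are easy to compare numerically. A sketch for bivectors in \(\bb{R}^4\), with arbitrary random vectors:</p>

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
a, b, c, d = rng.standard_normal((4, 4))  # four random vectors in R^4

def wedge(u, v):
    # components (u ^ v)_{ij} = u_i v_j - u_j v_i, for i < j
    return np.array([u[i] * v[j] - u[j] * v[i]
                     for i, j in combinations(range(len(u)), 2)])

# definition (1): componentwise sum over the basis bivectors
lhs = wedge(a, b) @ wedge(c, d)

# the standard definition: determinant of the matrix of pairwise inner products
gram = np.array([[a @ c, a @ d],
                 [b @ c, b @ d]])
rhs = np.linalg.det(gram)

agreement = abs(lhs - rhs)  # ~ 0: the two definitions coincide
```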
<p>It turns out that multivector inner products show up in disguise in a bunch of vector identities.</p>
<hr />
<h2 id="2-computation-of-identities">2. Computation of Identities</h2>
<p>Let’s get some practice with (1).</p>
<p>In these expressions, I’m going to be juggling multiple inner products at once. I’ll denote them with subscripts: \(\<,\>_{\^}\), \(\<,\>_{\o}\), \(\<,\>_{V}\).</p>
<p>The types are:</p>
<ul>
<li>the underlying inner product on \(V\), which only acts on vectors: \(\< \b{u}, \b{v} \>_V = \sum_i u_i v_i\).</li>
<li>the induced inner product on \(\o V\), which acts on tensors of the same grade term-by-term: \(\< \b{a \o b}, \b{c \o d} \>_{\o} = \< \b{a , c} \>_V \< \b{b, d } \>_V\)</li>
<li>the induced inner product on \(\^ V\), which we described above: \(\< \b{a \^ b}, \b{c \^ d} \>_{\^} = \< \b{a , c} \>_V \< \b{b, d } \>_V - \< \b{a , d} \>_V \< \b{b, c } \>_V\).</li>
</ul>
<p>Let \(\text{Alt}\) be the <em>Alternation Operator</em>, which takes a tensor product to its total antisymmetrization, e.g. \(\text{Alt}(\b{a \o b}) = \b{a \o b - b \o a}\). For a tensor with \(N\) factors, there are \(N!\) components in the result.<sup id="fnref:factorial" role="doc-noteref"><a href="#fn:factorial" class="footnote" rel="footnote">3</a></sup></p>
<p>The general procedure for computing \(\< \bigwedge_i \b{a}_i , \bigwedge_j \b{b}_j \>_\^\) by hand is to <strong>expand one side into a tensor product and the other into an antisymmetrized tensor product</strong>. Which side is which doesn’t matter, so let’s standardize by putting it on the right.<sup id="fnref:alt" role="doc-noteref"><a href="#fn:alt" class="footnote" rel="footnote">4</a></sup> If you put it on both sides, you would need to divide the whole expression by \(N!\), which is annoying (but some people do it).</p>
\[\begin{aligned}
\< \bigwedge_i \b{a}_i , \bigwedge_j \b{b}_j \>_{\^} &= \det \< \b{a}_i , \b{b}_j \>_V \\
&= \sum_{\sigma \in S_k} \sgn(\sigma) \prod_i \< \b{a}_i, \b{b}_{\sigma(i)} \>_V \\
&= \< \bigotimes_i \b{a}_i , \sum_{\sigma \in S_k} \sgn(\sigma) \bigotimes_j \b{b}_{\sigma(j)}\>_{\o} \\
&= \< \bigotimes_i \b{a}_i, \text{Alt}(\bigotimes_j \b{b}_j) \>_{\o} \\
&= \< \text{Alt}(\bigotimes_i \b{a}_i), \bigotimes_j \b{b}_j \>_{\o} \end{aligned}\]
<p>Here’s an example of this on bivectors:</p>
\[\begin{aligned}
\< \b{a \^ b }, \b{c \^ d} \>_\^ &= \< \b{a \o b}, \b{c \o d} - \b{d \o c} \>_{\o} \\
&= \< \b{a , c} \>_V \< \b{b, d } \>_V - \< \b{a , d} \>_V \< \b{b, c } \>_V \tag{2}
\end{aligned}\]
<hr />
<p>Now, some formulas which turn out to be the multivector inner product in disguise.</p>
<p>The inner product between two area bivectors, as we have seen, is</p>
\[(\b{a \^ b}) \cdot (\b{c \^ d}) = (\b{a \cdot c}) (\b{b \cdot d}) - (\b{a \cdot d})(\b{b \cdot c})\]
<p>Set \(\b{a = c}\), \(\b{b = d}\) in (2) and relabel to get <a href="https://en.wikipedia.org/wiki/Lagrange%27s_identity">Lagrange’s Identity</a>:</p>
\[| \b{a} \^ \b{b} |^2 = | \b{a} |^2 | \b{b} |^2 - (\b{a} \cdot \b{b})^2\]
<p>If you’re working in \(\bb{R}^3\), a lot of familiar identities fall out after turning \(\^\) into the cross product \(\times\) using the <a href="https://en.wikipedia.org/wiki/Hodge_star_operator">Hodge Star map</a> \(\star\). We haven’t studied this yet, but in \(\bb{R}^3\) it usually means that formulas that hold for \(\^\) also hold for \(\times\).</p>
<p>Transforming our bivector inner product, we immediately get the <a href="https://en.wikipedia.org/wiki/Binet%E2%80%93Cauchy_identity">Binet-Cauchy identity</a>:</p>
\[\begin{aligned}
(\b{a \^ b}) \cdot (\b{c \^ d}) &= (\b{a \times b}) \cdot (\b{c \times d})\\
&= \b{(a \cdot c) (b \cdot d) - (a \cdot d) (b \cdot c)}
\end{aligned}\]
<p>The trivector version gives the product of two <a href="https://en.wikipedia.org/wiki/Triple_product">scalar triple products</a>, which is quite a bit harder to see without this framework:</p>
\[\< \b{a \^ b \^ c}, \b{x \^ y \^ z} \> = (\b{a} \cdot (\b{b \times c})) (\b{x} \cdot (\b{y \times z})) \\
= \det \begin{pmatrix} \b{a \cdot x} & \b{a \cdot y} & \b{a \cdot z} \\
\b{b \cdot x} & \b{b \cdot y} & \b{b \cdot z} \\
\b{c \cdot x} & \b{c \cdot y} & \b{c \cdot z} \end{pmatrix}\]
<p>Set \(\b{a=c}\), \(\b{b = d}\) in the two-vector version to get a more familiar version of Lagrange’s identity:</p>
\[| \b{a} \times \b{b} |^2 = | \b{a} |^2 | \b{b} |^2 - (\b{a} \cdot \b{b})^2\]
<p>Drop the cross product term to get <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Schwarz_inequality">Cauchy-Schwarz</a>:</p>
\[(\b{a} \cdot \b{b})^2 \leq | \b{a} |^2 | \b{b} |^2\]
<p>I thought that was neat. Maybe there are places where Cauchy-Schwarz is used where in fact Lagrange’s identity would be more useful?</p>
<p>On vectors in \(\bb{R}^2\) (or any dimension), the vector magnitude of course gives the Pythagorean theorem:</p>
\[| a \b{x} + b \b{y} |^2 = a^2 + b^2 = c^2\]
<p>This generalizes to the bivector areas of an orthogonal tetrahedron (or \((n-1)\)-vector surface areas of an \(n\)-simplex in any dimension), which is called <a href="https://en.wikipedia.org/wiki/De_Gua%27s_theorem">De Gua’s Theorem</a>. For instance in \(\bb{R}^3\), the squared surface area of the ‘hypotenuse’ side of an orthogonal tetrahedron is the sum of the squared areas of its other faces:</p>
\[\| a \b{x \^ y} + b \b{y \^ z} + c \b{z \^ x} \| ^2 = a^2 + b^2 + c^2\]
<p>This is because the total surface area bivector for a closed figure in \(\bb{R}^3\) is \(0\), so the surface area bivector of the opposing face is exactly \(-(a \b{x \^ y} + b \b{y \^ z} + c \b{z \^ x} )\).</p>
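<p>De Gua’s theorem is easy to verify numerically. A Python sketch (the helper and variable names are mine), using the tetrahedron with legs \(\alpha, \beta, \gamma\) along the coordinate axes:</p>

```python
def wedge2(u, v):
    # bivector u ^ v in R^3, components in the (x^y, y^z, z^x) basis
    return (u[0]*v[1] - u[1]*v[0],
            u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2])

alpha, beta, gamma = 1.0, 2.0, 3.0          # legs along x, y, z
A, B, C = (alpha, 0, 0), (0, beta, 0), (0, 0, gamma)

# hypotenuse face bivector = half the wedge of two edge vectors
AB = tuple(b - a for a, b in zip(A, B))
AC = tuple(c - a for a, c in zip(A, C))
hyp = tuple(h / 2 for h in wedge2(AB, AC))

leg_areas = (alpha*beta/2, beta*gamma/2, gamma*alpha/2)  # the three right-angle faces
print(sum(h*h for h in hyp), sum(s*s for s in leg_areas))  # both 12.25
```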
<p>There is naturally a version of the law of cosines for any tetrahedron or \(n\)-simplex with non-orthogonal sides as well. If \(\vec{c} = \vec{a} + \vec{b}\), then (the usual law of cosines is stated with \(\vec{c} = \vec{b} - \vec{a}\) instead, which flips the sign of the cross term):</p>
\[\begin{aligned}
\| \vec{c} \|^2 &= \|\vec{a}\|^2 + \|\vec{b}\|^2 + 2 \vec{a} \cdot \vec{b}
\\ &=\|\vec{a}\|^2 + \|\vec{b}\|^2 + 2 \| \vec{a} \| \| \vec{b}\| \cos \theta_{ab}
\end{aligned}\]
<p>We can easily expand \(\| \b{a} + \b{b} + \b{c} \|^2\) linearly when \(\b{a}, \b{b}, \b{c}\) are bivectors or anything else; the angles in the cosines become <a href="https://en.wikipedia.org/wiki/Dihedral_angle">angles between planes</a>, or something fancier, but the formula is otherwise the same:</p>
\[\| \b{a} + \b{b} + \b{c} \|^2 = |\b{a}|^2 + |\b{b}|^2 + |\b{c}|^2 + 2(\b{a} \cdot \b{b} + \b{a} \cdot \b{c} + \b{b} \cdot \b{c})\]
<p>Which is kinda cool.</p>
<hr />
<h2 id="3-matrix-multiplication">3. Matrix Multiplication</h2>
<p>Here’s my favorite thing that is easily understood with \(\<,\>_\^\): the generalizations of \(\det(BA) = \det(B) \det(A)\).</p>
<p>Let \(A: U \ra V\) and \(B: V \ra W\) be linear transformations. Their composition \(B\circ A\) has matrix representation:</p>
\[(BA)_i^k = \sum_{j \in V} A_i^j B_j^k = \<A_i, B^k \>\]
<p>The latter form expresses the fact that each matrix entry in \(BA\) is an inner product of a column of \(A\) with a row of \(B\).</p>
<p>Because \(A^{\^ q} : \^^q U \ra \^^q V\) and \(B^{\^ q} : \^^q V \ra \^^q W\) are also linear transformations, their composition \(B^{\^ q} \circ A^{\^ q} : \^^q U \ra \^^q W\) also has a matrix representation:</p>
\[(B^{\^ q} A^{\^ q})_I^K = \sum_{J \in \^^q V} (A^{\^ q})_I^J (B^{\^ q})_J^K = \< A^{\^ q}_I, (B^{\^ q})^K \>\]
<p>Where \(I,J,K\) are indexes over the appropriate \(\^^q\) spaces.</p>
<p>\(A^{\^ q}_I\) is the wedge product of the \(I = (i_1, i_2, \ldots i_q)\) columns of \(A\), and \((B^{\^ q})^K\) is the wedge product of \(q\) rows of \(B\) from \(K\), which means this is just the inner product we discussed above.</p>
\[(B^{\^ q} A^{\^ q})_I^K = \det_{i \in I, k \in K} \< A_{i}, B^{k} \>\]
<p>But this is just the determinant of a minor of \((BA)_i^k\) – the one indexed by \((I,K)\). This means that:</p>
\[(B^{\^ q} A^{\^ q})_I^K = ((BA)^{\^ q})_I^K\]
<p>And thus:</p>
\[\boxed{B^{\^ q} A^{\^ q} = (BA)^{\^ q}} \tag{3}\]
<p>This is called the <a href="https://en.wikipedia.org/wiki/Cauchy%E2%80%93Binet_formula">Generalized Cauchy-Binet formula</a>. Note that \((3)\) does not require that the matrices be square.</p>
<p>Which is neat. I think this version is <em>way</em> easier to remember or use than the version in Wikipedia, which is expressed in terms of matrix minors and determinants everywhere.</p>
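<p>Formula \((3)\) can be spot-checked directly. Here is a Python sketch (all helper names are mine) that builds \(A^{\^ q}\) as the matrix of \(q \times q\) minors and verifies Cauchy-Binet on a pair of non-square matrices. The multi-indexes here are ordered lexicographically (\(\b{x \^ y}, \b{x \^ z}, \b{y \^ z}\)) rather than cyclically; any consistent choice works, since a reordering is just a basis change.</p>

```python
from itertools import combinations, permutations

def det(M):
    # Leibniz formula for the determinant
    n = len(M)
    total = 0
    for sigma in permutations(range(n)):
        inv = sum(1 for i in range(n) for j in range(i + 1, n) if sigma[i] > sigma[j])
        p = 1
        for i in range(n):
            p *= M[i][sigma[i]]
        total += (-1) ** inv * p
    return total

def matmul(B, A):
    return [[sum(B[i][j] * A[j][k] for j in range(len(A))) for k in range(len(A[0]))]
            for i in range(len(B))]

def wedge_power(A, q):
    # A^{^q}: entry (K, I) is the determinant of the minor of A with rows K, columns I,
    # with both multi-indexes in lexicographic order
    rows, cols = len(A), len(A[0])
    Ks = list(combinations(range(rows), q))
    Is = list(combinations(range(cols), q))
    return [[det([[A[k][i] for i in I] for k in K]) for I in Is] for K in Ks]

A = [[1, 2], [0, 1], [3, 1]]   # 3x2: U = R^2 -> V = R^3
B = [[2, 0, 1], [1, 1, 0]]     # 2x3: V = R^3 -> W = R^2
q = 2
lhs = wedge_power(matmul(B, A), q)
rhs = matmul(wedge_power(B, q), wedge_power(A, q))
print(lhs == rhs)  # True
```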
<p>Some corollaries:</p>
<p>When \(A\) and \(B\) are maps on the same space (\(U = V = W\)) and \(q = \dim V\), then all of the wedge powers turn into determinants, giving</p>
\[\det(BA) = \det(B) \det(A)\]
<p>If \(B = A^T\), then even if \(A\) is not square, the determinant of the square matrix \(A^T A\) is the sum of squared determinants of minors of \(A\). If \(A\) is \(n \times k\), then \(\^^k U\) is one-dimensional, spanned by a single basis \(k\)-vector \(I\), and this is a sum over all \(k \times k\) minors of \(A\):</p>
\[\begin{aligned}
\det (A^T A) &= \^^k (A^T A) \\
&= \^^k (A^T) \, \^^k A \\
&= \sum_{J \in \^^k V} (A_I^J)^2
\end{aligned}\]
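<p>A quick numerical check of this corollary on a \(3 \times 2\) matrix (Python; the variable names are mine):</p>

```python
from itertools import combinations

A = [[1, 2], [0, 1], [3, 1]]   # n x k with n = 3, k = 2

# det(A^T A) computed directly
ATA = [[sum(A[r][i] * A[r][j] for r in range(3)) for j in range(2)] for i in range(2)]
det_ATA = ATA[0][0] * ATA[1][1] - ATA[0][1] * ATA[1][0]

# sum of squared 2x2 minors, one per pair of rows
minors = [A[r][0] * A[s][1] - A[r][1] * A[s][0] for r, s in combinations(range(3), 2)]
print(det_ATA, sum(m * m for m in minors))  # 35 35
```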
<hr />
<p>Other articles related to Exterior Algebra:</p>
<ol start="0">
<li><a href="/2018/08/06/oriented-area.html">Oriented Areas and the Shoelace Formula</a></li>
<li><a href="/2018/10/08/exterior-1.html">Matrices and Determinants</a></li>
<li><a href="/2018/10/09/exterior-2.html">The Inner product</a></li>
<li><a href="/2019/01/26/hodge-star.html">The Hodge Star</a></li>
<li><a href="/2019/01/27/interior-product.html">The Interior Product</a></li>
<li><a href="/2020/10/15/ea-operations.html">All the Exterior Algebra Operations</a></li>
</ol>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:id" role="doc-endnote">
<p>\(1_{ij}\) is a Kronecker delta. I like this notation better than \(\delta_{ij}\) because the symbol \(\delta\) has too many meanings. <a href="#fnref:id" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sign" role="doc-endnote">
<p>If we don’t specify that all of our multi-indices are unique up to permutation, then we would have to write something like \(\< \b{x}_{\^ I}, \b{x}_{\^ J} \> = \sgn(I, J)\), where \(\text{sgn}\) is the sign of the permutation that takes \(I\) to \(J\), since for instance \(\< \b{x \^ y} , \b{y \^ x} \> = -1\). <a href="#fnref:sign" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:factorial" role="doc-endnote">
<p>There are several conventions for defining \(\text{Alt}\). It often comes with a factor of \(\frac{1}{N!}\). If you wanted it to preserve vector magnitudes, you might instead use \(\frac{1}{\sqrt{N!}}\). I prefer to leave it without any factorials, because it makes other definitions much easier. <a href="#fnref:factorial" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:alt" role="doc-endnote">
<p>This will matter later when we define the interior product the same way, but it’s still a matter of preference. <a href="#fnref:alt" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Exterior Algebra Notes #1: Matrices and Determinants2018-10-08T00:00:00+00:00http://alexkritchevsky.com/2018/10/08/exterior-1<p><em>(This is not really an intro to the subject. I don’t have an audience in mind for this. I’ve written my notes out in an expository style because it helps me retain what I study.)</em></p>
<p><em>(Vector spaces are assumed to be finite-dimensional and over \(\bb{R}\) with the standard inner product unless otherwise noted.)</em></p>
<p><a href="https://en.wikipedia.org/wiki/Exterior_algebra">Exterior algebra</a> (also known as ‘multilinear algebra’, which is arguably the better name) is an obscure and technical subject. It’s used in certain fields of mathematics, primarily abstract algebra and differential geometry, and it comes up a lot in physics, often in disguise. I think it ought to be <em>far</em> more widely studied, because it turns out to take a lot of the mysteriousness out of the otherwise technical and tedious subject of linear algebra. But most of the places it turns up it is very obfuscated. So my aim is to study exterior algebra and do some ‘refactoring’: to make it more explicit, so it seems like a subject worth studying in its own right.</p>
<p>In general I’m drawn to whatever makes computation and intuition simple, and this is it. In college I learned about determinants and matrix inverses and never really understood how they work; they were impressive constructions that I memorized and then mostly forgot. Exterior Algebra turns out to make them into simple intuitive procedures that you could rederive whenever you wanted.</p>
<!--more-->
<hr />
<h2 id="1-an-example-problem">1. An example problem</h2>
<p>Suppose: you’re writing a computer graphics library. You want to draw a bunch of objects lit up by the sun, which is in a fixed direction in the sky. Each surface will be drawn with a brightness based on the angle it makes with the light. After all, a surface pointed directly at the sun should be bright, almost white, while a surface perpendicular to it won’t be illuminated at all. To represent the angle each surface faces, you store the <em>normal vector</em> \(\b{n}\), which points out of the surface. The brightness scaling factor \(c\) of that surface is given by the dot product of the normal with the direction of the sun:</p>
\[c = \b{n} \cdot \b{d}_{sun}\]
<p>And this works fine for a while. But eventually you want your illuminated object to be transformed (say, rotated or scaled), so you end up transforming its coordinates several times before doing this calculation. You store the transformation as a matrix \(A\), so a vertex \(\b{v}\) is now transformed to \(A \b{v}\). When it comes time to compute how the <em>normals</em> transform in the lighting calculation, you go with the obvious choice: \(\b{n} \mapsto A \b{n}\). But nothing looks right – every surface is the wrong brightness. You’re confused. You give up and search online and find something like <a href="https://webgl2fundamentals.org/webgl/lessons/webgl-3d-lighting-directional.html">this</a>:</p>
<blockquote>
<p>There is one problem which I don’t know how to show directly so I’m going to show it in a diagram. We’re multiplying the normal by the u_world matrix to re-orient the normals. What happens if we scale the world matrix? It turns out we get the wrong normals. <br />
…<br />
I’ve never bothered to understand the solution but it turns out you can get the inverse of the world matrix, transpose it, which means swap the columns for rows, and use that instead and you’ll get the right answer.</p>
</blockquote>
<p>And at that point you either shrug and say “alright” and compute \(\b{n} \mapsto (A^{-1})^T \b{n}\)… or you quit learning computer graphics entirely and go back to trying to understand exterior algebra. That’s what I did. It’s a deep rabbit hole.</p>
<p>There are other sources where you can learn all about bivectors and multivectors and everything else. I mention this to say that this stuff isn’t just abstract formal nonsense; it’s got actual implications if you use vector mathematics in real life.</p>
<p>The answer, by the way, is that a normal vector that we store as \(\b{n} = n_x \b{x} + n_y \b{y} + n_z \b{z}\) is, geometrically, a bivector, since it represents a unit of area. We’re just storing it as a vector. Its true value is</p>
\[\b{n} = n_x \b{y \^ z} + n_y \b{z \^ x} + n_z \b{x \^ y}\]
<p>Bivectors created from elements of \(\bb{R}^3\) are elements of a vector space called \(\^^2 \bb{R}^3\), which is spanned by \(\{ \b{ x \^ y, y \^ z, z \^ x }\}\). If vectors are transformed by a matrix \(A\), then bivectors transform according to a matrix called \(A^{\^ 2}\), which maps \(\^^2 \bb{R}^3 \ra \^^2 \bb{R}^3\) and is defined by the property:</p>
\[(A^{\^ 2}) (\b{x \^ y}) \equiv A(\b{x}) \^ A(\b{y})\]
<p>and likewise for \(\b{y} \^ \b{z}\) and \(\b{z \^ x}\). Applying this to \(\b{n}\):</p>
\[\begin{aligned}
A^{\^ 2} [n_x \b{y \^ z} + n_y \b{z \^ x} + n_z \b{x \^ y}] = \; &n_x (A\b{y}) \^ (A\b{z}) \\
+ \; &n_y (A \b{z}) \^ (A \b{x}) \\
+ \; &n_z (A \b{x}) \^ (A \b{y})
\end{aligned}\]
<p>And it turns out that in three dimensions, \((A^{-1})^T = A^{\^2} / \det A\) (once you map \(\^^2 \bb{R}^3\) back to \(\bb{R}^3\)), which is how the computer graphics people get away with it: the factor of \(\det A\) washes out when you normalize \(\b{n}\).</p>
<p>For a concrete example, suppose our matrix is \(A = \text{diag}(1, 1, 2)\). So it doubles the \(z\) coordinate and leaves the other directions unchanged. The surface normal of an \(xy\) plane would naively be \(\b{z}\); after transformation, this is \(2 \b{z}\). But the bivector \(\b{x \^ y}\) is unchanged: \(A^{\^2} (\b{x \^ y}) = A\b{x} \^ A \b{y} = \b{x \^ y}\) in the new coordinate system, which is parallel to the vector \(\frac{1}{2} \b{z} = (A^{-1})^T \b{z}\) rather than to the naive \(2\b{z}\).</p>
<p>This is a fairly trivial example (you’d probably be normalizing \(\b{n}\) anyway to be a unit vector!), but it shows how naively using vectors to represent areas can go wrong, and it’s not surprising that more complicated procedures will fail more disastrously.</p>
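<p>Here is a small Python sketch of the failure mode (all names are mine), using the cross product as the \(\bb{R}^3\) stand-in for the bivector. A shear leaves the \(xz\)-plane fixed, so the plane’s normal direction shouldn’t change; the naively-transformed normal is no longer orthogonal to the transformed tangent vectors, while the cross product of the transformed tangents is:</p>

```python
def cross(u, v):
    # u x v, i.e. the Hodge dual of the bivector u ^ v in R^3
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def apply(A, v):
    return tuple(sum(A[i][j] * v[j] for j in range(3)) for i in range(3))

A = [[1, 1, 0],   # a shear: y picks up an x component; the xz-plane is fixed
     [0, 1, 0],
     [0, 0, 1]]

u, v = (1, 0, 0), (0, 0, 1)   # tangent vectors spanning the xz-plane
n = cross(u, v)               # its normal, (0, -1, 0)

naive = apply(A, n)                         # transforming n like an ordinary vector
correct = cross(apply(A, u), apply(A, v))   # transforming the bivector u ^ v instead

# a true normal must stay orthogonal to the transformed tangent vectors
print(dot(naive, apply(A, u)), dot(correct, apply(A, u)))   # -1 0
```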
<hr />
<h2 id="2-multivector-matrices">2. Multivector Matrices</h2>
<p>Here are some notes about how linear transformations work on multivectors, because it took me a while to understand this and I had to write it out.</p>
<p>Some useful notation for multivectors:</p>
<p>A multivector’s components are indexed by the basis elements of the space it lives in, so, just as a vector \(\b{v}\) has a \(v_x\) component, a bivector \(\b{b}\) has a \(b_{x \^ y}\) component. You could ask for the \(b_{y \^ x} = b_{- x \^ y}\) coordinate, but it’s unnecessary. It’s like writing \(v_{-x}\) for \(- v_x\).</p>
<p>Note that the orderings of basis multivectors like \(\b{x \^ y}\) vs \(\b{y \^ x}\) are arbitrary. It’s convenient to use \(\b{(y \^ z, z \^ x, x \^ y)}\) as a basis for \(\^^2 \bb{R}^3\), but the math doesn’t care – different orderings are a basis change away. Same as how you’re free to use \((\b{z}, \b{x}, \b{y})\) as the basis of \(\bb{R}^3\), if you want.</p>
<p>I like to use <em>multi-indexes</em> to refer to multivector components, which are labeled with capital letters like \(I\) and refer to tuples: \(I = (i_1, i_2, \ldots i_k)\) (usually it’s obvious what \(k\) is). \(\b{v}_{\^ I}\) means:</p>
\[\b{v}_{\^ I} = \b{v}_{i_1} \^ \b{v}_{i_2} \^ \ldots \^ \b{v}_{i_k} \in \^^k V\]
<p>I like to abbreviate multi-indexes with just their digits, so \(\b{v}_{\b{x}_1 \^ \b{x}_2}\) might be written as \(\b{v}_{\^ 12}\).</p>
<hr />
<h3 id="21-linear-transformations-on-k-v">2.1 Linear Transformations on \(\^^k V\)</h3>
<p>Suppose we have a linear transformation \(A : U \ra V\). For any \(k\), \(A\) also generates a linear transformation \(A^{\^ k} : \^^k U \ra \^^k V\) (some people write it as \(\^^k A\)):</p>
\[A^{\^ k} \in \^^k U \ra \^^k V\]
<p>\(A^{\^ k}\) is defined by its action on any vector \(\b{u}_{\^ I} = \b{u}_{i_1} \^ \b{u}_{i_2} \ldots \^ \b{u}_{i_k} \in \^^k U\):</p>
\[A^{\^ k} (\b{u}_{\^ I}) \equiv A(\b{u}_{i_1}) \^ A(\b{u}_{i_2}) \^ \ldots A(\b{u}_{i_k})\]
<p>Since \(A^{\^ k}\) is just a linear transformation, we can write it out as a matrix. If \(\dim U = m\) and \(\dim V = n\), then \(\dim \^^k U = {m \choose k}\) and \(\dim \^^k V = {n \choose k}\), so \(A^{\^ k}\) is an \({n \choose k} \times {m \choose k}\) matrix.</p>
<p>The key insight for understanding matrices such as \(A^{\^ k}\) is that they are <em>just matrices</em>, albeit in a different vector space (\(\^^k U \ra \^^k V\)), and all of the usual matrix knowledge applies.</p>
<p>Each column of \(A^{\^ k}\) is the action of \(A^{\^ k}\) on a particular basis \(k\)-vector, which results in a linear combination of basis \(k\)-vectors. It does not matter what order we write the columns in.</p>
<p>Important special cases:</p>
<ul>
<li>\(A^{\^ 1}\) is just \(A\).</li>
<li>If \(A\) is square, then, setting \(\dim U = \dim V = n\), we have \(A^{\^ n} = \det (A) \b{u}_{\^ I}^T \o \b{v}_{\^ I}\). This is the determinant of \(A\), considered as a linear transformation from \(\^^n V \ra \^^n V\), meaning that it has basis vectors attached rather than being a dimensionless scalar.
<ul>
<li>The scalar value \(\det (A)\) is the coefficient of \(A^{\^ n}\)</li>
<li>Sometimes it’s useful to write the coefficient as a trace: \(\det (A) = \tr A^{\^ n}\)</li>
<li>\(A^{\^ n}\) is a linear transformation between <em>volumes</em> of \(U\) to volumes of \(V\).</li>
<li>Most of the properties of \(\det\) follow directly from this interpretation. (eg: non-linearly-independent matrix \(\ra\) degenerate volume \(\ra\) 0 determinant; composition of transformations \(\ra\) multiplicativity of determinants.)</li>
</ul>
</li>
<li>If \(U=V\) then \(A^{\^ n - 1}\) is the <a href="https://en.wikipedia.org/wiki/Adjugate_matrix">adjugate matrix</a> of \(A\), except that it is written in the \(\^^{n-1} V\) basis, and \(A^{\^ n - 1} / \det(A)\) is \(A^{-1}\), except that it is written in the \(\^^{n-1} V\) basis. In a later post we will elaborate on the \(\star\) operator, which is the right way to map these back to the correct bases.</li>
<li>It’s useful to define \(A^{\^ 0}\) as \(1 \in \bb{R} \simeq \^^0 V\).</li>
</ul>
<p><strong>Example:</strong></p>
<p>Consider the sloped rectangle in \(\bb{R}^3\) formed by the points \(\{ a \b{x}, a\b{x} + \b{z}, b \b{y} + \b{z}, b \b{y} \}\). Two of its sides are given by \(\b{c}_1 = \b{z}\) and \(\b{c}_2 = b \b{y} -a \b{x}\), and its area is a bivector:</p>
\[\b{c} = \b{c}_1 \^ \b{c}_2 = \b{z} \^ (b \b{y} - a \b{x}) = a (\b{x} \^ \b{z}) - b (\b{y} \^ \b{z}) \in \^^2 \bb{R}^3\]
<p>It’s a linear combination of some area on the \(\b{xz}\) and \(\b{yz}\) planes. Its total scalar area can be computed by taking the magnitude of this bivector (\(\| \b{c} \| =\sqrt{a^2 + b^2}\)), which shows how areas and \(k\)-volumes in general obey a version of the Pythagorean theorem (more on that in the next post).</p>
<p>Suppose</p>
\[A = \begin{pmatrix} 0 & 0 & -1 \\ 2 & 0 & 0 \\ 0 & 3 & 0 \end{pmatrix}\]
<p>Its action on \(\b{c}\) is:</p>
\[\begin{aligned} A^{\^ 2} \b{c} &= a (A \b{x}) \^ (A \b{z}) - b (A \b{y}) \^ (A \b{z}) \\
&= a (2 \b{y}) \^ (- \b{x}) - b(3 \b{z}) \^ (-\b{x}) \\
&= 2a (\b{x \^ y}) + 3 b (\b{z \^ x})
\end{aligned}\]
<p>We see that \(A^{\^2}\) does not multiply \(\| \b{c} \| = \sqrt{a^2 + b^2}\) by a fixed scalar, but rather takes it to \(\| A^{\^2} \b{c} \| = \sqrt{4a^2 + 9b^2}\). This shows how, if we want to consider the area of this rectangle as a geometric object, we should write it as its bivector representation rather than mapping it back to its scalar value.</p>
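<p>The arithmetic in this example is easy to verify numerically, here with \(a = b = 1\) (a Python sketch; the helper names are mine):</p>

```python
import math

def wedge2(u, v):
    # components of u ^ v in the (x^y, y^z, z^x) basis
    return (u[0]*v[1] - u[1]*v[0],
            u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2])

def apply(A, v):
    return tuple(sum(A[i][j] * v[j] for j in range(3)) for i in range(3))

a, b = 1.0, 1.0
c1, c2 = (0, 0, 1), (-a, b, 0)           # the two sides z and b*y - a*x
c = wedge2(c1, c2)
print(math.sqrt(sum(x*x for x in c)))    # sqrt(a^2 + b^2)

A = [[0, 0, -1],
     [2, 0, 0],
     [0, 3, 0]]
Ac = wedge2(apply(A, c1), apply(A, c2))  # A^{^2} c, computed via its definition
print(math.sqrt(sum(x*x for x in Ac)))   # sqrt(4 a^2 + 9 b^2)
```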
<hr />
<h3 id="22-multivector-matrix-notations">2.2 Multivector Matrix Notations</h3>
<p>In this section let \(U = V = \bb{R}^3\). Then:</p>
\[\begin{aligned}
A^{\^0} &= 1 \\
A^{\^1} &= A = (A \b{x}, A \b{y}, A \b{z}) \\
A^{\^2} &= (A^{\^2}(\b{x} \^ \b{y}), A^{\^2}(\b{y} \^ \b{z}) , A^{\^2}(\b{z} \^ \b{x}) ) \\
A^{\^3} &= A^{\^3}(\b{x \^ y \^ z}) = A \b{x} \^ A \b{y} \^ A \b{z} = \det (A) \; \b{x \^ y \^ z} \end{aligned}\]
<p>Let’s write out \(A^{\^2}\) componentwise. It’s a \(3 \times 3\) matrix: it has three columns and each of those is a bivector with three components.</p>
<p>To keep track of rows and columns I am going to use subscripts for column indexes and superscripts for row indexes. So, as you go down a column, the upper index changes but the lower index stays the same.</p>
<p>Here’s the \(\b{x \^ y}\) column of \(A^{\^ 2}\):</p>
\[A^{\^2} (\b{x \^ y}) = A(\b{x}) \^ A(\b{y})
=
\begin{pmatrix} A^{\^2}(\b{x \^ y})_{\b{x \^ y}} \\[3pt] A^{\^2}(\b{x \^ y})_{\b{y \^ z}} \\[3pt] A^{\^2}(\b{x \^ y})_{\b{z \^ x}} \end{pmatrix} = \begin{pmatrix} A_x^x A_y^y - A_x^y A_y^x \\[3pt] A_x^y A_y^z - A_x^z A_y^y \\[3pt] A_x^z A_y^x - A_x^x A_y^z \end{pmatrix}\]
<p>Here’s the \(\b{x}^T \^ \b{y}^T\) row of \(A^{\^ 2}\):</p>
\[\begin{aligned}
(\b{x}^T \^ \b{y}^T) (A^{\^2}) &= (\b{x}^T A) \^ (\b{y}^T A) \\
&= \big(
(A_x^x A_y^y - A_x^y A_y^x) ,
(A_y^x A_z^y - A_y^y A_z^x) ,
(A_z^x A_x^y - A_z^y A_x^x) \big) \\
\end{aligned}\]
<p>With a regular matrix we can extract components by contracting with a basis vector on each side; for instance \(A_y^x = \b{x}^T A \b{y}\). We can do the same here:</p>
\[(\b{y}^T \^ \b{z}^T) A^{\^2}(\b{x \^ y}) = (A^{\^2})_{x \^ y}^{y \^ z} = \begin{vmatrix} A_x^y & A_y^y \\ A_x^z & A_y^z \end{vmatrix} = A_x^y A_y^z - A_x^z A_y^y\]
<p>Each component \(A_{\^ I}^{\^ J}\) of \(A^{\^ k}\) is the determinant of a \(k \times k\) <a href="https://en.wikipedia.org/wiki/Minor_(linear_algebra)">minor</a> of \(A\): the minor created by the columns \(I\) and the rows \(J\) (possibly with a minus sign, depending on the permutations of your bases). I prefer this notation because it emphasizes exactly the meaning of that value: that it is a component of the matrix \(A^{\^ k}\).</p>
<p>Minors on the diagonal are called principal minors. The diagonal elements of \(A^{\^ k}\) are the determinants of these. For instance:</p>
\[(\b{x \^ y})^T A^{\^ 2} (\b{x \^ y}) = \begin{vmatrix} A_x^x & A_x^y \\ A_y^x & A_y^y \end{vmatrix} = A_x^x A_y^y - A_y^x A_x^y\]
<p>Note that we can expand a matrix in its columns or rows first and get the same answer. Also note that in general, \((\^^k V)^T \simeq \^^k V^T\), at least in finite dimensional spaces (I think?), so \((A^{\^ k})^T = (A^T)^{\^ k}\) and we can compute whichever is convenient. Either way you end up with coordinates on each of the pairs of the \(\^^k V\) basis vectors.</p>
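<p>The component-equals-minor correspondence can be spot-checked on a small example (Python; the helper names are mine):</p>

```python
A = [[1, 2, 0],
     [3, 1, 1],
     [0, 1, 2]]

def col(A, j):
    return [A[i][j] for i in range(3)]

def wedge2(u, v):
    return (u[0]*v[1] - u[1]*v[0],   # x^y component
            u[1]*v[2] - u[2]*v[1],   # y^z component
            u[2]*v[0] - u[0]*v[2])   # z^x component

# the y^z component of A^{^2}(x ^ y) ...
component = wedge2(col(A, 0), col(A, 1))[1]
# ... equals the determinant of the minor with columns {x, y} and rows {y, z}
minor = A[1][0]*A[2][1] - A[1][1]*A[2][0]
print(component, minor)   # 3 3
```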
<p>This notation gets pretty confusing. I propose that we allow ourselves to just write \(A_{x \^ y}^{x \^ y}\) when we mean \((A^{\^ 2})_{x \^ y}^{x \^ y}\). It’s smooth. Here’s a summary:</p>
<aside>
<p><strong>Notation summary</strong></p>
<p>\(A_{x \^ y}^{y \^ z} = (A^{\^ 2})_{x \^ y}^{y \^ z}\), so we can refer to components of \(A^{\^ 2}\) by using multivector subscripts and superscripts. The resulting value is a scalar.</p>
<p>\(A_{x \^ y} = (A^{\^ 2})_{x \^ y}\) gives the \(\b{x \^ y}\) column of \(A^{\^ 2}\), so we can refer to columns by omitting row indexes.</p>
<p>\(A^{x \^ y} = (\b{x}^T A) \^ (\b{y}^T A)\) similarly extracts the \(\b{x} \^ \b{y}\) row of \(A^{\^ 2}\).</p>
<p>Using these we can expand \(A^{\^ 2}\) like this:</p>
\[A^{\^ 2} =
\begin{pmatrix}
A_{x \^ y}^{x \^ y} & A_{y \^ z}^{x \^ y} & A_{z \^ x}^{x \^ y} \\[5pt]
A_{x \^ y}^{y \^ z} & A_{y \^ z}^{y \^ z} & A_{z \^ x}^{y \^ z} \\[5pt]
A_{x \^ y}^{z \^ x} & A_{y \^ z}^{z \^ x} & A_{z \^ x}^{z \^ x} \end{pmatrix} =
\begin{pmatrix} A_{x \^ y} & A_{y \^ z} & A_{z \^ x} \end{pmatrix}
= \begin{pmatrix} A^{x \^ y} \\[5pt] A^{y \^ z} \\[5pt] A^{z \^ x} \end{pmatrix}\]
<p>Higher powers of \(A^{\^ k}\) are the same idea, but indexed by the basis multivectors \(\b{x}_{\^ I} \in \^^k V\).</p>
<p>For a further example: notice how these three equations say slightly different things, because each index that we use is a basis multivector which does not appear on the right side:</p>
\[A_{x \^ y \^ z} = \det (A) \, \b{x} \^ \b{y} \^ \b{z}\]
\[A_{x \^ y \^ z}^{x \^ y \^ z} = \det (A)\]
\[A^{\^3} = \det (A) \, (\b{x} \^ \b{y} \^ \b{z})^T \otimes (\b{x} \^ \b{y} \^ \b{z})\]
<p>Whenever possible, \(\det(A)\) will refer to the <em>scalar</em> value of \(A^{\^ n}\), while \(A^{\^ n}\) refers to the linear transformation \(\^^n U \ra \^^n V\).</p>
</aside>
<hr />
<h3 id="23-maps-between-spaces-with-different-dimensions">2.3 Maps between spaces with different dimensions</h3>
<p>If \(A\) is a map between two vector spaces with different dimensions, we can still talk about \(A^{\^ k}\). Intuitively, this is a linear transformation which maps areas (or \(k\)-volumes) in one space to areas (\(k\)-volumes) in the other, and there’s no requirement that they have the same dimension for that to make sense.</p>
<p>If \(\dim U = m\) and \(\dim V = n\) and \(A : U \ra V\), then, as mentioned above, \(A^{\^ k}\) can be written as an \({n \choose k} \times {m \choose k}\) matrix.</p>
<p>Suppose \(A : \bb{R}^2 \ra \bb{R}^3\), and we label the \(\bb{R}^2\) by basis vectors \(\b{u,v}\). Then:</p>
\[\^^2 A = \begin{pmatrix}
A_{u \^ v}^{x \^ y} \\[5pt]
A_{u \^ v}^{y \^ z} \\[5pt]
A_{u \^ v}^{z \^ x}
\end{pmatrix}\]
<p>Concretely:</p>
\[A = \begin{pmatrix} a & b \\ c & d \\ e & f \end{pmatrix}\]
\[\^^2 A = \begin{pmatrix} ad-bc \\ cf-de \\ eb - fa \end{pmatrix}\]
\[\^^2 (A^T) = (\^^2 A)^T = \begin{pmatrix} ad - bc & cf - de & eb - fa \end{pmatrix}\]
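<p>A quick numerical check of this \(3 \times 2\) example (Python; the names are mine). Note that the \(\b{z \^ x}\) entry is the rows-\((x,z)\) minor with a minus sign, since the basis uses \(\b{z \^ x}\) rather than \(\b{x \^ z}\):</p>

```python
from itertools import combinations

a, b, c, d, e, f = 1, 2, 3, 4, 5, 7
A = [[a, b], [c, d], [e, f]]

# entries of wedge^2 A, in the (x^y, y^z, z^x) basis
w2 = [a*d - b*c, c*f - d*e, e*b - f*a]

# each entry is a 2x2 minor of A, taken over a pair of rows
minors = {(r, s): A[r][0]*A[s][1] - A[r][1]*A[s][0] for r, s in combinations(range(3), 2)}
print(w2, [minors[(0, 1)], minors[(1, 2)], -minors[(0, 2)]])
```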
<p>No matter what, \(A^{\^ k}\) is a map between \(\^^k U \ra \^^k V\), and if \(k > \dim U\) or \(k > \dim V\) then one of those spaces is zero-dimensional and \(A^{\^ k}\) is just \(0\). In the case above, \(A^{\^ 3} = 0\).</p>
<p>Note that if the spaces don’t have the same dimension, then there’s no concept of a scalar ‘determinant’ to linear transformations between them.</p>
<p>(Technically there is nothing stopping us from discussing linear transformations which, say, map areas in one space to volumes in another. But those are not going to be common – they are not geometrically ‘natural’. Besides the \(\star\) operator which maps \(\^^k V \lra \^^{n-k} V\), we mostly only use maps from \(k\)-vectors to \(k\)-vectors.)</p>
<hr />
<p>Other articles related to Exterior Algebra:</p>
<ol start="0">
<li><a href="/2018/08/06/oriented-area.html">Oriented Areas and the Shoelace Formula</a></li>
<li><a href="/2018/10/08/exterior-1.html">Matrices and Determinants</a></li>
<li><a href="/2018/10/09/exterior-2.html">The Inner product</a></li>
<li><a href="/2019/01/26/hodge-star.html">The Hodge Star</a></li>
<li><a href="/2019/01/27/interior-product.html">The Interior Product</a></li>
<li><a href="/2020/10/15/ea-operations.html">All the Exterior Algebra Operations</a></li>
</ol>
Oriented Areas and the Shoelace Formula2018-08-06T00:00:00+00:00http://alexkritchevsky.com/2018/08/06/oriented-area<p>Here’s a summary of the concept of oriented area and the “shoelace formula”, and some equations I found while playing around with it that turned out not to be novel.</p>
<p>I wanted to write this article because I think the concept deserves to be better popularized, and it is useful to me to have my own reference on the subject. Some resources I have found, including Wikipedia, cite a 1959 monograph entitled <em>Computation of Areas of Oriented Figures</em> by A.M. Lopshits, originally printed in Russian and translated to English by Massalski and Mills, which I have not been able to find online. I did find a copy via university library, and I thought I would summarize its contents in the process to make them more available to a casual Internet reader.</p>
<p>I also wanted to practice making beautiful math diagrams. Which went okay, but god is it ever not worth the effort.</p>
<!--more-->
<hr />
<h2 id="1">1</h2>
<p>Let \(P\) be a polygon with vertices \(\{p_1, p_2, \ldots p_n\}\). Sometimes I will refer to \(p_0\) also, which is defined to equal \(p_n\), because it makes formulas neater.</p>
<p>Here’s a \(P\) with \(n=5\):</p>
<p><img src="/assets/posts/2018-08-06/01-polygon.svg" width="200px" /></p>
<p>The <em>signed</em> or <em>oriented</em> area of \(P\) is given by the so-called “<a href="https://en.wikipedia.org/wiki/Shoelace_formula">shoelace formula</a>”:</p>
\[Area(P) = \frac{1}{2} \sum_{i=0}^{n-1} p_i \times p_{i+1} \tag{1}\]
<p>where the sum wraps around, thanks to \(p_0\) being the same as \(p_n\). Each term is the area of the triangle formed by the origin, \(p_i\), and \(p_{i+1}\).</p>
<p>It’s called the shoelace formula because when you write all the coordinates in a column and begin to compute the cross products by multiplying \(x_1 y_2 - x_2 y_1\), \(x_2 y_3 - x_3 y_2\), etc, it’s reminiscent of a laced shoe:</p>
<p><img src="/assets/posts/2018-08-06/02-shoelace.svg" width="150px" /></p>
<p>“Signed” area means that the area is positive if its vertices go <em>counterclockwise</em>, and is negative if its vertices go <em>clockwise</em>.</p>
\[Area(-P) = Area(\{p_n, p_{n-1}, \ldots p_0\}) = -Area(P)\]
<p><img src="/assets/posts/2018-08-06/03-negative.svg" width="200px" /></p>
<p>You can remember which is which because counterclockwise / positive is also the direction that positive radians go, by convention. This is arbitrary, and we could have defined it the other way. If you just want the unsigned area of a region, you can always take the absolute value of this, so the signed area is strictly more powerful than the unsigned version.</p>
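<p>The shoelace formula is only a few lines of code. A minimal Python sketch (the function name is mine), showing that reversing the vertex order flips the sign:</p>

```python
def signed_area(pts):
    # shoelace formula: half the sum of cross products of consecutive vertices
    n = len(pts)
    total = 0.0
    for i in range(n):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % n]   # wraps around, since p_n = p_0
        total += x1 * y2 - x2 * y1
    return total / 2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]    # counterclockwise
print(signed_area(square))                   # 1.0
print(signed_area(list(reversed(square))))   # -1.0: clockwise is negative
```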
<p>Signed areas are useful because they are better-behaved than regular areas in several ways. The signed area of a shape is preserved under any decomposition into component shapes, with negatively-oriented components subtracting from the total area:</p>
<p><img src="/assets/posts/2018-08-06/04-subtraction.svg" width="500px" /></p>
<p>I’ve indicated that the circle is negatively oriented and thus has negative area with an arrow that says it is to be traversed clockwise. Its area is subtracted from the total area of the rectangle, giving the area of the composite shape automatically.</p>
<p>This becomes more useful in more complicated figures, because it lets us build them out of simple parts very cleanly. The shoelace formula works because of the ability to add these oriented areas without having to specify which ones to subtract. It amounts to decomposing a shape into a list of triangles with the origin as the third vertex, and adding their areas. This is totally natural if the origin is fully contained within the polygon:</p>
<p><img src="/assets/posts/2018-08-06/05-triangles.svg" width="200px" /></p>
<p>But signed areas mean that this construction works even if the origin is outside the polygon, with the triangles overlapping, because their overlapping parts cancel perfectly:</p>
<figure>
<p><img src="/assets/posts/2018-08-06/06-triangles2.svg" width="200px" /></p>
<figcaption>
<p>The dark areas cancel out of the total sum, because the (negative) area of \(p_1 p_2\mathcal{O}\) exactly cancels the excess positive areas in each of the other triangles \(p_2 p_3 \mathcal{O}, p_3 p_4 \mathcal{O}, p_4 p_0 \mathcal{O}\), and \(p_0 p_1 \mathcal{O}\).</p>
</figcaption>
</figure>
<p>The coordinate-invariance of this formula (that it works regardless of where \(\mathcal{O}\) is) should be enough to motivate it as mathematically valuable. We like formulas that don’t care about specific coordinate systems.</p>
<p>We’re used to dealing with what are called <em>simple</em> polygons – polygons whose sides never overlap, so they surround a single region of space without any intersections. We can also consider <em>non-simple</em> polygons, which are allowed to have their vertices or edges overlap or intersect. The shoelace formula continues to work if the polygon is <em>non-simple</em>, except we must understand that negatively-oriented regions subtract from the total sum instead of adding, which may not be totally intuitive:</p>
<figure>
<p><img src="/assets/posts/2018-08-06/07-nonsimple.svg" width="250px" /></p>
<figcaption>
<p>The total signed area here is <em>zero</em>, because the two oppositely-oriented components have the same magnitudes of areas, but opposite signs, so they cancel out.</p>
</figcaption>
</figure>
<hr />
<p>Related to the concept of oriented area is the concept of <em>oriented angle</em>, which is actually a bit more familiar. Oriented angles distinguish between “the angle between \(\b{a}\) and \(\b{b}\)” and “the angle between \(\b{b}\) and \(\b{a}\)”, by insisting that we specify <em>counterclockwise</em> angles (the way radians go) as positive:</p>
<p><img src="/assets/posts/2018-08-06/08-angles.svg" width="250px" /></p>
<p>This is very much like how the vector \(\bf{b-a}\) is the negative of the vector \(\bf{a-b}\). In fact it is appealing to think of oriented angles as some kind of curved vectors – we can add and subtract ‘angular’ vectors just like regular vectors. In this example, evidently \(\angle \b{ab} + \angle \b{bc} = \angle \b{ac}\), and this angle addition formula holds whether or not \(\b{c}\) falls between \(\b{a}\) and \(\b{b}\):</p>
<p><img src="/assets/posts/2018-08-06/09-angle-addition.svg" width="300px" /></p>
<p>We want to think of angles as oriented because we can use them in formulas to get oriented areas as results. The cross product of two vectors \(\b{a}, \b{b}\) gives the signed area of the parallelogram \((\b{0}, \b{a}, \b{a+b}, \b{b})\), regardless of their relative orientation:</p>
\[\b{a} \times \b{b} = |a||b| \sin \phi \\
\b{b} \times \b{a} = |a||b| \sin (- \phi) = - |a||b| \sin \phi\]
<hr />
<h2 id="2">2</h2>
<p>The shoelace formula can be massaged into some other forms. Defining \(\b{d}_i\) as the vector displacements of each side<sup id="fnref:vector" role="doc-noteref"><a href="#fn:vector" class="footnote" rel="footnote">1</a></sup>:</p>
\[\b{d}_i = p_{i+1} - p_i\]
<p>Which is just the same polygon labelled differently:</p>
<p><img src="/assets/posts/2018-08-06/10-displacements.svg" width="150px" /></p>
<p>Then we can write the area as:</p>
\[Area(P) = \frac{1}{2} \sum p_i \times \b{d}_{i} \tag{2}\]
<p>Since \(p_i \times p_i = 0\) and \(\times\) is distributive, this is the same as (1). Essentially we are just referring to our triangles differently, by \(p_i\) and \(\b{d}_i\) instead of \(p_i\) and \(p_{i+1}\):</p>
<p><img src="/assets/posts/2018-08-06/11-displacements2.svg" width="150px" /></p>
<p>The \(\b{d}_i\) vectors traverse the polygon starting at \(p_0\):</p>
\[p_j = p_0 + \sum_{i = 0}^{j-1} \b{d}_i\]
<p>They form a closed loop (\(\sum_{i=0}^{n} \b{d}_i = \b{0}\)). Therefore we can eliminate the \(p_i\) terms from the area formula entirely by writing them as sums of displacements (recalling that \(\sum_{i} p_0 \times \b{d}_i\) cancels out because it equals \(p_0 \times \sum_{i} \b{d}_i = p_0 \times \b{0} = 0\)):</p>
\[\begin{aligned} Area(P) &= \frac{1}{2} \sum_{i} (p_0 + \sum_{j < i} \b{d}_j) \times \b{d}_i \\
&=\frac{1}{2} [ \sum_{i} p_0 \times \b{d}_i + \sum_{i}\sum_{j < i} \b{d}_j \times \b{d}_i ]\end{aligned}\]
\[Area(P) = \frac{1}{2} \sum_{i}\sum_{j < i} \b{d}_j \times \b{d}_i \tag{3}\]
<p>This is the shoelace formula, rewritten to only use the vector displacements of the figure.</p>
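<p>As a quick concreteness check, here’s both forms in code. This is a Python sketch (all names are made up for illustration), computing the point form (1) and the displacement form (2) on a unit square:</p>

```python
def cross(u, v):
    """Scalar 2D cross product u x v."""
    return u[0] * v[1] - u[1] * v[0]

def shoelace_area(points):
    """Formula (1): half the sum of p_i x p_{i+1}, wrapping around."""
    n = len(points)
    return 0.5 * sum(cross(points[i], points[(i + 1) % n]) for i in range(n))

def displacement_area(points):
    """Formula (2): half the sum of p_i x d_i, where d_i = p_{i+1} - p_i."""
    n = len(points)
    total = 0.0
    for i in range(n):
        p, q = points[i], points[(i + 1) % n]
        total += cross(p, (q[0] - p[0], q[1] - p[1]))
    return 0.5 * total

square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # positively oriented
# both functions give +1.0; reversing the vertex order gives -1.0
```

<p>Neither function cares where the origin is, which is the coordinate-invariance mentioned earlier.</p>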
<hr />
<p>Finally, we can write this in terms of just the side-lengths and exterior vertex angles of the polygon.</p>
<p>We may describe the angles of a polygon in several ways. The first is by the interior angle \(\alpha_i\) at vertex \(p_i\), with the stipulation that this is always the angle measured counterclockwise from the first side in our oriented order, so that it is always the interior on positively-oriented simple polygons:</p>
<p><img src="/assets/posts/2018-08-06/12-interior.svg" width="150px" /></p>
<p>Interior angles are probably the most intuitive, but they perform less well in equations than the <em>exterior</em> angle at each vertex, which is the angle between the vectors \(\b{d}_{i-1}\) and \(\b{d}_i\). Let’s call them \(\theta_i\), so \(\theta_i\) is the exterior angle at the point \(p_i\):</p>
<p><img src="/assets/posts/2018-08-06/13-exterior.svg" width="150px" /></p>
<p>Evidently \(\alpha_i + \theta_i = \pi\). Also, for a simple polygon, the sum of all the angular displacements at every vertex must add up to \(2 \pi \equiv 0\), because the angles make one complete circle.</p>
\[\sum \theta_i = 2 \pi \equiv 0\]
<p><img src="/assets/posts/2018-08-06/14-complete.svg" width="250px" /></p>
<p>In a regular \(N\)-gon, each exterior angle must be equal, so \(\theta_i = \frac{2 \pi}{N}\), which means the interior angles are \(\alpha = \pi - \frac{2 \pi}{N}\) and the sum of the interior angles is \(N \alpha = N(\pi - \frac{2 \pi}{N}) = \pi (N-2)\).</p>
<p>The area of a regular \(N\)-gon is created from \(2N\) copies of the right triangle that has \(\frac{\alpha}{2}\) as one of its angles, which gives various formulas for computing their area depending on what lengths you have:</p>
<p><img src="/assets/posts/2018-08-06/15-regular.svg" width="200px" /></p>
<p>We can see that these relationships should hold, in case you’d like to calculate some values from the others:</p>
\[\begin{aligned} \frac{L}{2} &= r \cos \frac{\alpha}{2} \\
s &= r \sin \frac{\alpha}{2} \\
s &= \frac{L}{2} \tan \frac{\alpha}{2} \\
Perimeter &= NL = 2 N r \cos \frac{\alpha}{2} \\
Area &= \frac{1}{2} NsL \\
Area &= \frac{1}{2} s \cdot Perimeter
\end{aligned}\]
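<p>These relations are easy to sanity-check numerically. A sketch (the variable names \(L, r, s, \alpha\) just mirror the figure), using a regular hexagon:</p>

```python
import math

N = 6                                  # a regular hexagon
r = 1.0                                # circumradius
alpha = math.pi - 2 * math.pi / N      # interior angle
L = 2 * r * math.cos(alpha / 2)        # side length, from L/2 = r cos(alpha/2)
s = r * math.sin(alpha / 2)            # apothem
perimeter = N * L
area = 0.5 * N * s * L

# cross-check against an independent formula for the regular N-gon's area
assert abs(area - 0.5 * N * r**2 * math.sin(2 * math.pi / N)) < 1e-12
assert abs(area - 0.5 * s * perimeter) < 1e-12
```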
<p>For a simple, positively-oriented polygon, the exterior angles add up to exactly \(2 \pi\) radians. For non-simple polygons they may add up to any multiple of \(2 \pi\) depending on how many clockwise or counterclockwise loops there are:</p>
<figure>
<p><img src="/assets/posts/2018-08-06/16-exterior2.svg" width="500px" /></p>
<figcaption>
<p>The first figure has \(\sum \theta_i = 2\pi\), the second has \(\sum \theta_i = -2 \pi\), and the third has \(\sum \theta_i = 0\).</p>
</figcaption>
</figure>
<p>Each adjacent displacement vector \(\b{d}_i\) differs from the previous vector \(\b{d}_{i-1}\) by the exterior angle between them \(\theta_i\).</p>
<p><img src="/assets/posts/2018-08-06/17-exterior3.svg" width="125px" /></p>
<p>Therefore we can find the angle between <em>any</em> two displacement vectors \(\b{d}_i, \b{d}_j\) by adding up all the exterior angles between them (with the sum wrapping around if need be, and with addition taken modulo \(2 \pi\)):</p>
\[\theta_{ij} = \sum_{i < k \leq j} \theta_k\]
<p>We can use this in \((3)\) to get a version of the area formula expressed only in lengths and exterior angles:</p>
\[\begin{aligned}
Area(P) &= \frac{1}{2} \sum_{i}\sum_{j < i} \b{d}_j \times \b{d}_i \\
&= \frac{1}{2} \sum_i \sum_{j < i} | \b{d}_j | | \b{d}_i | \sin (\theta_{ij}) \\
\end{aligned}\]
\[Area(P) = \frac{1}{2} \sum_i \sum_{j < i} | \b{d}_j | | \b{d}_i | \sin (\theta_{ij}) \tag{4}\]
<p>By labeling the side lengths \(\| \b{d}_i \|\) as \(a_{i+1}\) and expanding the sum over \(i\) before \(j\), we can get to a form which is presented <a href="https://en.wikipedia.org/wiki/Polygon#cite_ref-lopshits_6-0">on Wikipedia</a>:</p>
\[\begin{aligned} Area(P) &= \frac{1}{2} (a_1 [a_2 \sin (\theta_1) + a_3 \sin (\theta_1 + \theta_2) + \ldots + a_{n-1} \sin(\theta_1 + \theta_2 + \ldots + \theta_{n-2}) ] \\
&+ a_2 [ a_3 \sin (\theta_2) + a_4 \sin (\theta_2 + \theta_3) + \ldots + a_{n-1} \sin (\theta_2 + \theta_3 + \ldots + \theta_{n-2})] \\
&+ \dots + a_{n-2} [ a_{n-1} \sin (\theta_{n-2}) ]) \tag{5}
\end{aligned}\]
<p>This and (4) are two ways of expressing the same idea: the area of a polygon in terms of scalar lengths and angles. I am not sure when you would ever want to use these, though – these loops have \(O(N^2)\) steps in them, while the original formula (1) involved only \(N\).</p>
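<p>Even if it’s inefficient, formula (4) is easy to test. A Python sketch (the function name is made up), using the convention that <code>thetas[k]</code> is the exterior angle at vertex \(p_k\):</p>

```python
import math

def area_from_sides_and_angles(lengths, thetas):
    """Formula (4): area from side lengths |d_i| and exterior angles alone.

    The angle from d_j to d_i is the sum of the exterior angles at the
    vertices strictly after p_j, up through p_i (thetas[0] is never used).
    """
    n = len(lengths)
    total = 0.0
    for i in range(n):
        for j in range(i):
            theta_ij = sum(thetas[k] for k in range(j + 1, i + 1))
            total += lengths[j] * lengths[i] * math.sin(theta_ij)
    return 0.5 * total

# unit square: four sides of length 1, turning 90 degrees at each vertex
area = area_from_sides_and_angles([1, 1, 1, 1], [math.pi / 2] * 4)
# comes out to 1.0, up to floating point
```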
<hr />
<h2 id="3">3</h2>
<p>As mentioned, the Wikipedia article on <a href="https://en.wikipedia.org/wiki/Polygon">polygons</a> sources formula (5) from <em>Computation of Areas of Oriented Figures</em> by A.M. Lopshits, published 1959. It turns out some other people have chased that link and also cited this text, but I could not get my hands on a .pdf version, so I got it from the university library.</p>
<p>If you are curious what it contains, here’s an outline. The short version is: not much. I wanted more formulas and ideas in the vein of (5), but much deeper. It turns out, though, that it’s a mostly pedagogical text that reaches that formula as its final result after spending 40 pages on the concept of oriented area and related (elementary) geometric proofs.</p>
<p><span style="text-decoration:underline;">Outline of <em>Computation of Areas of Oriented Figures</em> by AM Lopshits</span></p>
<p><strong>Chapter 1. Measurement of the Area of an Oriented Figure</strong></p>
<ol>
<li>Oriented triangles are a lot like regular triangles, but oriented.</li>
<li>Oriented triangles have oriented areas.
<ul>
<li>The simplest way to see that this might be useful is this:</li>
<li>
<p>Consider a triangle \(ABC\). If \(A'\) is a point on the line segment \(BC\), then \(area(ABC) = area(A'AB) + area(A'CA)\).<br /><br />
<img src="/assets/posts/2018-08-06/18-triangle.svg" width="150px" /></p>
</li>
<li>
<p>If we allow ourselves <em>oriented</em> triangles and areas, this remains true even if \(A'\) is a point on the line \(\overrightarrow{BC}\), but <em>outside</em> the triangle:<br /><br />
<img src="/assets/posts/2018-08-06/19-triangle2.svg" width="200px" /></p>
</li>
<li>Lopshits makes all of these points arduously, in multiple theorems with elaborate proofs. I suspect the mid-century Russians loved proving things arduously.</li>
</ul>
</li>
<li>The area of an oriented triangle can be calculated using the shoelace formula for any choice of origin \(\mathcal{O}\).
<ul>
<li>this is carefully proven using previous theorems.</li>
</ul>
</li>
<li>Oriented polygons are oriented collections of points. The shoelace formula gives their area for any choice of \(\mathcal{O}\).
<ul>
<li>this is also carefully proven using previous theorems.</li>
</ul>
</li>
<li>
<p>Some examples and exercises. The most interesting set is like this (for \(N=8,12,20\)):</p>
<p>“A regular dodecagon \(A_1A_2 \ldots A_{12}\) is inscribed in a circle. The polygon \(A_1 A_6 A_5 A_{10} A_9 A_2\) has three points of self-intersection, \(C_1\), \(C_2\), \(C_3\). Prove that the area of the triangle \(C_1 C_2 C_3\) is three times the area of the triangle \(A_1 A_2 C_1\).”<br /><br />
<img src="/assets/posts/2018-08-06/20-dodecagon.svg" width="250px" /></p>
</li>
</ol>
<aside class="toggleable" id="solution" placeholder="<b>Solution</b> <em>(click to expand)</em>">
<p>Designate the center of the polygon as the origin \(\mathcal{O}\). The central angle between adjacent points is \(\angle A_1 \mathcal{O} A_2 = \frac{360 \degree}{12} = 30 \degree\). Recall that the area of the triangle formed by the center and two points on the circle separated by a central angle \(\theta\) is \(\frac{R^2}{2} \sin \theta\).</p>
<p>Noting that \(\mathcal{O} A_1 A_6\) has the opposite orientation of \(\mathcal{O} A_1 A_2\), the total signed area \(S\) of the figure \(A_1 A_6 A_5 A_{10} A_9 A_2\) is:</p>
\[\begin{aligned} S &= 3[area(\mathcal{O}A_1 A_6) + area(\mathcal{O} A_2 A_1)] \\
&= 3 \frac{R^2}{2} [ -\sin 150 \degree +\sin 30 \degree ] \\
&= 0 \end{aligned}\]
<p>But this can also be written as</p>
\[S = area(C_1 C_2 C_3) + 3 \cdot area(C_1 A_1 A_2)\]
<p>Which gives:</p>
\[3 \cdot area(C_1 A_1 A_2) = -area(C_1 C_2 C_3)\]
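<p>If you don’t trust the signs, here’s a numeric check (a sketch) that the total signed area \(S\) of the figure really is zero, via the shoelace formula on the unit circle:</p>

```python
import math

def shoelace(points):
    """Signed area of an oriented polygon via the shoelace formula."""
    n = len(points)
    return 0.5 * sum(points[i][0] * points[(i + 1) % n][1]
                     - points[i][1] * points[(i + 1) % n][0]
                     for i in range(n))

def A(i):
    """Vertex A_i of the regular dodecagon, on the unit circle."""
    t = 2 * math.pi * (i - 1) / 12
    return (math.cos(t), math.sin(t))

figure = [A(1), A(6), A(5), A(10), A(9), A(2)]
# shoelace(figure) is 0 to floating-point accuracy, as claimed
```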
</aside>
<p><strong>Chapter 2. The Planimeter</strong></p>
<ol>
<li>This chapter entirely describes the workings of a device called the <em>planimeter</em>, which is used to measure [signed] areas of printed curves via, essentially, the shoelace formula acting on a polygonal approximation.<br />
I do not care about planimeters, though I’m sure this was interesting in 1959.<br />
In fact it turns out there are multiple kinds of planimeters out there, each of which the reader is invited to deeply contemplate.<br />
There is one theorem I found interesting, though:</li>
<li>
<p>“Imagine a directed segment \(\overrightarrow{AB}\) in the plane. Let us move it around the plane, finally bringing it back to its original position. In this motion the ends \(A\) and \(B\) will of course trace out closed curves \(L_A\) and \(L_B\). Also, \(\overrightarrow{AB}\) will sweep out an oriented area \(S\).”<br />
Theorem:</p>
\[area(S) = area(L_A) - area(L_B)\]
<p>This is simple to convince yourself of – as oriented areas, we must have \(area(S) = area(L_A) + area(L_B)\), and one of the areas must be oppositely-oriented so we may pick \(area(L_B)\) to be negative.</p>
</li>
</ol>
<p><strong>Chapter 3. Computation of the Area of a Polygon Useful in Surveying</strong></p>
<ol>
<li>This section primarily produces my formula (5). The argument for its use in surveying is that the formula uses only scalar lengths and angles, meaning that one can find the area of unwieldy regions (say, a plot of land) by measuring all the lengths and angles individually and then computing the result.</li>
<li>But before getting to that formulation, Lopshits introduces the concept of <em>vectors</em>. I am not understanding what level this text is meant for.</li>
<li>Derivation of equation (4), though without summation formulas.</li>
<li>Oriented angles work better than regular ones for this calculation.</li>
<li>Lots of words about computing what I have called \(\theta_{ij}\).</li>
<li>Derivation of equation (5), for which Lopshits was cited on Wikipedia.</li>
</ol>
<p>And that’s pretty much… it. I’m a little disappointed.</p>
<hr />
<h2 id="4">4</h2>
<p>It’s worth discussing how the shoelace formula is related to integral calculus. After all, if anything is the ‘canonical’ way to calculate area, it’s an area integral: \(area(P) = \iint_P 1 \, dx dy\).</p>
<p><a href="https://en.wikipedia.org/wiki/Green%27s_theorem">Green’s theorem</a> says how we can translate an area integral over a region into an integral over its boundary. Specifically, it says that an area integral over a region is equivalent to a certain line integral around the boundary of the region. If \(L\) and \(M\) are functions with continuous partial derivatives in a region \(D\) bounded by an (oriented!) curve \(C\), then:</p>
\[\oint_C L \, dx + M \, dy = \iint [ \p_x M - \p_y L ] dx dy\]
<p>For the simple case of \(\iint 1 \, dx dy\), we just need to find any \(L,M\) which have \(\p_x M - \p_y L = 1\). This is easily done with \(M(x,y) = x\) (taking \(L = 0\)), with \(L(x,y) = -y\) (taking \(M = 0\)), or with any combination of the two, like half of each. Therefore each of these gives an integral for area:</p>
\[\begin{aligned} area(D) &= \iint_D 1 \, dx dy \\
&= \oint_C x dy \\
&= \oint_C -y dx \\
&= \frac{1}{2} \oint_C x \, dy - y \, dx = \frac{1}{2} \oint_C (x,y) \times (dx, dy) \end{aligned}\]
<p>The connection is this:</p>
<p>Suppose we are computing \(\frac{1}{2} \oint_P (x,y) \times (dx, dy)\), ie, computing a line integral around the boundary of one of the oriented polygons \(P\) from before. Then, for the entire segment of the integral along the side \(p_i p_{i+1}\), the tangent direction is parallel to \(\b{d}_i\). These ‘infinitesimal’ triangles along the side should add up to give the finite contribution \(p_i \times \b{d}_i\) from that side:</p>
<p><img src="/assets/posts/2018-08-06/21-integral.svg" width="200px" /></p>
<p>To be a bit more explicit, we can parameterize the curve along a side \(p_i p_{i+1}\) by \(t \in [0,1]\), so that \((x,y) = p_i + t \b{d}_i\). Then \((dx, dy) =\b{d}_i dt\), and:</p>
\[\int_{p_i}^{p_{i+1}} (x,y) \times (dx, dy) = \int_0^1 (p_i + t \b{d}_i) \times \b{d}_i \, dt = p_i \times \b{d}_i\]
<p>Adding the contributions from every side gives:</p>
\[\frac{1}{2} \oint_C (x,y) \times (dx,dy) = \frac{1}{2} \sum_i p_i \times \b{d}_i\]
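<p>This is straightforward to verify numerically: sample the boundary of a polygon finely and accumulate \(\frac{1}{2}(x,y) \times (dx,dy)\). A sketch, using the midpoint rule on each side (which happens to be exact here, since the integrand is linear in the parameter):</p>

```python
def cross(u, v):
    return u[0] * v[1] - u[1] * v[0]

pts = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]  # a 2x1 rectangle
steps = 1000
total = 0.0
for i in range(len(pts)):
    p, q = pts[i], pts[(i + 1) % len(pts)]
    d = ((q[0] - p[0]) / steps, (q[1] - p[1]) / steps)  # (dx, dy) per step
    for k in range(steps):
        x = (p[0] + (k + 0.5) * d[0], p[1] + (k + 0.5) * d[1])  # midpoint
        total += cross(x, d)
area = 0.5 * total  # matches the exact area, 2.0, up to floating point
```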
<hr />
<p>While we’re at it, maybe we can come up with an ‘integral form’ of (3) or (4). Naively, it should look something like this, right?</p>
\[area(D) = \frac{1}{2} \sum_i \sum_{j < i} \b{d}_j \times \b{d}_i \Lra \frac{1}{2} \int_0^1 \int_0^t \; \dot{\vec{\gamma}}(s) \times \dot{\vec{\gamma}} (t) \, ds \, dt\]
<p>Where we parameterize the curve \(C\) enclosing our region as \(\vec{\gamma}(t)\) for \(t \in [0, 1]\).</p>
<p>This is just taking the integral formula \(\frac{1}{2} \oint_C \vec{\gamma}(t) \times \dot{\vec{\gamma}}(t) dt\) and replacing \(\vec{\gamma}(t)\) with \(\int_0^t \dot{\vec{\gamma}}(s) ds\), which should be fine as long as [<em>mumble</em>]. If we separate \(\vec{\gamma}(t)\) into \(r(t)\) and \(\theta(t)\) we should get a version of (5), also.</p>
<hr />
<h2 id="5">5</h2>
<p>The fact that the shoelace formula is a consequence of Green’s theorem means that there should be a way to take it to higher dimensions. Green’s theorem readily generalizes to <a href="https://en.wikipedia.org/wiki/Stokes%27_theorem">Stokes’ theorem</a> in arbitrary spaces, which says that the integral of a function (or differential form) over a closed surface can be equated to the integral of its derivative through the enclosed volume:</p>
\[\oint_{S} f = \int_V df\]
<p>This means that:</p>
<ul>
<li>the shoelace formula should continue to calculate the areas of oriented polygons in \(N>2\) dimensions</li>
<li>there should be an analog to the shoelace formula for computing volumes, 4-volumes, etc</li>
</ul>
<p>For now I will mention how to calculate area in 3D, because it looks a little different than in 2D.</p>
<p><strong>Area of a figure in 3d</strong>: Suppose we want to compute the area of the triangle:</p>
\[T = (t_1,0,0), (0,t_2,0), (0,0,t_3)\]
<p>We can compute the shoelace answer:</p>
\[\begin{aligned} area(T) &= \frac{1}{2} \big[ (t_1,0,0) \times (0,t_2,0) + (0,t_2,0) \times (0,0,t_3) + (0,0,t_3) \times (t_1,0,0) \big] \\
&= \frac{1}{2} \big[ (0,0,t_1 t_2) + (t_2 t_3, 0, 0) + (0, t_3 t_1, 0)\big] \\
&= \frac{1}{2} (t_2 t_3, t_3 t_1, t_1 t_2)
\end{aligned}\]
<p>But that’s… not a scalar. What went wrong?</p>
<p>The answer is that this is the area of \(T\) represented as a <em>vector</em>, which is normal to the plane of \(T\). It tells you <em>more</em> than the area – it also tells you what direction the area faces. To get to the <em>scalar</em> area, though, you have to take its magnitude:</p>
\[area(T) = | \frac{1}{2} (t_2 t_3, t_3 t_1, t_1 t_2) | = \frac{1}{2} \sqrt{t_2^2 t_3^2 + t_3^2 t_1^2 + t_1^2 t_2^2}\]
<p>But you lose the sign when taking the magnitude – the direction of that vector was what was telling us the orientation. In 3D, it’s meaningless to say that a surface is ‘positively’ oriented – what if it’s orthogonal to the \(XY\) plane? Should that be positive or negative? What if we flip it over? Orientation cannot be an intrinsic property of a shape if it changes as we rotate things!</p>
<p>Instead we can just talk about how things are oriented <em>relative</em> to each other. A figure on the \(XY\) plane is either oriented in the \(\b{z}\) direction or the \(-\b{z}\) direction. In 2D we drop the \(\b{z}\)’s and just call this positive and negative. In 3D, though, <em>volumes</em> have an absolute concept of orientation, but in 4D they would not – they would just end up oriented to the, say, \(+\b{w}\) or \(-\b{w}\) axes!</p>
<p>The shoelace formula used on a general polygon in 3D will work the same way, giving a vector. Note that this does require the polygon to be already known to be <em>planar</em>, or the answer you get will be meaningless.</p>
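<p>Here’s the 3D version in code, checked on the triangle \(T\) from above (a sketch; the function names are made up):</p>

```python
def cross3(u, v):
    """3D cross product u x v."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def vector_area(points):
    """Shoelace sum in 3D: the *vector* area of a planar polygon."""
    n = len(points)
    s = (0.0, 0.0, 0.0)
    for i in range(n):
        c = cross3(points[i], points[(i + 1) % n])
        s = (s[0] + c[0], s[1] + c[1], s[2] + c[2])
    return (s[0] / 2, s[1] / 2, s[2] / 2)

T = [(1, 0, 0), (0, 2, 0), (0, 0, 3)]  # t1, t2, t3 = 1, 2, 3
# vector_area(T) is (t2*t3/2, t3*t1/2, t1*t2/2) = (3.0, 1.5, 1.0)
```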
<aside class="toggleable" placeholder="<b>Aside</b>: Note on coplanarity <em>(click to expand)</em>">
<p>How do you tell if a list of points \(\{p_i\}\) is coplanar?</p>
<p>Well, non-degenerate triangles always are, because it takes three points to define a plane. But it’s easy in general: since the line between any pair of the points \((p_i, p_j)\) should fall on that plane, all of those lines should have cross products with each other that point <em>out</em> of the plane.</p>
<p>So: first, find any three points \((a,b,c)\) from the set which are not co<em>linear</em>, and compute \(\b{q} = (b-a)\times (c-a)\). This vector \(\b{q}\) definitely points out of the plane, so it should be orthogonal to any vector <em>on</em> the plane – such as the vector from \(a\) to any other point, \((p_i - a)\). So check that \(\b{q} \cdot (p_i - a) = 0\) for every \(i\). That’s O(N) computations in total, no problem.</p>
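<p>In code, the whole check might look like this (a sketch; it assumes the first two points are distinct, and the names are made up):</p>

```python
def cross3(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def coplanar(points, tol=1e-9):
    a, b = points[0], points[1]
    q = None
    for c in points[2:]:
        q = cross3(sub(b, a), sub(c, a))
        if any(abs(x) > tol for x in q):
            break  # found a non-colinear triple; q points out of their plane
    else:
        return True  # every point lies on the line ab: trivially coplanar
    # q should be orthogonal to the vector from a to every other point
    return all(abs(dot(q, sub(p, a))) <= tol for p in points)
```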
</aside>
<p>By the way – if you want to compute the surface area of a figure by summing the areas of each side, go right ahead, but make sure you normalize each side’s area to be positive first. The surface area <em>vectors</em> of a closed figure in 3D will always cancel out to \(\b{0}\). (Why? Because of Stokes’ theorem: \(\iint_{S} 1 \, d\b{S} = \iiint_V d(1) dV = 0\).)</p>
<hr />
<p>As for volumes – that will have to wait for another article. Higher-dimensional shapes like polyhedra are much harder to deal with than polygons, primarily because it’s just hard to <em>represent</em> them. It’s not enough to supply a list of vertices \(\{p_i\}\), because you also need to specify which sets of vertices make each face, and to ensure that the faces are oriented consistently: if the edge \(\overrightarrow{uv}\) appears on one face, then \(\overrightarrow{vu}\) must appear on the other face that shares that edge.</p>
<p>I will say only that if you <em>do</em> have a list of oriented polygonal faces \(\{F_i\} = \{\{f_{ij}\}\}\), then the volume of the pyramid created by a face with the origin is given by</p>
\[\begin{aligned} volume(F_i) &= \frac{1}{6} f_{i0} \cdot \sum_j f_{ij} \times f_{i,j+1} \\
&= \frac{1}{3} f_{i0} \cdot area(F_i)
\end{aligned}\]
<p>(\(f_{i0}\) just has to be any point on the face – we’re summing the volumes of the tetrahedrons \((\mathcal{O}, f_{i0}, f_{ij}, f_{i,j+1})\), and we need any third point in order to turn areas into volumes. \(area(F_i)\) in this case is the area <em>vector</em>, not the scalar.)</p>
<p>Then the volume of the whole figure is given by:</p>
\[volume(F) = \sum volume(F_i)\]
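<p>To make that concrete, here’s a sketch that runs the face-pyramid sum on a unit cube. The face lists are illustrative data: each face’s vertices are ordered counterclockwise as seen from outside, which is exactly the consistent-orientation requirement above:</p>

```python
def cross3(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def face_volume(face):
    # (1/6) f_0 . sum_j (f_j x f_{j+1}): the pyramid of this face with the origin
    n = len(face)
    s = (0.0, 0.0, 0.0)
    for j in range(n):
        c = cross3(face[j], face[(j + 1) % n])
        s = (s[0] + c[0], s[1] + c[1], s[2] + c[2])
    return dot(face[0], s) / 6.0

cube_faces = [
    [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 0)],  # bottom, normal -z
    [(0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)],  # top, normal +z
    [(0, 0, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1)],  # front, normal -y
    [(0, 1, 0), (0, 1, 1), (1, 1, 1), (1, 1, 0)],  # back, normal +y
    [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0)],  # left, normal -x
    [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 0, 1)],  # right, normal +x
]
volume = sum(face_volume(f) for f in cube_faces)  # 1.0 for the unit cube
```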
<hr />
<h2 id="6">6</h2>
<p>That’s all I’ve got, for now. Hope it’s useful, interesting, or otherwise not a total waste of time. I hope to revisit higher-dimensional shoelace-type formulas at some point, but it’ll have to wait. This was exhausting (well, making it pretty was).</p>
<p>I made the diagrams using <a href="https://en.wikipedia.org/wiki/PGF/TikZ">TikZ</a>, which is what all those fancy diagrams in LaTeX documents and textbooks are made in, and it was a lot of work but I’m glad I’ve started to learn how to do it. The process of conveniently importing TikZ images into websites, though, is … not enjoyable. I would be so excited if there was a project like <a href="https://www.mathjax.org/">Mathjax</a> that extracted a subset of TikZ for inline web-document creation.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:vector" role="doc-endnote">
<p>I like to use boldface to refer to things that are definitely <em>vectors</em>, as opposed to our \(p_i\) which are <em>points</em> and cannot be added. <a href="#fnref:vector" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h2>Geometric Mean and Standard Deviation</h2>
<p><em>2018-06-15 – http://alexkritchevsky.com/2018/06/15/geometric-mean</em></p>
<p>A friend is writing her master’s thesis in a subfield where data is typically summarized using <em>geometric</em> statistics: geometric means and geometric standard deviations (GSD), and sometimes even geometric standard errors – whatever those are. And occasionally ‘geometric confidence intervals’ and ‘geometric interquartile ranges’.</p>
<p>Most of which are (a) not something anyone really has intuition for and (b) surprisingly hard to find references for online, compared to regular ‘arithmetic’ statistics.</p>
<p>I was trying to help her understand these, but it took a lot of work to find easily-readable references online, so I wanted to write down what I figured out.</p>
<!--more-->
<h2 id="1-whats-the-point-of-using-a-geometric-mean">1. What’s the point of using a Geometric Mean?</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a> of a dataset \(\b{x} = \{x_i\}\) is given by:</p>
\[\GM[\b{x}] = (\prod x_i)^{\frac{1}{n}}\]
<p>Though it is easier to understand through this equation<sup id="fnref:log" role="doc-noteref"><a href="#fn:log" class="footnote" rel="footnote">1</a></sup>:</p>
\[\GM[\b{x}] = e^{\AM[\log \b{x}]}\]
<p>For example:</p>
\[\GM[1,2,3,4,5] = \sqrt[5]{1\cdot 2 \cdot 3 \cdot 4 \cdot 5} = e^{\frac{1}{n}(\log 1 + \log 2 + \log 3 + \log 4 + \log 5)} \approx 2.61\]
<p>A more meaningful example: if something grows by a factor of \(x_i\) in each of ten periods (\(i = 1 \ldots 10\)), then its total growth is by a factor of \(\GM[x_i]^{10}\). Multiplying by 10 different numbers gives the same result as multiplying by their geometric mean ten times.</p>
<p>(The base of the exponent and logarithm can be anything. It cancels out – if you use \(\log_b\), you raise \(b\) to the power afterwards: \(\GM[\b{x}] = b^{\AM[\log_b \b{x}]}\).)</p>
<p>(By the way: using \(\GM[]\) and \(\AM[]\) as notations for these things is definitely not conventional, but I think it makes it easier to read in settings where you’re using both.)</p>
<p>This formula means that you are computing “the average of the <em>logarithms</em> of your data”, and then rescaling that back so the units work out. The same process is used for the geometric standard deviation / confidence intervals / interquartile ranges / whatever: (a) calculate the regular statistic on the log data, like \(\text{SD}[\log x]\), then (b) rescale back: \(\text{GSD}[x] = e^{\text{SD}[\log x]}\).</p>
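<p>In code, the whole recipe is two lines per statistic. A sketch (<code>gm</code>/<code>gsd</code> follow this article’s notation, not any library’s; this uses the sample standard deviation, so swap in <code>pstdev</code> if you want the population version):</p>

```python
import math
from statistics import mean, stdev

def gm(xs):
    """Geometric mean: exponentiate the arithmetic mean of the logs."""
    return math.exp(mean(math.log(x) for x in xs))

def gsd(xs):
    """Geometric standard deviation: exponentiate the SD of the logs."""
    return math.exp(stdev(math.log(x) for x in xs))

data = [1, 2, 3, 4, 5]
# gm(data) equals (1*2*3*4*5) ** (1/5), about 2.61
```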
<p>The simplest case where log-transformed statistics are more useful is if we are dealing with data whose values range over many orders of magnitude. This is true for things that we already use logarithmic scales for, like <a href="https://en.wikipedia.org/wiki/Apparent_magnitude">brightness of stars</a>, <a href="https://en.wikipedia.org/wiki/Decibel">loudness of sounds</a>, <a href="https://en.wikipedia.org/wiki/PH">acidity/basicity of solutions</a>, <a href="https://en.wikipedia.org/wiki/Octave">sound frequency</a>, or <a href="https://en.wikipedia.org/wiki/Richter_magnitude_scale">power of earthquakes</a>.</p>
<p>In all these cases, notice that you don’t really know what the original measurements were. Like, presumably Richter measures the energy of the earthquake, but who knows what the values actually look like. Is a 7.5 Richter earthquake \(10^{15} J\) of energy? \(10^{20} J\)? Who knows?<sup id="fnref:richter" role="doc-noteref"><a href="#fn:richter" class="footnote" rel="footnote">2</a></sup> If we did statistics on the non-transformed values themselves, a single huge value like \(10^{20}\) would dominate the mean of any data set, so our summary statistics would not summarize much of anything.</p>
<p>Basically, each of these scales is measuring the logarithms of <em>something</em>, but for reporting data we don’t care as much what the ‘something’ is. If that’s the case, I think you may as well just log-transform all your data and then be done with it. Don’t worry about geometric means, just logarithm everything and take arithmetic means.</p>
<p>The geometric mean and related statistics are for when you log-transform your data for analysis, and then want to transform it <em>back</em>. For instance, if you wanted to report the power of an earthquake in <em>Joules</em>, not in Richter, maybe because you’re comparing it to other numbers reporting in Joules. Or, say, if the original data came in very sensible units like <code class="language-plaintext highlighter-rouge">parts per million</code> or <code class="language-plaintext highlighter-rouge">growth % year-over-year</code>, and you don’t want to report a value that’s been log-transformed to be unrecognizable.</p>
<hr />
<h2 id="1b-but-actually-when-do-you-use-it">1b. But actually: when do you use it?</h2>
<p>It’s turning out to be surprisingly tricky to get a straight answer. <a href="https://medium.com/@JLMC/understanding-three-simple-statistics-for-data-visualizations-2619dbb3677a">Some people</a> think the answer is ‘almost always’. But it appears to be pretty subjective.</p>
<p>Here’s the basic reason, though:</p>
<ul>
<li>You want to use a geometric mean if it makes more sense for your data to be multiplied together, rather than added together.</li>
</ul>
<p>(In fact, it may as well just be called a ‘multiplicative mean’, and then let’s call the arithmetic mean the ‘additive mean’ at the same time.)</p>
<p>Here are some signs that this might be true:</p>
<ul>
<li>the factors that cause your data to vary at all apply ‘multiplicatively’, instead of ‘additively’
<ul>
<li>that is, a factor’s effect is proportional to the size of your data. This would be anything like an ‘increase in efficiency’ or an ‘increased rate of occurrence’</li>
<li>in practice <em>most things</em>, apparently, work this way. Growth rates, heights, densities, power outputs, dollar amounts, disease rates, …</li>
<li>but generally you may not be sure, or able to tell, if this is true, so here are some more signs:</li>
</ul>
</li>
<li>corollary: the logarithm of the data looks more like a normal distribution than the data itself (ie your data is <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">log-normal</a>)</li>
<li>the data definitely cannot go below 0, by definition of what you’re measuring</li>
<li>and effectively never goes <em>to</em> 0, either, since that would make the geometric mean 0 also (though see note, below)</li>
<li>the data do not have a constant translation factor (ie, if you could just as easily have measured \(x + 50\) instead of \(x\), then the logarithm is going to be meaningless.)
<ul>
<li>so don’t take the GM of temperature scales, unless you translate them to Kelvin first</li>
</ul>
</li>
<li>the data <em>do</em> have an arbitrary multiplicative factor (if you could have just as easily measured \(50x\) instead of \(x\), by changing units)</li>
<li>the data are normalized against some constant value (so, if it is a % of <em>anything</em>)</li>
<li>or, generally, the data are ratios, such as concentrations of a substance, rates of occurrence of an event, or changes in a value per some unit of time.</li>
</ul>
<p>People online seem to think that many – maybe even <strong>most</strong> – quantities turn out to vary multiplicatively, more or less. Probably many more than are being currently summarized using geometric means!</p>
<p>I’d say a general rule of thumb is: if you feel, for your type of data and some value of \(N\), that a reasonable average of \(10^N\) and \(10^{-N}\) is \(1\) rather than \(\frac{10^N}{2}\), you should use a geometric mean.</p>
<p><strong>Examples</strong>:</p>
<ul>
<li>Company growth? Doubling revenue and halving revenue should average to \(1\), not \(1.25\). <em>Geometric</em>.</li>
<li>Disease incidence? 1-in-100 and 1-in-10000 should probably average to 1-in-1000, not 1-in-200. <em>Geometric</em>.</li>
<li>Microbe concentration? 1 part-per-million and 10000 parts-per-million should probably be 100 parts-per-million. <em>Geometric</em>. One bad lake shouldn’t mess them all up.</li>
<li>Heights of kids in a classroom? a bunch of 5.somethings and some 6.somethings — it doesn’t really matter, they’ll be really close.</li>
<li>Distances of fire stations to homes in a city? Tricky. I’d say <em>arithmetic</em>; you want your mean to be sensitive to that small cluster of houses that is weirdly 10x as far from service as everyone else.</li>
<li>Split times on a 10km race? <em>Arithmetic</em>, you almost certainly want the mean to be one tenth of the total time.</li>
<li>Speeds on each leg of a road trip, like \(60\) mph and \(45\) mph. Trick question, <em>neither</em>, you want the harmonic mean, which is pretty much only for speeds; see the end of this article.</li>
</ul>
<p>Put differently: if massive values that are many orders of magnitude higher than others are <em>expected</em> and should not basically delete all your other data, then use a geometric mean.</p>
<hr />
<h2 id="1c-important-note-about-0">1c. Important note about 0</h2>
<p>If your data includes \(0\), then the geometric mean is 0.</p>
<p>But wait. What if you measure, like, bacteria concentration in a lake (a definitely order-of-magnitude-based value), and get 0 because you don’t detect anything? Or you detect ‘no earthquakes’ on some particular day?</p>
<p>And indeed, it turns out lots of scientists have data with 0s in it, and they’ve just been … <a href="http://www.arpapress.com/Volumes/Vol11Issue3/IJRRAS_11_3_08.pdf">working around it?</a></p>
<p>Okay. Maybe it is <em>possibly</em> reasonable to either:</p>
<ul>
<li>delete 0 values from your data, or</li>
<li>replace 0s with ‘very small numbers that aren’t 0’, such as \(\e = 10^{-k}\) for some \(k\) that makes them smaller than all your non-zero values</li>
</ul>
<p>…in order to keep your geometric mean from being literally 0. Why: maybe your instruments aren’t sensitive enough to detect values that are very near 0 and just report 0 instead. Or maybe your data is a ratio like ‘counts of a molecule per million liters’, and it is assumed that, no matter what, if you sampled <em>enough</em>, the thing you’re counting would eventually show up at least once, so it can’t ‘really’ be 0.</p>
<p>But, and <strong>this is important</strong>: if there are enough zero data points that your <em>choice of \(\e\)</em> for rounding is changing your geometric mean significantly, you are probably doing something wrong.</p>
<p>And if you decide instead to <em>delete</em> the 0 values, you better report the result as “the average X when X was present”, rather than just “the average X”, or you’re just <a href="https://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/0393310728">lying with statistics</a>.</p>
<p>It’s pretty confusing, though. You aren’t wrong for being unsure. People use geometric means all the time with data that can be zero, maybe doing one of the above workarounds, and I <em>really</em> doubt they’re all handling it correctly or reporting it correctly afterwards.</p>
<p>Oh, and by the way: if you are geometric-meaning a bunch of, say, annual company growth percentages, and getting 0 – say, for example, the data \(x = (-20\%, 0\%, 50\%, 120\%)\) – you’re doing it wrong. Those growth rates are percentages and need to be changed to actual multiplicative factors \(x = (0.8, 1.0, 1.5, 2.2)\), which gets rid of the zeroes. The only way to get a <em>factor</em> of 0 is for your company to go out of business.</p>
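<p>In code, that conversion is one line. A sketch (function names are mine, not any library’s):</p>

```python
import math

def growth_to_factor(pct):
    # -20% growth -> 0.8x; +120% growth -> 2.2x
    return 1 + pct / 100

def geometric_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

growth_pcts = [-20, 0, 50, 120]
factors = [growth_to_factor(p) for p in growth_pcts]  # [0.8, 1.0, 1.5, 2.2]

# A naive GM of the raw percentages would hit 0 (and -20!) and blow up;
# the GM of the factors is the honest average annual growth.
print(geometric_mean(factors))  # ~1.2747, i.e. about +27% per year
```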
<hr />
<h2 id="2-equations-for-geometric-statistics">2. Equations for Geometric Statistics</h2>
<p>Once again, the <strong>geometric mean</strong> is the log-transformed arithmetic mean:</p>
\[\GM[x] = e^{\AM[\log x]} = \sqrt[n]{\prod_i x_i} = \prod_i \sqrt[n]{x_i}\]
<p>By the <a href="https://en.wikipedia.org/wiki/Inequality_of_arithmetic_and_geometric_means">AM-GM inequality</a>, which is often just referred to as AM-GM, the geometric mean is <em>never greater</em> than the arithmetic mean (if the inputs are all positive. Otherwise all bets are off. Don’t use GM on negative numbers!):</p>
\[\AM[x] \geq \GM[x]\]
<p>Another important relationship, which explains why GM works so well for <em>ratios</em>:</p>
\[\GM \Big[ \frac{x_i}{y_i} \Big] = \frac{\GM[x_i]}{\GM[y_i]}\]
<div class="box">
<p><strong>Important note on doing statistics with a computer</strong>:</p>
<p>For the love of all things that are good, <em>do not program the formula \(\GM[x] = \sqrt[n]{\prod x_i}\) into a computer</em> (if you do not know what you are doing). Use a built-in library for it, or use one of the other two formulas (\(= e^{\AM[\log x]}\) or \(= \prod \sqrt[n]{x_i}\)). Multiplying arbitrarily long lists of numbers together can overflow: fixed-width integer types wrap around to garbage (possibly negative) values, and floating-point types saturate to infinity, rendering all your math utterly wrong. In particularly unfortunate cases, the resulting value will, by chance, be <em>close</em> and so seem reasonable, yet have been computed an entirely wrong way.</p>
<p>If you’re in a data-science-oriented language like R, you <em>might</em> get away with it, cause they tend to avoid including <a href="https://en.wiktionary.org/wiki/footgun">footguns</a>. If you’re in, like, C, you won’t.</p>
</div>
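<p>To make the failure mode concrete, a sketch in Python (where the float product overflows to <code>inf</code> rather than wrapping; since 3.8 the standard library also ships <code>statistics.geometric_mean</code>, which avoids the trap entirely):</p>

```python
import math
import statistics

data = [1e200] * 3  # large but perfectly legitimate values

# Naive formula: the running product overflows long before the n-th root.
naive = math.prod(data) ** (1 / len(data))

# Log-transformed formula: never forms the giant product.
safe = math.exp(statistics.fmean(math.log(x) for x in data))

# Built-in (Python 3.8+), which also avoids the overflow.
builtin = statistics.geometric_mean(data)

print(naive)    # inf
print(safe)     # ~1e200
print(builtin)  # ~1e200
```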
<p>The <strong>geometric standard deviation</strong> (GSD) is the same transformation, applied to the regular standard deviation.</p>
\[\text{GSD}[x] = e^{\text{SD}[\log x]}\]
<p>This is going to be useful if and only if it was a good idea to use a geometric mean on your data, and particularly if your data is <em>positively skewed</em>. Make sure you realize what this is saying. When using a Geometric Standard Deviation, the phrase “68.2% of values fall within one standard deviation of the mean” <em>means something different</em>:</p>
<p>The GSD, instead of giving an equal range on either side of the mean, gives an equal <em>factor</em>:</p>
\[\begin{aligned} e^{AM \pm SD} &= e^{AM}e^{\pm SD} \\
&= \GM[x] \text{GSD}[x]^{\pm 1} \\
&= \GM[x] \; {}_{\div}^{\times} \; \text{GSD}[x] \\
& \stackrel{!}{\neq} \GM[x] \pm \text{GSD}[x] \end{aligned}\]
<p>Note that \(\GM[x] \text{GSD}[x]^{\pm 1}\) means the two values are \((\GM[x] \text{GSD}[x], \frac{\GM[x]}{\text{GSD}[x]})\). Clearly the GSD should be not too different from \(1\), such that 68.2% of values fall in the range of \(\text{GSD}[x]^{\pm 1}\).</p>
<p>Analogously, if your data is well-described by a GM + GSD, it’s probably <em>not</em> well-described by an AM + SD, because it should be positively skewed, while your SD would suggest that the data is spread evenly around the mean. (Consider how weird it would be to say that “X% of the data falls within \(\frac{\mu}{2}\) and \(2 \mu\)” for non-skewed data like a normal distribution.)</p>
<p>If reporting confidence intervals for a given \(P\)-value, using geometric statistics, they will also not be the same distance from the geometric mean. For \(P=0.95\), with z-score \(z = 1.96\) (ie laughably permissive, false positives everywhere), the confidence intervals are:</p>
\[\text{CI}_{0.95} = \GM[x] e^{\pm 1.96 \text{SD}[\log x]} = \GM[x] \text{GSD}^{\pm 1.96}\]
<p>For interquartile ranges, it’s:</p>
\[\text{IQR} = \GM[x] e^{\pm 0.67 \text{SD}[\log x]} = \GM[x] \text{GSD}^{\pm 0.67}\]
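<p>Putting the last few formulas together, a sketch (standard library only; whether you want the population or sample standard deviation of the logs is your call):</p>

```python
import math
import statistics

def geo_summary(xs, z=1.96):
    logs = [math.log(x) for x in xs]
    gm = math.exp(statistics.fmean(logs))
    gsd = math.exp(statistics.pstdev(logs))  # swap in stdev for the sample version
    # The CI is an equal *factor* on each side of the GM, not an equal distance.
    ci = (gm * gsd ** -z, gm * gsd ** z)
    return gm, gsd, ci

gm, gsd, ci = geo_summary([1, 10, 100])
# gm is exactly 10: the data is symmetric on the log scale
```

<p>Note the multiplicative symmetry: \(\sqrt{ci_0 \cdot ci_1} = \GM[x]\), whereas an additive CI would have \(\frac{ci_0 + ci_1}{2} = \AM[x]\).</p>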
<hr />
<p>What about <strong>geometric standard error</strong>? This would say “how far is the sample geometric mean \(\GM[x_i]\) from the true geometric mean of the data \(\GM[\b{x}]\)?”</p>
<p>Reminder, since this one is a little less common: the <a href="https://en.wikipedia.org/wiki/Standard_error">standard error</a> (or more precisely, the “standard error of the mean”) of a set of \(N\) samples drawn from a normal distribution \(\cal{N}(\mu, \sigma^2)\) is the “standard deviation of the mean of \(N\) values from the <em>true</em> mean of the distribution” and is given by \(\text{SE}[x,N] = \frac{\sigma_x}{\sqrt{N}}\). Intuitively, as we sample more values from the normal distribution, our computed mean is also a normally distributed random variable, but has a <em>smaller</em> standard deviation, by a factor of \(\frac{1}{\sqrt{N}}\).</p>
<p>This measurement is actually reported in papers sometimes (apparently), but hard to find a good equation for online. It turns out that it’s given by:</p>
\[\text{GSE}[x, N] = \frac{\mu_G}{\sqrt{N}} \sigma_{\log x}\]
<p>Where \(\sigma_{\log x}\) is the standard deviation of \(\log x\), <em>not</em> the ‘geometric standard deviation’, and \(\mu_G\) is the <em>true</em> geometric mean of \(x\) (or at least, one that you have <em>way</em> more confidence in, like from a separate and much larger study).</p>
<aside class="toggleable" id="analysis" placeholder="<b>Aside</b>: Derivation of Geometric Standard Error">
<p>This derivation is adapted from <a href="https://projecteuclid.org/download/pdf_1/euclid.aoms/1177731830">Norris</a> with some jargon and theory removed, and replaced with my inexpert and likely faulty analysis.</p>
<p>Writing \(\mu_G = \GM[x] = e^{\E [\log x]}\), the true geometric mean of the distribution (assuming our data’s logarithm is normally distributed, as above):</p>
\[\begin{aligned} GSE^2 &= \E[(\GM[x_i] - \mu_G )^2] \\
&= \mu_G^2 \E [ (\frac{\GM[x_i]}{\mu_G} - 1)^2] \\
\end{aligned}\]
<p>First we simplify the first term in the expectation:</p>
\[\frac{\GM[x_i]}{\mu_G} = \frac{e^{\frac{1}{N} \sum \log x_i }}{e^{\E [\log x]}} = e^{\frac{1}{N} \sum \log x_i - \E[\log x]}\]
<p>Next we argue that, because for sufficiently large \(N\), \(\frac{1}{N} \sum \log x_i - \E[\log x] \Ra 0\), ie, as the sample log mean approaches the true log mean, we can approximate the exponential with its Taylor series \(e^x \approx 1 + x\).</p>
<p>To first order this gives:</p>
\[\begin{aligned} GSE^2
&\approx \mu_G^2 \E [ (\frac{1}{N} \sum \log x_i - \E[\log x])^2 ] \\
&=\frac{\mu_G^{2}}{N^2} \E [ (\sum (\log x_i - \E[\log x]))^2 ] \\
&= \frac{\mu_G^2}{N} \E [(\log x_i - \E[\log x])^2] \\
&= \frac{\mu_G^2}{N} \sigma^2_{\log x} \\
GSE &= \frac{\mu_G}{\sqrt{N}} \sigma_{\log x} \end{aligned}\]
<hr />
<p><strong>Analysis of the approximation</strong>:</p>
<p>When \(f(X)\) is analytic over the range of \(X\) we can compute \(\E[f(X)]\) by <a href="https://en.wikipedia.org/wiki/Taylor_expansions_for_the_moments_of_functions_of_random_variables">writing it as a Taylor series</a> around the mean \(\E[X] = \bar{X}\):</p>
\[\E[f(X)] \approx \E[f(\bar{X}) + f'(\bar{X}) (X - \bar{X}) + \frac{f''(\bar{X})}{2} (X - \bar{X})^2 + \ldots]\]
<p>But since expectations are linear we can write this as:</p>
\[\E[f(X)] \approx f(\bar{X}) + 0 + \frac{f''(\bar{X})}{2} \sigma^2_X + \ldots\]
<p>(Where the second term disappeared because \(\E[X - \bar{X}] = \E[X] - \E[X] = 0\).)</p>
<p>But it is <em>not</em> always valid to truncate this expression at the second-order term. The higher order terms each have a factor of \(\E[(X- \bar{X})^n]\), and there is no particular reason this should not be huge or even infinite. For \(\E[f(X)] \approx f(\bar{X}) + \frac{f''(\bar{X})}{2} \sigma^2_X\), we require that the remaining terms in the series be small enough to ignore:</p>
\[\frac{f^{(n)}(\bar{X})}{n!} \E[(X - \bar{X})^n] \approx 0 \; [\text{ for } n > 2]\]
<hr />
<p>In our case, let’s write the exponent as \(\D = \frac{1}{N} \sum_N (\log x_i - \E[\log x])\), and use \(\sigma = \sigma_{\log x}\) for the standard deviation of \(\log x\). By the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central Limit Theorem</a>, as \(N \ra \infty\), \(\D\) converges to a normal distribution with \(\mu = 0\), \(\sigma_{\D}^2 = \frac{\sigma^2}{N}\), even if \(\log x\) is not normally distributed, as long as \(\sigma^2\) is finite.</p>
<p>In this case, is it reasonable to truncate \(\E[f(\D)] = \E[(e^\D - 1)^2]\) to its second-order Taylor series \(\E[f(\D)] \approx f(\bar{\D}) + \frac{f''(\bar{\D})}{2} \frac{\sigma^2}{N} = \frac{\sigma^2}{N}\)?</p>
<p>I tried to work this out for a while, but realized that we don’t necessarily know <em>how</em> fast the distribution of \(\D\) converges to \(\cal{N}(0, \frac{\sigma^2}{N})\), so we can’t say what the error is for a particular value of \(N\). However, if \(\log x\) is already normally distributed (ie, we are dealing with truly lognormal data), then we can compute the size of the terms we are dropping using the equation for the <a href="https://en.wikipedia.org/wiki/Normal_distribution#Moments">moments</a> of a normal distribution:</p>
\[\E[X^n]= \sigma^n (n-1)!! \; [n \text{ even}]\]
\[\Ra \E[\D^n] = (\frac{\sigma}{\sqrt{N}})^n (n-1)!! \; [n \text{ even}]\]
<p>And because \(\bar{\D} = 0\), \(f(\bar{\D}) = 0\) and \(f^{(n)}(\bar{\D}) = 2^n - 2\). Using \(\frac{(n-1)!!}{n!} = \frac{1}{n!!}\) for even \(n\):</p>
\[f(\D) \approx \sum_{n \geq 2, \text{ even}} \frac{(2^{n} - 2)(n-1)!!}{n!} (\frac{\sigma}{\sqrt{N}})^n < \sum_{n \geq 2, \text{ even}} \frac{2^n}{n!!} (\frac{\sigma}{\sqrt{N}})^n\]
<p>The ratio between successive terms of the bound is \(\frac{f_{n+2}}{f_n} = \frac{1}{n+2}(\frac{2\sigma}{\sqrt{N}})^2\), so we need \(N \gg 4 \sigma^2\) for the terms to shrink fast enough that the dropped ones are negligible. Specifically, the first term we are dropping is \(n=4\), which is \(\frac{7}{4} (\frac{\sigma^{2}}{N})^2\), in case you want to compute an exact value to be sure it’s trivial. Which I guess doesn’t tell us much more: just that we need \(N \gg \sigma^2\).</p>
<p>Disclaimer: I am not a mathematician. Please don’t trust me.</p>
</aside>
<p>Since you don’t normally know \(\mu_G\), you can use the sample mean \(\GM[x]\), but then, like with the regular standard deviation and error formulas, you have to <a href="https://en.wikipedia.org/wiki/Bessel%27s_correction">change</a> \(N \ra N-1\):</p>
\[\text{GSE}[x, N] = \frac{\GM[x]}{\sqrt{N-1}} \text{SD}[\log x]\]
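<p>As a sketch of this sample formula in code (I am reading \(\text{SD}\) here as the <em>sample</em> standard deviation of the logs; if your field reads it as the population SD, swap <code>stdev</code> for <code>pstdev</code>):</p>

```python
import math
import statistics

def geometric_standard_error(xs):
    # GSE[x, N] = GM[x] / sqrt(N - 1) * SD[log x]
    # Assumption: SD[log x] is the sample standard deviation (Bessel-corrected).
    n = len(xs)
    logs = [math.log(x) for x in xs]
    gm = math.exp(statistics.fmean(logs))
    return gm / math.sqrt(n - 1) * statistics.stdev(logs)
```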
<p>The geometric standard error tells us that “our calculation of \(\GM[x]\) from \(N\) samples” has a probability distribution:</p>
\[\GM_N[x] \sim \cal{N}(\mu_G, \frac{\mu_G}{\sqrt{N}} \sigma_{\log x}) = \cal{N}(\mu_G, GSE[x,N])\]
<p>And the previous formula is our best estimate of it, given a sample of \(N\) values.</p>
<p>This is weird to compute. Until now we did not care about \(\sigma_{\log x}\). (And I assume, but am actually not sure, that \(\sigma_{\log x}\) has another factor of \(\frac{1}{N-1}\) in it, since it’s <em>also</em> computed from the sample?) But it should, for sufficiently high \(N\), give the correct numerical difference between \(\GM[x]\) and the true \(\mu_G\).</p>
<p>I’m not sure it’s a good idea to use the GSE, because I don’t think people will know how to think about it. It is a bit unintuitive: it is the only statistic we’ve talked about that is <em>additive</em> (so it’s talking about a difference \(\GM_N[x] - \mu_G\), rather than a ratio \(\frac{\GM_N[x]}{\mu_G}\)). It’s based on the idea that, even for log-normal data, the sample geometric mean \(\GM_N[x]\) is approximately normally distributed. (Strictly it is <em>log</em>-normally distributed, but its log-variance is \(\frac{\sigma^2}{N}\), so for high-enough \(N\) it is nearly symmetric and the normal approximation is fine, even though the underlying data is massively skewed.) Maybe some subfield thinks this makes perfect sense as a summary statistic – I don’t know. I would avoid it.</p>
<hr />
<h2 id="trivia-some-other-means">Trivia: Some Other Means</h2>
<p>By the way, there are other ways to summarize data than AM and GM. You can create all sorts of means using the same formula:</p>
\[\E_f [ X ] = f^{-1} \E [ f(X) ]\]
<p>For instance:</p>
<ul>
<li>AM: \(f(x) = x\)</li>
<li>GM: \(f(x) = \log x\)</li>
<li><a href="https://en.wikipedia.org/wiki/Harmonic_mean">Harmonic Mean</a>: \(f(x) = x^{-1}\)
<ul>
<li>Good for averaging <em>rates</em> of things, like the rates of a car over different legs of a trip. If you travel \(60\) mph one way and \(40\) mph the other way over the same distance (maybe there was some traffic), then your average speed is \(\frac{2}{\frac{1}{60} + \frac{1}{40}} \approx 48 \text{ mph}\), which is useful: the total travel time is equal if you travel \(48\) mph the whole time.</li>
</ul>
</li>
<li><a href="https://en.wikipedia.org/wiki/Root_mean_square">Root Mean Square</a>: \(f(x) = x^2\)
<ul>
<li>Used in electrical engineering, especially to measure the average strength of electrical signals. When \(\AM[x]=0\) (such as for an alternating current), then \(\text{RMS}(x) = \sigma_x\), the regular standard deviation.</li>
</ul>
</li>
<li><a href="https://en.wikipedia.org/wiki/Generalized_mean">Generalized means</a>: screw it, \(f(x) = x^p\) for any \(p\).
<ul>
<li>\(\infty\)-mean: set \(p=\infty\) and you get, weirdly, \(\E_{\infty}[x] = \max(x)\).</li>
<li>\(-\infty\)-mean: it’s just \(\min(x)\)</li>
<li>basically turn \(p\) up to get more contributions from higher values and turn it down to get more contributions from lower ones.</li>
<li>just kidding, don’t actually use anything except the first four, no one will know what you’re talking about if you use this.</li>
</ul>
</li>
</ul>
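<p>All of these fall out of one tiny function. A sketch of the general (“quasi-arithmetic”) mean, with the road-trip speeds from earlier:</p>

```python
import math
import statistics

def f_mean(xs, f, f_inv):
    # E_f[x] = f_inv(AM[f(x)])
    return f_inv(statistics.fmean(f(x) for x in xs))

legs = [60, 45]  # mph on each leg of the trip (equal distances)

arithmetic = f_mean(legs, lambda x: x, lambda x: x)
geometric = f_mean(legs, math.log, math.exp)
harmonic = f_mean(legs, lambda x: 1 / x, lambda x: 1 / x)
rms = f_mean(legs, lambda x: x * x, math.sqrt)

# harmonic < geometric < arithmetic < rms, always (for unequal inputs)
print(harmonic)  # ~51.4 mph -- the honest average speed for the trip
```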
<p>Generally speaking, if you find yourself with data of the form \(y = f(x)\), and you <em>know</em> it’s of the form \(y = f(x)\), it is probably a good idea to summarize it with statistics on \(f^{-1}(y)\). These are just specific implementations of that idea. The geometric mean turns out to be the safest option, in general, because \(x \mapsto \log x\) has the nice property of smoothing out both polynomials (\(x^p \mapsto p \log x\)) and exponentials (\(e^x \mapsto x\)).</p>
<p><em>[disclaimer: I am not a mathematician. What I wrote here is my best understanding at the time. Let me know if you think something is wrong or incomplete.]</em></p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:log" role="doc-endnote">
<p>I’m using \(\log x\equiv \log_e x \equiv \ln x\) throughout this page. This is normal in some fields and totally weird in others. Sorry if it’s weird. <a href="#fnref:log" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:richter" role="doc-endnote">
<p>Actually I looked this up and Richter is totally different: it measures the <em>amplitude of the local tremor</em>, as recorded on a seismograph (the displacement of a needle, or whatever) which is then normalized based on the distance from the measurement to the epicenter of the earthquake. It turns out there are <a href="https://en.wikipedia.org/wiki/Seismic_magnitude_scales">tons of</a> other scales, and some of them do a better job of measuring the actual energy of the earthquake. <a href="#fnref:richter" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Programming: Existence is Pain2018-04-19T00:00:00+00:00http://alexkritchevsky.com/2018/04/19/pain<p>My bike was stolen out of the backyard last night, so I’m feeling a little more aggravated by everything than usual.</p>
<p>This has had the effect of reminding me of a recurring sensation in my life as a software developer: that dealing with technology can be a <em>fundamentally miserable experience</em>, and that the skill of being ‘good’ at software is often mostly the same skill as <em>being able to take a lot of crap from faceless, abusive machines in ways that you feel powerless to do anything about.</em></p>
<p>So while I’m all for the “let’s teach everybody to code!” movement, I do sometimes wish we’d stop writing yet another Learn Machine Learning With Python Tutorial, or whatever, and maybe take some time to work on making the world around us better in little incremental ways, by making what we’ve already got <em>suck</em> less, for ourselves and for all the newcomers and for just everyone, so we can have less stress and more peace in our lives.</p>
<p>Basically some days I can’t honestly tell anyone they should get into this, when on a good day you get to slowly hack your way through bullshit and on a bad day you might just succumb and give up.</p>
<!--more-->
<hr />
<h2 id="an-illustrative-example">An illustrative example</h2>
<p>This morning I was trying to do some 3d graphing in Python. I’m just now getting comfortable in Python, finally, way later than I should’ve, after years of doing mostly Java and JS, so I’m looking up a lot of basic documentation.</p>
<p>I’m using <a href="https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html#getting-started">MatPlotLib</a>’s limited 3d support. It’s mostly a 2d drawing library, but has support for applying a projection matrix to 3d objects to render them onto a plane, and then doing rudimentary overlapping based on z-coordinates in the projection – or something like that. I don’t need much or I’d be using a more fully-featured, graphics-card-backed library, I guess.</p>
<p>See, look, I drew a thing:</p>
<p><img src="/assets/posts/2018-04-19/0-python.png" width="400px" /></p>
<p>Now, MatPlotLib is a library that’s used everywhere in the Python world, and students of programming everywhere are surely coming to this documentation. So when I go to look up how to make a 3d line plot, I do have to wonder why there aren’t quickly perusable code samples on the page to demonstrate usage of the APIs:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/1-mplot.png" width="400px" /></p>
<figcaption>
<p>mplot3d documentation</p>
</figcaption>
</figure>
<p>Surely showing us an example of it in use would not take too much space here.</p>
<p>Aha, but at least there’s a (<strong>source code</strong>) button, so we’ll just open that in a new tab (I’m using Chrome on a Mac):</p>
<figure>
<p><img src="/assets/posts/2018-04-19/2-rightclick.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And it’ll just open:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/3-chromewarning.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Oh, that’s annoying. And shouldn’t there be a button that says “yes, open this”?</p>
<p>Maybe it’s hidden in those dots.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/4-dots.png" width="300px" /></p>
<figcaption></figcaption>
</figure>
<p>Hrm. Okay, that works, but I’m gonna be doing this a lot. I really prefer to open it in a tab instead of downloading it. I guess I regret clicking “open in new tab”, so I’ll just go back and open it the right way and see if I find it:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/5-click.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Phew, at least this time it downloads without complaining. I wonder why it worked that time.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/6-open.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Let’s just click on that:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/7-xcode.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And… oh. It opened in Xcode. I didn’t want that. Why would I want that? Why would Apple default to that? I never use Xcode.</p>
<p>Well, what I wanted was to open it in the browser, since it’s just a plain text <code class="language-plaintext highlighter-rouge">.py</code> file. But apparently there’s no way to set Chrome’s file associations in its settings. I looked, for a while:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/8-chromesettings.png" width="600px" /></p>
<figcaption></figcaption>
</figure>
<p>Weird. I really thought that setting existed some years ago. Why would they take out something so useful?</p>
<p>I googled it, and found… some gibberish blog posts covered in ads instead of some simple documentation:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/9-googled.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>The first result is from some site, Chron.com, which I do not ever want to see again. Intuitively, it turns out to be filed under “Small Business» Types of Businesses to Start» Open a Bar»”, and is only for Windows, and isn’t even what I want, and is probably years old (there’s no date), and is a horribly manual way for setting <em>OS</em> file associations, rather than just telling the browser what I want.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/10-chron.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>I just want to tell Chrome what to open <code class="language-plaintext highlighter-rouge">.py</code> files with, and ideally that’s “in a tab”. Which should just be documented on the browser website, though it’s not like I’ve come to trust Google to do anything human-centric. So I guess I’ll ask a slightly more specific question because, fortunately, these struggles are so common that people more driven than I am have gone and asked about them:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/11-pyfiles.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>There are two relevant results at the top. The first is… weirdly… the opposite of what I want. I wonder how they got it working in the first place?</p>
<p>The second is on ‘superuser.com’, which is a StackExchange for computer…. superusage… (of <code class="language-plaintext highlighter-rouge">sudo</code>, ie <code class="language-plaintext highlighter-rouge">Super-User-DO</code>, fame) which bodes well, because most of the community references on the internet are total trash and StackExchange/Overflow/their derivatives are the only thing keeping any of us sane.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/12-pychrome.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Hey, it’s just what I’m looking for:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/13-superuser.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>The asker is even asking directly about MatPlotLib documentation! But there’s no answers, just comments. Commenter <code class="language-plaintext highlighter-rouge">Kat</code> points me to another <a href="https://superuser.com/questions/399538/getting-chrome-to-open-text-files-in-a-tab">related question</a>:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/14-superuser-text.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Which turns out to be also what I want and this time there’s an answer:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/15-superuser-answer.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>… but the answer says it’s built-in and can’t be changed, which is infuriating because it’s my computer and I should be able to do what I want. The answerer <code class="language-plaintext highlighter-rouge">Synetech</code> – bless their heart – submitted a <a href="https://bugs.chromium.org/p/chromium/issues/detail?id=118204&thanks=118204&ts=1331749873">bug</a> to Chromium, which (thank god) was not closed with a “fuck you” like many issues on many issue trackers are (I mean, implicitly). But it’s a few years old and, despite having a few people chiming in that it’s important to them, nothing ever came of it:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/16-chromium.png" width="600px" /></p>
<figcaption></figcaption>
</figure>
<p>And it was archived last year for inactivity:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/17-chromium-closed.png" width="600px" /></p>
<figcaption></figcaption>
</figure>
<p>So it’s probably not happening. Sigh. I know I should bump it, but… not today.</p>
<p>Now, I can work around this by reading the file locally, but I’m still annoyed at this point. Why do things have to be bad, on purpose? Why can’t they just be good?</p>
<p>Alright, back to the file on my computer, which I would like to open in Sublime text, not Xcode. Anyway, maybe I can tell Apple to open files with Sublime and Chrome will find out (even if it’s not willing to open things in tabs…):</p>
<figure>
<p><img src="/assets/posts/2018-04-19/18-showinfinder.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>I happen to know that if I want to change Mac’s default file association, I have to select ‘open with’ and tell it to always use this application, so I do that:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/19-openwith.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And I select Sublime Text and tick ‘Always Open With’:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/20-sublime.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And it just works:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/21-unidentified.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Oh. Er. No. No, um… I wanted to open it, and I don’t care that it’s from an unidentified developer. Please?</p>
<p>Why is there no “go to security preferences…” button? Didn’t there used to be one? I know I’ve seen it somewhere before. There’s a (?) but it just goes to useless help pages. Where’s my ‘way out’?</p>
<p>Fine, I guess I know what to do here: I go to the Mac security preferences and, for some reason, tell it to open the last file it saw.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/22-securityprefs.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Weirdly, this menu knows about what it just did, and lets me override it. You’d think that would be <em>on the error dialogue</em>, but nothing makes sense anymore.</p>
<p>When I click “open anyway” (no password actually required?), I get the dialogue I <em>should</em> have gotten last time:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/23-openfile.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And finally I can click “open”, and I’m <strong>FINALLY</strong> done:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/24-finally.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>But for good measure, because I’m an engineer-y type of person, I should make sure to make this more efficient for next time. Maybe I can find a way to disable that useless security dialog, which I see all the time and have never wanted or found useful, and which has never saved anyone from anything.</p>
<p>I google something:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/25-untrusted.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And find an <a href="https://support.apple.com/kb/PH25088?locale=en_US">Apple support page</a> that lets me know I can control-click to override, next time:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/26-controlclick.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>(Or just right-click to open. I wonder if their docs are just not aware that people can right-click on Macs now? It is pretty new. Ten years, I think.)</p>
<p>Although</p>
<blockquote>
<p>The safest approach is to look for a later version of the app from the Mac App Store or look for an alternative app.</p>
</blockquote>
<blockquote>
<p>To override your security settings and open the app anyway:</p>
</blockquote>
<p>Which would be fine if it wasn’t a <em>Python file</em>. Which will never be trusted. Because it’s text. Sometimes I wonder if the designers of operating systems are… actually… people? Surely they share some of my frustrations? Surely they realize this sucks, and would add a “I’m a developer and I need to open files sometimes” option, or something? I’m not, like, running a file that’s going to steal my passwords. It’s <em>plain text</em>.</p>
<p>Okay, I fiddle with my search terms:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/27-disable.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>This time I find something more useful:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/28-gatekeeper.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>I learn a few things from here. 1: The thing in Mac that does this annoying prevention of opening downloaded files is called Gatekeeper. Useful for later searching.</p>
<p>2: There used to be a way to disable it permanently, and they… took it out. Here’s an older screenshot:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/29-anywhere.png" width="400px" /></p>
<figcaption>
<p>apparently not a thing anymore</p>
</figcaption>
</figure>
<p>This makes me hateful.</p>
<p>Anyway, there must be a way to do it, right? I google “disable gatekeeper 2018”, hopefully to only get articles that aren’t from before Apple made their product intentionally worse:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/30-2018.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Which works swimmingly, though wow are these results sketchy:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/31-sierra.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>Somehow the 5 stars on the second one, plus the <code class="language-plaintext highlighter-rouge">drcleaner.com</code> domain, makes me assume it’s going to try to get me to download a virus-scanner that hijacks my computer to mine bitcoins as soon as I click on it, so I don’t.</p>
<p>The first result turns out to want to show notifications (why? why does anyone do this?), has two sets of social media buttons (why? does anyone click those? please stop doing it so they go away. And why <em>two</em>?) and a bunch of other screen-filling garbage, and is from 20<em>16</em> (maybe I should have searched for <code class="language-plaintext highlighter-rouge">High Sierra</code>)…</p>
<figure>
<p><img src="/assets/posts/2018-04-19/32-tekrevue.png" width="400px" /></p>
<figcaption>
<p>all of this comes before the article</p>
</figcaption>
</figure>
<p>…but at least it does recognize the issue and tells us that, yes, we’re not imagining things; Mac has been making this system more irritating with each release, and there is a way to fix it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo spctl --master-disable
</code></pre></div></div>
<p>But before we <code class="language-plaintext highlighter-rouge">sudo</code> a random command, maybe let’s check what it does so we don’t shoot ourselves?</p>
<figure>
<p><img src="/assets/posts/2018-04-19/33-man.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>And specifically:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/33-man-2.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>That’s not very helpful. Maybe I don’t want to ‘master-disable’ every policy; maybe we just want to disable <em>one</em> policy? But when your APIs are stringly-typed and your documentation doesn’t document them… Ugh, I don’t know; this is not what I wanted to do today. I just wanted to learn how to make a line plot. I give up. Let’s just do it:</p>
<figure>
<p><img src="/assets/posts/2018-04-19/34-sudo.png" width="250px" /></p>
<figcaption></figcaption>
</figure>
<p>Hurrah! The button is back.</p>
<figure>
<p><img src="/assets/posts/2018-04-19/35-anywhere.png" width="400px" /></p>
<figcaption></figcaption>
</figure>
<p>What fun! What a fun way to open a file to view some code.</p>
<p>Um, precisely none of this had to be this way, and it wouldn’t even have been hard.</p>
<hr />
<h2 id="in-summary">In Summary</h2>
<figure>
<p><img src="/assets/posts/2018-04-19/36-pain.png" width="600px" /></p>
<figcaption></figcaption>
</figure>
<p>And please don’t forget to spend some time making your software less frustrating to use, even if you don’t know how to quantify the benefits of it. These things add up.</p>
<h1>Meditation on Taylor Series</h1>
<p><em>2018-03-30 · http://alexkritchevsky.com/2018/03/30/taylor-series</em></p>
<p><em>(Notes. Definitely not interesting unless, at minimum, you really really liked calculus.)</em></p>
<hr />
<h2 id="1">1</h2>
<p>We can often write a differentiable function \(f(x)\) as a <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series</a> around a point \(x\), approximating it in terms of its derivatives at that point:</p>
\[f(x+a) = \sum_{n=0}^{\infty} \frac{a^{n} f^{(n)}(x)}{n!}\]
<p>And, under certain conditions, this series will converge exactly to the values of the function at nearby points.</p>
<!--more-->
<p><small>(It may be that there is a certain radius of convergence around \(x\) in which this approximation is valid. For the remainder of this page, assume we’re dealing with \(f\) and \(x\) such that \(f\) has a series which is convergent around \(x\) and we’re staying close enough for that to be valid.)</small></p>
<p><small>(You may be more used to seeing this as \(f(x) = \sum \frac{(x - x_0)^{n} f^{(n)}(x_0)}{n!}\). They’re equivalent, of course, but for our purposes it will be cleaner to write it out as an approximation for a displacement \(a\) from a point \(x\), rather than having to write the displacement as \((x - x_0)\).)</small></p>
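<p>As a quick numerical sanity check (a sketch, not from the original notes; the choice of \(f = \sin\) and the 30-term truncation are mine):</p>

```python
import math

# Approximate f(x + a) = sum_n a^n f^(n)(x) / n!  for f = sin,
# whose derivatives cycle through sin, cos, -sin, -cos.
def taylor_shift(x, a, terms=30):
    derivs = [math.sin, math.cos,
              lambda t: -math.sin(t), lambda t: -math.cos(t)]
    return sum(a ** n / math.factorial(n) * derivs[n % 4](x)
               for n in range(terms))

print(taylor_shift(1.0, 0.5))  # agrees with sin(1.5) to machine precision
```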
<p>This can be written in a cleaner notation if we let ourselves treat the derivative operator \(\p_{x}\) as a variable (sometimes we will omit the subscript \(x\) to keep things uncluttered) and then treat the whole summation as an operator acting on <em>f</em>:</p>
\[f(x+a) = [\sum \frac{(a \p_{x})^{n}}{n!}]f(x)\]
<p>And it’s cleaner still if we recognize the summation as the Taylor series for \(e^{x}\) (neglecting, perhaps, to define this rigorously):</p>
\[f(x+a) = e^{a\p_{x}} f(x)\]
<p><small>(In physics we look at this and say that \(\p_{x}\) is the ‘generator of translations’, in the sense of generators of <a href="https://en.wikipedia.org/wiki/Lie_group">Lie Groups</a>, and that \(e^{a \p_{x}}\) is the <a href="https://en.wikipedia.org/wiki/Shift_operator">translation operator</a>.)</small></p>
<p>This form is especially nice because it lets us translate by one variable at a time when working with multivariate functions:</p>
\[f(x+a, y) = e^{a\p_{x}} f(x,y)\]
<p>Or translate by complex variables, using \(\p_{z} = \frac{1}{2}(\p_{x} - i \p_{y})\):<sup id="fnref:negative" role="doc-noteref"><a href="#fn:negative" class="footnote" rel="footnote">1</a></sup></p>
\[f(z + a) = e^{a \p_{z}} f(z)\]
<p>Or calculate higher-dimensional Taylor series, using \(\nabla = (\p_{x}, \p_{y})\):</p>
\[f(\b{x} + \b{v}) = e^{\mathbf{v} \cdot \nabla} f(\mathbf{x}) = \Big[ \sum \frac{(v_{x} \p_{x} + v_{y} \p_{y})^{n}}{n!} \Big] f(x,y)\]
<p>Or write out multiple translations in a row:<sup id="fnref:curvature" role="doc-noteref"><a href="#fn:curvature" class="footnote" rel="footnote">2</a></sup></p>
\[f(x + a + b) = e^{b\p_{x}} e^{a\p_{x}} f(x) = e^{(b + a)\p_{x}} f(x)\]
<p>Or implement a time-translation operator for wave functions in (non-relativistic) quantum mechanics to compute how systems evolve in time while preserving total probability by construction<sup id="fnref:schrodinger" role="doc-noteref"><a href="#fn:schrodinger" class="footnote" rel="footnote">3</a></sup>:</p>
\[\psi(x, t) = e^{t \p_{t}} \psi(x, 0) = e^{- \frac{i t}{\hbar} H} \psi(x, 0)\]
<p>So it’s just all really great, when it works and the series converge and everything commutes the way you expect, and you can take integrals and derivatives term-by-term and everything’s somehow peachy.</p>
<p><small>(In physics we tend to, instead of carefully proving things converge, just do the calculations and <em>see</em> if they match what they should be afterwards, and then wave our hands and conclude that it works, because it’s easier that way and because (I suspect) getting a coherent calculus of operators is an analytical nightmare, and definitely not in immediate reach of the curious undergraduate.)</small></p>
<hr />
<h2 id="2">2</h2>
<p>Assume \(F(x) = \int f(x) dx\) exists, and consider antidifferentiation as a left inverse of the differentiation operator:</p>
\[\p^{-1} f(x) = \int_{0}^{x} f(x') dx' = F(x) - F(0)\]
<p><small>(Why left? Because \((\p \circ \p^{-1}) f = f\), but \((\p^{-1} \circ \p) f = f + c\) is only equal up to a constant.)</small></p>
<p>What can we do with \(\p^{-1}\)? Well, we can produce the \(\frac{a^{n}}{n!}\) term in our Taylor expansion:</p>
\[\begin{aligned}
\p^{-1} (1) &= x \\
\p^{-2} (1) &= \frac{x^{2}}{2} \\
\p^{-n} (1) &= \frac{x^{n}}{n!} \\
\p_{a}^{-n} (1) = \p^{-n} (1) \Big|_{x = a} &= \frac{a^{n}}{n!} \\
\end{aligned}\]
<p>(That’s antidifferentiation with respect to \(a\) instead of \(x\). They’re just symbols, after all.)</p>
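<p>A tiny stdlib-only sketch of the repeated antidifferentiation \(\p^{-n}(1) = \frac{x^{n}}{n!}\), on polynomial coefficient lists (the representation and names are mine):</p>

```python
# Represent a polynomial as a coefficient list [c0, c1, ...] meaning sum_k c_k x^k.
def antiderivative(coeffs):
    # One application of \partial^{-1}: antidifferentiate, with the
    # constant term set to 0 (i.e. lower limit of integration 0).
    return [0.0] + [c / (k + 1) for k, c in enumerate(coeffs)]

poly = [1.0]  # the constant function 1
for _ in range(4):
    poly = antiderivative(poly)
print(poly)  # coefficient 1/24 on x^4, i.e. x^4 / 4!
```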
<p>And therefore:</p>
\[e^{x} = \Big[ \sum_{n = 0}^{\infty} \p^{-n} \Big] 1 = 1 + x + \frac{x^{2}}{2!} + \ldots\]
\[e^{a\p_{x}} = \sum \p_{a}^{-n} (1) \p_{x}^{n} = 1 + a \p + \frac{a^{2}}{2!} \p^{2} + \ldots\]
\[f(x+a) = e^{a \p_{x}} f(x) = \Big[ \sum \p_{a}^{-n}(1) \p_{x}^{n} \Big] f(x)\]
<p>Really, since \(\p_{a}\) refers to a different variable, it will just treat \(\p_{x} f(x)\) as a constant – so we can just write:</p>
\[f(x+a)= \Big[ \sum (\p^{-1}_{a} \p_{x})^{n} \Big] f(x)\]
<p>Which has a nice symmetry to it. It reminds me of a change of basis, which, in some sense, it is.</p>
<p>It basically means: project \(f(x)\) onto its behavior at each polynomial order \(\frac{x^{n}}{n!}\), and then write it literally in terms of those polynomial orders using \(\frac{a^{n}}{n!}\). If (1) \(f\) is truly constructible entirely from polynomials, and (2) the resulting sum converges, this should be equivalent to \(f(x+a)\).</p>
<hr />
<h2 id="3-misc">3 Misc</h2>
<p>If we consider \(f\) as an abstract function object which only takes on a value \(f(x)\) when composed with a point \(\hat{x} \circ f = f(x)\), then we can write suggestive (but probably not too meaningful) equations like:</p>
\[(\hat{x} + \vec{a}) \circ f = \hat{x} \circ \Big[ \sum (\p^{-1}_{a} \p_{x})^{n} \Big] \circ f\]
\[(\hat{x} + \vec{a}) = \hat{x} \circ \sum (\p^{-1}_{a} \p_{x})^{n}\]
\[\hat{x} \circ (+a) = \hat{x} \circ \sum (\p^{-1}_{a} \p_{x})^{n}\]
<p>This says, approximately, that \(T_{\vec{a}}\), translation by \(\vec{a}\), is associative, and can be implemented in either x-space or ‘operators on functions’-space.</p>
\[(\hat{x} \circ T_{\vec{a}}) \circ f = \hat{x} \circ (T_{\vec{a}} \circ f)\]
<p>Of course it’s associative even if you can’t write \(f\) as a Taylor series, but, this gives a sort of ‘implementation’ for it, when it is.</p>
<p>There are other ways to conceptually ‘implement’ \(T_{\vec{a}} f\):</p>
<p>Since the derivative operator gives the value of \(f(x + \e) \approx f(x) + \e f'(x)\), at a point slightly displaced from \(x\), we can presumably in principle do this infinitely many times to move a finite distance \(a\):</p>
\[f(x + a) = \lim_{\D x \ra 0} f(x) + f'(x)\D x + f'(x + \D x)\D x + \ldots\]
<p>Which of course corresponds to just integrating the derivative of \(f\):</p>
\[f(x + a) = f(x) + \int_{0}^{a} f'(x + y) dy\]
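<p>A numeric sketch of this last identity (my choices: \(f = \exp\), for which \(f' = f\), and a midpoint-rule integral):</p>

```python
import math

# f(x + a) = f(x) + integral_0^a f'(x + y) dy, via the midpoint rule
def translate(f, fprime, x, a, n=10000):
    dy = a / n
    integral = sum(fprime(x + (k + 0.5) * dy) for k in range(n)) * dy
    return f(x) + integral

print(translate(math.exp, math.exp, 0.0, 1.0))  # close to e = exp(0 + 1)
```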
<p>Alternatively we may write this as applying an infinitesimal translation operator \(T_{\e}\) infinitely many times, which just leads back to the exponential expression:</p>
\[f(x + a) = \lim_{\e \ra 0} T^{\frac{a}{\e}}_{\e} f(x) = \lim_{\e \ra 0} (1 + (T_{\e} - 1))^{\frac{a}{\e}} f(x)\]
\[= \lim_{\e \ra 0} (1 + \D_{\e})^{\frac{a}{\e}} f(x) = e^{a \p_{x}} f(x)\]
<hr />
<h2 id="4">4</h2>
<p>I don’t know, maybe this will be useful to someone, someday. I needed to write it down to keep various thoughts bundled together for later.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:negative" role="doc-endnote">
<p>The \(\frac{1}{2}\) and the negative sign are required so that \(\p_{z} \cdot dz =\) \(\p_{z} \cdot (dx + i dy) = 1\). <a href="#fnref:negative" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:curvature" role="doc-endnote">
<p>Whether \(e^{a\p_{x}}e^{b\p_{y}} = e^{b\p_{y}}e^{a\p_{x}}\) gets into whether the underlying manifold has any curvature. <a href="#fnref:curvature" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:schrodinger" role="doc-endnote">
<p>Given <a href="https://en.wikipedia.org/wiki/Schr%C3%B6dinger_equation">Schrödinger’s equation</a> \(H\psi = i \hbar \p_{t} \psi\), expand \(e^{t \p_{t}} \psi\) term-by-term, substitute, and unexpand. <a href="#fnref:schrodinger" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Some Intuition Around Entropy</h1>
<p><em>2018-02-23 · http://alexkritchevsky.com/2018/02/23/entropy-1</em></p>
<p><em>(Only interesting if you already know some things about information theory, probably)</em><br />
<em>(Disclaimer: Notes. Don’t trust me, I’m not, like, a mathematician.)</em></p>
<p>I have been reviewing concepts from <a href="https://en.wikipedia.org/wiki/Information_theory">Information Theory</a> this week, and I’ve realized that I never quite really understood what (Shannon) <a href="https://en.wikipedia.org/wiki/Entropy_(information_theory)">Entropy</a> was all about.</p>
<p>Specifically: I have finally understood how entropy is <em>not</em> a property of probability distributions per se, but a property of streams of information. When we talk about ‘the entropy of a probability distribution’, we’re implicitly talking about the stream of information produced by sampling from that distribution. Some of the equations make a lot more sense when you keep this in mind.</p>
<!--more-->
<h2 id="1-entropy">1 Entropy</h2>
<p>Recall that <strong>Information</strong> measures ‘the number of bits required to store data’. For example, an 8-bit number takes 8 bits of data to store or transmit. A binary digit, or the outcome of a coin flip encoded as a binary digit, takes 1 bit of data. The unit of bits is just a unit (corresponding to logarithms base-2); you can easily use others.</p>
<p>Meanwhile, <strong>Entropy</strong> is the <em>average amount of Information</em> which is learned by learning an unknown value, such as the outcome of a random variable that’s selected from a probability distribution. It’s a functional<sup id="fnref:functional" role="doc-noteref"><a href="#fn:functional" class="footnote" rel="footnote">1</a></sup> which maps a probability distribution to a quantity of Information. It usually takes the symbol \(H\) in information theory or \(S\) in physics. I’ll use \(H\).</p>
<p>The formula for the entropy of a discrete probability distribution \(X\) is:</p>
\[H[p] = -\sum_{x} p(x) \log p(x)\]
<p>This is better thought of as the ‘expected value of \(-\log p(x)\)’:</p>
\[H[p] = \bb{E}[- \log p(x)]\]
<p>I claim that \(- \log p(x)\) somehow captures the amount of information that we learn if we get the result \(x\), and so entropy somehow measures the expected value of information when sampling from this distribution.</p>
<p>Note that this equation for entropy is <em>never negative</em>. The negative sign is misleading: since \(p(x)\) is at most 1, \(\log p(x) \leq 0\). It would probably be better to always write \(\log \frac{1}{p(x)}\) to emphasize this, but that’s not usually done. Also, since \(0 \log 0 = 0\), because \(\lim_{p\ra 0^{+}} p \log p = \lim_{N \ra \infty} \frac{-\log N}{N} = 0\), 0-probability events never happen and contribute nothing to entropy.</p>
<p>Some examples: the entropy of a fair coin flip is, as expected, 1 bit:</p>
\[H = - p(\tt{H}) \log p(\tt{H})- p(\tt{T}) \log p(\tt{T}) = -\frac{1}{2} \log \frac{1}{2} -\frac{1}{2} \log \frac{1}{2} = 1\]
<p>The entropy of a biased coin that has \(p(\tt{H}) = \frac{3}{4}\) is:</p>
\[H = - \frac{3}{4} \log \frac{3}{4} - \frac{1}{4} \log \frac{1}{4} = 2 - \frac{3}{4} \log 3 \approx 0.811\]
<p>This means, intuitively, that as we make a sequence of \(N \ra \infty\) outcomes from this unfair coin (with \(\ra \frac{3}{4}N\) <code class="language-plaintext highlighter-rouge">H</code>s), it will be possible to compress that sequence into a sequence of \(\ra 0.811N\) bits, since that is another way to communicate the same volume of information.</p>
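<p>The two coin examples can be checked in a few lines (a sketch; “bits” means base-2 logarithms):</p>

```python
import math

def entropy(p):
    # H[p] = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0
    return -sum(q * math.log2(q) for q in p if q > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.75, 0.25]))  # biased coin: 2 - (3/4) log2 3, about 0.811 bits
```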
<hr />
<h2 id="2-microstates-vs-macrostates">2 Microstates vs Macrostates</h2>
<p>The discrete entropy formula \(H[p] = \bb{E}[- \log p]\) is usually derived from some axioms about how it should behave, which I find useless for intuition. Instead, here’s the easiest way to see why it’s defined that way.</p>
<p>Imagine a space of outcomes of some event, like “the results of \(X\) fair coin flips”, which we will describe using two different descriptions, one of <em>macrostates</em> which describes how many \(\tt{H}\) and \(\tt{T}\) were seen, and one of <em>microstates</em>, which describes the actual underlying sequence of results.</p>
<ul>
<li>At the microstate level, the system has \(M = 2^{X}\) equiprobable outcomes, which takes \(X\) bits of data to specify an outcome: the exact value of each coin flip.</li>
<li>At the macrostate level, there is a probability distribution that captures the chance \(p(x \tt{ Heads})\) of getting exactly \(x\) heads, given by \(p(x) = \binom{X}{x}/M\).</li>
</ul>
<p>Specifically, the chance of \(x\) <code class="language-plaintext highlighter-rouge">Heads</code> is:</p>
\[p(x \tt{ Heads}) = \frac{\langle \tt{number of ways to get exactly x Heads} \rangle}{\langle \tt{possible outcomes}\rangle}\]
<p>Let’s write the numerator as \(\binom{X}{x} = m_x\), so \(p(x) = \binom{X}{x}/M = \frac{m_{x}}{M}\). Thus, each macrostate “\(x\) Heads” corresponds to \(m_{x}\) possible microstates, out of the \(M\) total possible microstates, with \(p(x \; \tt{Heads}) = \frac{m_{x}}{M}\). It’s clear that \(\sum m_{x} = \sum \binom{X}{x}\) has to equal \(M = 2^{X}\) for this to be a probability distribution, and it does.</p>
<p>In this case we can clearly write down the information required to specify a specific microstate (it’s \(\log M\)) or the microstate within any particular macrostate (it’s \(\log m_x\)). Thus, if we learn a macrostate \(x\), we have learned \(\log M - \log m_x\) bits of information, and there are \(\log m_x\) bits <em>left</em> which we still do not know. So the entropy of the macrostate distribution is the expected value of this difference:</p>
\[\begin{aligned}
H[p(x)] &= \sum_{x} p(x) (\log M - \log m_x) \\
&= \sum_x p(x) \log \frac{M}{m_x} \\
&= - \sum_x p(x) \log \frac{m_x}{M} \\
&= - \sum_x p(x) \log p(x) \\
&= \bb{E}[ - \log p(x) ]\end{aligned}\]
<p>This is basically the idea. The underlying microstate takes \(\log M\) bits to specify, and after various outcomes we are left with, on average, \(\log m_x\) bits left to learn, so we learn, on average, \(\bb{E}[\log M - \log m_x]\) bits. Basically:</p>
\[H[p(x)] = \bb{E}[\langle \tt{information learned} \rangle] \\
= \langle \tt{total information} \rangle - \bb{E}[\langle \tt{remaining unknown information} \rangle]\]
<p>We see that the quantity \(I(x) = - \log p(x) = \log \frac{1}{p(x)}\) term can be understood as: “if you tell me we’re in state <code class="language-plaintext highlighter-rouge">x</code>, I still have to learn \(\log m_{x}\) bits of data to get to the exact microstate, meaning that you just told me \(\log M - \log m_{x}\) bits of data.” (This quantity is called the <a href="https://en.wikipedia.org/wiki/Self-information">self-entropy</a> or “surprisal” of outcome \(x\).)</p>
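<p>The derivation above is easy to verify numerically for, say, \(X = 10\) coin flips (a sketch):</p>

```python
from math import comb, log2

X = 10
M = 2 ** X  # number of microstates

# Entropy of the macrostate distribution p(x) = C(X, x) / M computed directly,
# vs. the expected information learned, E[log M - log m_x].
p = [comb(X, x) / M for x in range(X + 1)]
H_direct = -sum(q * log2(q) for q in p)
H_learned = sum((comb(X, x) / M) * (log2(M) - log2(comb(X, x)))
                for x in range(X + 1))
print(H_direct, H_learned)  # the two values agree
```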
<hr />
<p>This shows that the two \(p\)s in the usual expression of entropy mean <em>different things</em>. Really, the usual formula is just a particular expression for \(\bb{E}[I(x)]\), where \(x \in X\) is an event that can happen with probability \(p\). I think it should be viewed as incidental that the expression for \(I(x)\) happens to include \(p(x)\) also: there’s nothing about \(H = \bb{E}[\tt{information}]\) that requires the information come in the form of probabilistic events. There’s a probability distribution for <em>how much information we learn</em>, but it doesn’t matter where that information <em>comes</em> from.</p>
<p>For instance: if I just decide to tell you 1 bit of information with probability \(\frac{1}{2}\) and 100 bits with probability \(\frac{1}{2}\), you can still calculate an entropy \(H = \frac{1}{2}(1 + 100) = 50.5\) bits. It doesn’t matter that those bits themselves come from occurrences of events with probabilities. It’s just <em>any</em> information.</p>
<p>For a more interesting example: there’s no requirement that the macrostate descriptions be exclusive, as probability distributions are. If I flip 100 coins, maybe I’ll tell you “whether I got more than 50 heads” (1 bit of information) and “the exact sequence of results” (100 bits of information) with equal probability. That’s still \(50.5\) bits of data, but it’s not a probability distribution over the set of possible macrostate descriptions – some of the macrostates overlap.</p>
<p>(This is well-known but I had not really understood it before today, and I think it could use more emphasis. In fact it was Shannon’s original <a href="https://en.wikipedia.org/wiki/Channel_capacity">model</a> anyway.)</p>
<p>I think it is appropriate to view entropy <em>not</em> as “the data required to encode a stream of events”, but “the data required to re-encode a stream into another language”:</p>
<ul>
<li>\(H[p]\) tells us how much information we learn from moving from the ‘trivial language’ (“something happened!”) to another language (“there are \(x\) heads”).</li>
<li>\(I(x)\) is the remaining information required to move from a macrostate language to a microstate one (“the exact sequence was…”).</li>
<li>there is no reason we couldn’t chain together more description languages.</li>
<li>the unit of “bits” means “the data required to re-encode this sequence of labeled events into binary code”.</li>
</ul>
<p>This interpretation becomes important when we try to generalize our equation for entropy to continuous distributions, because we <em>can’t</em> encode a stream of infinite-precision events perfectly.</p>
<hr />
<h2 id="3-the-continuous-analog-of-entropy">3 The Continuous Analog of Entropy</h2>
<p>The naive way to generalize \(- \sum_x p(x) \log p(x)\) to a continuous distribution \(p(x)\) would be</p>
\[H[p] \stackrel{?}{=} - \int_{\bb{X}} p(x) \log p(x) dx\]
<p>But we realize that this can’t be right, using the intuition from above. The <em>real</em> definition is \(\bb{E}[I(x)]\), and it’s no longer true that \(I(x) = - \log p(x)\) when \(x\) is continuous. The probability that \(x\) equals an exact value is \(0\), and so we would get that \(I(x) = - \log 0 = \infty\)! Specifying the <em>exact</em> value would take ‘infinite’ information.</p>
<p>The problem is that \(p(x)\) would be promoted to a density function, and we need to use \(p(x) dx\) to compute any actual probabilities. Suppose we take as our ‘events’ the occurence of \(x\) in a range \((a,b)\):</p>
\[P[x \in (a,b)] = \int_{(a,b)} p(x) dx\]
<p>Thus:</p>
\[I(x \in (a,b)) = - \log \int_{(a,b)} p(x) dx\]
<p>Which seems reasonable. But what if \((a,b) \ra (a,a)\)?</p>
\[\lim_{\e \ra 0} - \log \int_{(a, a + \e)} p(x) dx\]
<p>If \(p(x)\) has antiderivative \(P(x)\), then \(\int_{(x,x+\e)} p(x') dx' = P(x + \e) - P(x) \approx \e p(x)\), giving:</p>
\[I(x \in (a,a)) \stackrel{?}{=} - \log \e p(x) = \log \frac{1}{\e} - \log p(x) = \infty - \log p(x)\]
<p>What does that mean?</p>
<p>Actually, it’s pretty reasonable. Of course it takes infinite information to zoom in completely on a point \(a\) – we have measured a value with infinite granularity, and thus gained infinite information! It seems as if there are two parts to the information: an infinite part, due to the “information of the continuum”, and a finite part, due to the non-uniformity of the probability distribution, which has the same \(-\log p(x)\) form that discrete information has.</p>
<p>Consider a uniform distribution \(\cal{U}(0, 1)\) on \((0,1)\), with density function \(1_{x \in (0,1)}\). It makes sense that “dividing the range in half” takes exactly 1 bit of data (to specify which half):</p>
\[H[\cal{U}(0, \frac{1}{2})] = H[\cal{U}(0, 1)] + \log \frac{1}{2} = H[\cal{U}(0, 1)] - 1\]
<p>And formulas like this one work even if specifying the <em>exact</em> value would take infinite data due to the continuum.</p>
<hr />
<p>Since we can’t write down a finite value for \(U = H[\cal{U}(0, 1)]\), perhaps we can instead just express the entropy of \(p\) as a function of \(U\). We measure “how different \(p(x)\) is from uniform”, without fully reducing it to the information required to specify exact values for \(x\). Suppose we partition the space into tiny buckets of width \(\D x\), and then later we will let the bucket size go to 0.</p>
<p>The uniform distribution \(\cal{U}(0,1)\) will contain \(\frac{1}{\D x}\) buckets, and thus have \(U(\D x) = - \log \D x = H_{\D x}[\cal{U}(0,1)]\) entropy. A uniform distribution \(\mathcal{U}(a,b)\) would contain \(\frac{b-a}{\D x}\) buckets, so specifying an exact bucket would take \(\log (b-a) - \log(\D x)\) information. Even if \(b-a < 1\), this is true for some sufficiently small bucket size. The first term is unaffected by the partitioning, and the second is \(U(\D x)\). As \(\D x \ra 0\), this becomes “entropy of the continuum”.</p>
<p>Thus, for a non-uniform distribution \(p(x)\), we just need to zoom in until \(p(x) \approx p(x + \D x)\), so that the distribution over each bucket is basically uniform. The information to specify a bucket of width \(\D x\) is:</p>
\[\begin{aligned}
I(x, x+\D x) &= - \log P[x, x+\D x] \\
&= - \log \int_{(x, x + \D x)} p(x') dx' \\
&\approx - \log (p(x) \D x) \\
&= - \log p(x) + U(\D x)
\end{aligned}\]
<p>And therefore the entropy of a continuous distribution can be given by:</p>
\[\begin{aligned}
H[p] &= \bb{E}[I(x)] \\
&= \int p(x) (- \log p(x) + U(\D x)) dx \\
&= - \int p(x) \log p(x) dx + U \\
&= - \int p(x) \log p(x) dx + H[\cal{U}(0,1)]
\end{aligned}\]
<p>The first term is the “naive” entropy formula. The second is the “entropy of the continuum”, and must be ignored. Put differently, we can’t <em>truly</em> compute \(H[p]\); what we actually compute is:</p>
\[- \int p(x) \log p(x) dx = H[p] - H[\cal{U}(0,1)]\]
<p>This is the entropy associated with “changing languages” into a bunch of tiny buckets of arbitrarily small width. Or, if you prefer, it’s the change of variables which produces \(p(x)\) from a uniform distribution by distorting the \(x\) axis. (The same argument seems to work if our variable isn’t continuous, but is just granular at a much smaller length scale than we can see. Then we could find a meaningful value for the second term, but we’d still want to discard it.)</p>
<p>This value \(- \int p(x) \log p(x) dx\) is called <a href="https://en.wikipedia.org/wiki/Differential_entropy">differential entropy</a>, and it’s not the limit of the discrete formula, but as we see it is still meaningful.</p>
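<p>The bucket argument can be sketched in code (the distribution and bucket count are my choices): discretize \(\cal{U}(0, \frac{1}{2})\) into buckets of width \(\D x\), and the discrete entropy minus \(\log \frac{1}{\D x}\) recovers the differential entropy \(\log \frac{1}{2} = -1\):</p>

```python
import math

def bucket_entropy(pdf, a, b, n):
    # Discrete entropy of the distribution over n equal buckets on (a, b),
    # approximating each bucket's probability as pdf(midpoint) * dx.
    dx = (b - a) / n
    ps = [pdf(a + (i + 0.5) * dx) * dx for i in range(n)]
    total = sum(ps)
    ps = [q / total for q in ps]  # renormalize the midpoint approximation
    return -sum(q * math.log2(q) for q in ps if q > 0)

n = 4096
dx = 0.5 / n
uniform_half = lambda x: 2.0  # density of U(0, 1/2)
print(bucket_entropy(uniform_half, 0.0, 0.5, n) - math.log2(1 / dx))  # -1.0
```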
<p>Note that differential entropy <em>can</em> be negative, because it’s easy to write down a probability distribution on a continuous space that has <em>less</em> entropy than \(\cal{U}(0,1)\): for example, any \(\cal{U}(a,b)\) with \((b-a) < 1\). I assume that saying distribution \(A\) has negative entropy relative to distribution \(B\) means that you can encode \(B\) into \(A\) efficiently, rather than the other way around!</p>
<p>An important point: the differential entropy is <em>not</em> invariant under changing variables for \(p(x)\). Why? Because it’s relative to a uniform distribution <em>on the variable \(x\)</em>. The differential entropy on another variable \(u = f(x)\) may rescale \(\cal{U}(0,1)\). We would need to also compute the entropy due to rescaling \(H[\cal{U}(0,1)]\). The most we can say is that:</p>
\[H[p(y)] - H[\mathcal{U}(y)] = H[p(x)] - H[\mathcal{U}(x)]\]
<p>Which does <em>not</em> mean that \(H[p(y)] = H[p(x)]\).</p>
<aside id="units" class="toggleable" placeholder="<b>Aside</b>: Units in logarithms <em>(click to expand)</em>">
<p>By the way: what should we make of a logarithm of a quantity that has units, like \(\log p(x)\)? Since \(p(x) dx\) is unitless, \(p(x)\) presumably has units of \(\tt{length}^{-1}\), and so if the unit on \(x\) is called \(L\), then something like \(p(x) = 2x\) must really mean \(p(x) = \frac{2x}{L^2}\).</p>
<p>But if \(p(x)\) is not unitless, what does it mean to take \(\log p(x) \stackrel{?}{=} \log 2x - \log L^2\)? I really don’t know. But for that matter, it doesn’t even make sense, units-wise, to take \(\log \frac{L}{L} = \log L - \log L = 0\). Apparently logarithms switch to some sort of ‘log units’ which are cancelled out by subtracting rather than dividing. Something like how we divide lengths but subtract angles (after all, angles emerge from logarithms of complex numbers).</p>
<p>I think the explanation is that there is another factor of \(\log L^2\) coming from the \(H[\cal{U}]\) term which cancels out the \(\log L^2\) from \(-\int p(x) \log p(x) dx\). And I think that this lack of balancing of units is why (or at least, is a huge sign of why) \(H[p]\) requires the continuum term in order to be preserved under changes of variable.</p>
</aside>
<p>If this wasn’t so long, I’d go on and talk about <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">relative entropy</a>, which generalizes this idea of the entropy of a distribution relative to another. I also need to learn it first. Maybe another time!</p>
<hr />
<h2 id="4-final-thoughts">4 Final Thoughts</h2>
<p>I think it’s funny that entropy seems so important to us. The entire theory is just, like, ‘the logarithm of probability theory’. If you have two probability spaces that take \(A\) and \(B\) possible states, you could write that the compression ratio is \(R = \frac{A}{B}\), or you could write that the entropy difference is \(\log R = \log A - \log B\). If you have two encodings which multiply information density by \(X\) and \(Y\), they compose to multiply by \(XY\), and thus reduce the information required by \(\log X + \log Y\). Etc. They really are the same thing.</p>
<p>A lot of this suggests that the right way to combine probability distributions (or re-encodings) is to <em>multiply</em> their probability distributions, and the reason this uses so many logarithms is that we are pretty good at adding numbers together in expectations (cf. integrals, expectation values), and not very good at <em>multiplying</em> them. We’d really like some sort of ‘multiplicative’ expectation value, but we aren’t used to using that, so we take logarithms to make everything easier to write!</p>
<p>In an expectation \(\bb{E}[x]\) we compute \(\sum x p(x)\). The ‘multiplicative expectation’ operation, \(\bb{G}[x]\), would presumably multiply everything together weighted by its probability:</p>
\[\bb{G}[x] = \prod x^{p}\]
<p>On a discrete uniform distribution this is just the geometric mean. For a continuous analog we would need a sort of ‘multiplicative integral’, which is a thing, albeit an obscure one. One strong reason for this is that Bayes’ rule in probability is already a multiplication (of probabilities expressed as odds, \((a:b) \times (c:d) = (ac : bd)\)), so there are other places where it seems like probability is more naturally expressed in multiplicative terms than additive ones.</p>
<p>Entropy in terms of \(\bb{G}\) is just \(\log \bb{G}[p^{-1}] = \bb{E}[-\log p]\), and we could define everything in this page using \(\bb{G}\), it would just look less familiar. Moreover, I assume there are similar versions for any other <a href="https://en.wikipedia.org/wiki/Generalized_mean">generalized mean</a>.</p>
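<p>A sketch of \(\bb{G}\) in code (the names are mine; <code class="language-plaintext highlighter-rouge">math.prod</code> requires Python 3.8+):</p>

```python
import math

def geometric_expectation(values, probs):
    # G[x] = prod_x x^p(x), which equals exp(E[log x])
    return math.prod(v ** q for v, q in zip(values, probs))

# Entropy recovered as log2 G[1/p], for the biased coin from earlier:
p = [0.75, 0.25]
G = geometric_expectation([1 / q for q in p], p)
print(math.log2(G))  # matches H[p] = 2 - (3/4) log2 3, about 0.811
```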
<p>Basically I suspect that we use logarithms to write things in bits, flattening all our ratios out into sums, only because we happen to be better at addition than multiplication. Something to think about.</p>
<hr />
<p>Also, I wonder if more can be said about exactly <em>how</em> we map languages into each other. When we say a space \(P\) can be represented by \(H[P]\) bits, we literally mean it: there exists a function mapping strings of bits to elements with an average ratio of \(H[P]\), and no function can do better than that. But I’m curious about trying to implement these mappings on, say, different kinds of formal languages. For instance: the space of sequences of coin flips until tails, \(\{ \tt{T}, \tt{HT}, \tt{HHT}, \ldots\}\), is clearly represented by the regular language <code class="language-plaintext highlighter-rouge">H*T</code>. Maybe we can make some kind of correspondence between a state machine of binary strings and a state machine of this sequence as the encoding.</p>
<hr />
<p>Also, I’m annoyed that the seemingly coherent concept of differential entropy makes <em>yet another example</em> where it seems that our fear of infinities is a real problem. Like, can’t we find a better way to handle them than carefully treading through limits and trying to cancel them out?</p>
<hr />
<p>Also, I wonder if the quantum mechanical <a href="https://en.wikipedia.org/wiki/Von_Neumann_entropy">version of entropy</a> is more easily understood in terms of being a description ‘relative to a uniform distribution’, like I did above. Because, uh, everyone seems pretty happy just throwing their hands up when they see negative values for differential entropy without trying to interpret them. Ah well.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:functional" role="doc-endnote">
<p>A function that acts on functions. It’s common to write them with square brackets, \(H[]\) instead of \(H()\), to remind of this. The expectation value \(\bb{E}[]\) is another functional. <a href="#fnref:functional" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1>Blogging</h1>
<p><em>2018-01-02 · http://alexkritchevsky.com/2018/01/02/today</em></p>
<p>In 2018 I am going to write. Mostly: because I don’t remember anything unless I write it out for myself. And a little bit: because I have a lot I want to say.</p>
<p>Update: cool, I actually did some writing in 2018.</p>