<p><strong>The Essence of Lagrange Multipliers</strong><br />
Alex Kritchevsky, 2024-06-10</p>
<p>In which we attempt to better understand the classic multivariable calculus optimization problem.</p>
<!--more-->
<p><a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange Multipliers</a> are what you get when you try to solve a simple-sounding problem in multivariable calculus:</p>
<blockquote>
<p>Maximize \(f(\b{x})\)<br />
Subject to the constraint that \(g(\b{x}) = c\)</p>
</blockquote>
<p>Lagrange multipliers are a trick for solving this. The trick is to instead maximize \(L = f(\b{x}) - \lambda (g(\b{x}) - c)\) for both \(\b{x}\) and a made-up variable \(\lambda\), by solving \(\del L = \p_{\lambda} L = 0\) instead. Equivalently, “notice that the solution will obey \(\del f \propto \del g\), rather than \(\del f = 0\) like it would if there were no constraint, and then invent \(L\) to enforce this.”</p>
<p>I’m told that Lagrange multipliers show up all over mathematics and are a widely used technique for solving real-world problems. Couldn’t tell you much about that. But I care about them for three reasons:</p>
<p>One, the explanation for how to solve them that you get in undergraduate calculus is very philosophically unsatisfying. I don’t like techniques that arise from “noticing” something and then proving it works. Depending on your background, noticing it might be easier or harder, but in either case it’s not satisfying for a problem to be solved by a trick. Instead the insight should somehow emerge from a natural calculation.</p>
<p>Two, I am very interested in the concept of “generalized inverses”, of which Lagrange multipliers include several great examples (to be demonstrated shortly). The algebra of these is a bit unfamiliar so it’s helpful to play with some examples. More generally I think there are a few concepts (generalized inverses, pseudoinverses, dual bases, vector division, frames) that ought to be more widely used, and I intend for this to be an example of why.</p>
<p>Three, various applications of Lagrange multipliers in physics (Lagrangian mechanics, QFT, statmech) seem to imply that Lagrange multipliers are an incredibly deep and important concept, far beyond their initial impression, and I want to understand how and why.</p>
<p>Disclaimer: this is <em>not</em> a pedagogical treatment of the subject. It’s me doing it in a weird way to get a chance to play with generalized inverses and some other weird ideas. Consider yourself warned.</p>
<hr />
<h1 id="1-lagrange-multipliers-as-inverting-a-projection">1. Lagrange Multipliers as inverting a projection</h1>
<p>Here is what I think is the most intuitive explanation of Lagrange multipliers. It is somewhat more complex than the standard explanations, but worth it because it’s “natural” in a way that most explanations are not. Maybe someday it will not be viewed as more complex, when everyone’s used to doing this kind of math. For the sake of being legible to a broader range of backgrounds, I’ll start with some exposition about multivariable functions and how to think about the optimization problem.</p>
<p>Okay. We wish to find the maximum value of \(f(\b{x}): \bb{R}^n \ra \bb{R}\) subject to the constraint that \(g(\b{x}) = c\). (In general we’ll be working with functions on \(\bb{R}^n\), but when writing out examples I’m just going to act like they’re in \(\bb{R}^3\) to save on notation.)</p>
<p>We’ll assume that \(f\) and \(g\) are both well-behaved smooth functions and that \(\del g \neq 0\) anywhere, so it defines a <a href="https://en.wikipedia.org/wiki/Regular_surface_(differential_geometry)">regular surface</a>, which we’ll call \(G\). Being regular basically means that it doesn’t have sharp corners or, like, glitches, anywhere. Picture a nice smooth shape, like a sphere.</p>
\[G = g^{-1}(c) = \{ \b{x} \mid g(\b{x}) = c \}\]
<p>Since \(G\) is defined by the solutions to a single constraint \(g\), it has a single normal vector \(\del g\). The change of \(g\) along a vector \(\b{v}\) is given by its directional derivative, which is the dot product with \(\del g\): \(dg(\b{v}) = \del g \cdot \b{v}\). Hence \(\del g\) is the <em>only</em> direction along which the value of \(g\) changes. Along the other \((n-1)\) dimensions it does not change value, so if \(g(\b{x}) = c\) at some point there are \((n-1)\) directions you can move along which it <em>stays</em> at \(c.\) Hence \(G\) is an \((n-1)\)-dimensional surface. For instance, a circle or line in \(\bb{R}^2\), or a sphere or plane in \(\bb{R}^3\). Most of this argument will work for \(G\) of any dimension, and in the next section we’ll repeat this for more constraints, which makes \(G\) lower-dimensional. But the algebra gets more complicated. Better to start with one constraint and \((n-1)\)-dimensional \(G\).</p>
<p>We wish to find the maximum of \(f\) on \(G\). How?</p>
<p>In 1d calculus we would look for the maximum of \(f\) at points that have \(\frac{df}{dx} = 0\). Maybe those points are a maximum, or a minimum, or otherwise just a stationary point where it becomes flat for a while but keeps going the same way afterwards (we’d have to check the second derivative to know). And if they’re a maximum, maybe they’re the global maximum or maybe not, we’d have to check. In any case, those would be the points that we’re interested in.</p>
<p>Similarly, for a multivariable function in the absence of a constraint, we would search for a maximum by looking for points that have gradient \(\del f = (f_x, f_y, f_z) = 0\), and we’d test if they’re a local maximum by looking at the signs of the eigenvalues of the second derivative. All negative means it’s a maximum, because the function decreases in every direction (equivalently: along any 1d slice it has negative second derivative). And of course we’d have to compare all the points we found and see which one is the global max, etc.</p>
<p>When we limit to points on the surface \(G\), we are not necessarily interested in the local or global maxima of the whole function \(f\) anymore. A global maximum point of \(f\) would still be a maximum if it <em>happened</em> to be on \(G\), but if it did not lie on \(G\) then we would not care about it at all. Meanwhile the maximum that’s on \(G\) may not have \(\del f = 0\) at all; it could just be some random value in the middle of \(f\)’s range.</p>
<p>Example: suppose \(G\) is the surface \(g(x,y) = x^2 + y^2 = R^2 = c\), a circle of radius \(R\) around the origin, and suppose the function we are maximizing is just \(f(x, y) = x\), the \(x\) coordinate. There is no global maximum (\(f\) increases as you head in the \(+x\) direction forever), but the maximum on \(G\) is clearly the point \((x,y) = (R, 0)\), since it’s the most \(x\)-ward point on the circle. Yet the gradient at that point is \(\del f = (1,0)\), which is certainly not zero.</p>
<p>The reason that \(\del f = 0\) is no longer a condition for a maximum is that we are really interested in only \(f\)’s derivative <em>when \(f\) is restricted to</em> \(G\). As we move in directions that <em>are</em> on \(G\), how does \(f\) change? If it’s constant, then we are at a local stationary point of \(f\). In the circle example: at the solution point \((x,y) = (R, 0)\), the gradient of \(f\) is \(\del f = (1,0)\), but the circle is going in the \(\pm \hat{y} = (0, \pm 1)\) direction, so the gradient of \(f\) <em>along</em> \(G\) is \(0\).</p>
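<p>A quick numeric sanity check of the circle example (a sketch; the value \(R = 2\) is an arbitrary choice of mine):</p>

```python
import numpy as np

# Circle example: g(x, y) = x^2 + y^2 = R^2, maximize f(x, y) = x.
R = 2.0
p = np.array([R, 0.0])                 # the claimed maximum on the circle

grad_f = np.array([1.0, 0.0])          # del f = (1, 0) everywhere
grad_g = 2 * p                         # del g = (2x, 2y), evaluated at p

# Unit normal along del g; the circle's tangent direction is perpendicular.
n = grad_g / np.linalg.norm(grad_g)
t = np.array([-n[1], n[0]])

# del f itself is nonzero at p, but its component *along* the circle is zero.
along_G = np.dot(grad_f, t)
```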
<p>How do we express the derivative of \(f\), but restricted to \(G\), as an equation? What we are looking for is called the <a href="https://en.wikipedia.org/wiki/Covariant_derivative">covariant derivative</a> of \(f\), with respect to the surface \(G\), written \(\del_G f\). It’s simply the regular derivative but projected onto the surface, which chops off any change that isn’t along \(G\):<sup id="fnref:covariant" role="doc-noteref"><a href="#fn:covariant" class="footnote" rel="footnote">1</a></sup></p>
\[\del_G f = \proj_G \del f\]
<p>And the condition for the maxima of \(f\) is that the covariant derivative is zero:</p>
\[\boxed{\del_G f = 0}\]
<p>I’m using \(\proj_G\) to mean the <a href="https://en.wikipedia.org/wiki/Vector_projection">vector projection</a> operator, which takes a vector to another vector. It lops off components that don’t lie in the surface. For instance we could project a vector \(\b{v} = (v_x, v_y, v_z)\) onto the \(xy\) plane, which would be given by \(\proj_{xy}(\b{v}) = (v_x, v_y, 0)\).</p>
<p>For some reason people usually think of projections like \(\proj_G\) as abstract “operators”, basically functions on vectors. But it is representable as a matrix, and is easier to think about in that form. When we’re thinking of it as a matrix I’ll write a dot product symbol instead, as \(\proj_G \cdot \del f\).</p>
<p>So what’s the matrix form of \(\proj_G\)? Well, the information we have about \(G\) is that \(\del g\) points in the direction orthogonal to \(G\). Therefore to get only the parts of a vector that lie <em>on</em> the surface \(G\), we just have to remove the parts that <em>aren’t</em> on \(G\), which is the projection onto \(\del g\):</p>
\[\proj_G \del f = (I - \proj_{\del g}) \cdot \del f\]
<p>(With \(I\) as the identity matrix.) The projection onto the gradient \(\proj_{\del g}\) is another matrix, which we can write down more easily. What it does to vectors is perhaps familiar from multivariable calculus:</p>
\[\proj_{\del g} (\b{v}) = \frac{\del g \cdot \b{v}}{\| \del g \|^2} \del g\]
<p>Although I prefer to write it in this more symmetric way:</p>
\[\proj_{\del g}(\b{v}) = \frac{\del g}{\| \del g \|} [ \frac{\del g}{\| \del g \|} \cdot \b{v}]\]
<p>With \(\b{n} = \frac{\del g}{\| \del g \|}\) it’s</p>
\[\proj_{\del g}(\b{v}) = (\b{n} \cdot \b{v}) \b{n}\]
<p>A more sophisticated way to write this, without having to specify the vector \(\b{v}\), is with a tensor product:</p>
\[\proj_{\del g} = \frac{\del g}{\| \del g \|} \o \frac{\del g}{\| \del g \|} = \b{n} \o \b{n}\]
<p>And it is cleaner if we also adopt <a href="https://en.wikipedia.org/wiki/Dyadics">dyadic notation</a>, in which we shorten \(\b{n} \o \b{n}\) to \(\b{nn}\):</p>
\[\proj_{\del g} = \b{nn}\]
<p>All of these are ways of writing the projection \(\proj_{\del g}\) onto the vector \(\del g\). It doesn’t matter which one you use. The important part is that they express \(\proj_{\del g}\) as a matrix, and then</p>
\[\proj_G \del f = (I - \proj_{\del g}) \del f = 0\]
<p>Is the constraint obeyed by \(\del f\) at its stationary points on \(G\).</p>
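<p>The matrix form is easy to check numerically at the circle example’s solution point (a sketch; the numbers are mine):</p>

```python
import numpy as np

# Circle example: g = x^2 + y^2, f = x, claimed solution (R, 0) with R = 2.
R = 2.0
grad_f = np.array([1.0, 0.0])
grad_g = np.array([2 * R, 0.0])

n = grad_g / np.linalg.norm(grad_g)
proj_grad_g = np.outer(n, n)           # proj_{del g} = nn
proj_G = np.eye(2) - proj_grad_g       # proj_G = I - nn

# At the solution the projected gradient vanishes, though del f itself doesn't.
residual = proj_G @ grad_f
```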
<p>Here is one more version. Suppose we happen to have a coordinate system \((u,v)\) on the surface \(G\); don’t ask me how we got it. Then locally there is a frame of unit vectors \((\b{u}, \b{v}, \b{n})\) with \(\b{n} = \frac{\del g}{\| \del g \|}\) as before. The identity matrix is then \(I = \b{uu} + \b{vv} + \b{nn}\), which is equivalent to writing \(\text{diag}(1,1,1)\) in the \((u,v,n)\) coordinate system. Then we can write \(I - \b{nn} = \b{uu} + \b{vv} = \text{diag}(1,1,0)\). That is, there is a basis \((\b{u}, \b{v}, \b{n})\) in which these are true:</p>
\[\begin{aligned}
\proj_G &= I - \b{nn} \\
&= (\b{uu} + \b{vv} + \b{nn}) - \b{nn} \\
&= \b{uu} + \b{vv} \\
&= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}
\end{aligned}\]
<p>Which is so simple that it felt worth mentioning. After all, this is all a projection is: if it removes one dimension from a vector, of course there’s a basis in which it preserves the other two dimensions but gets rid of one, right? It is often helpful to imagine coordinates like this to make the algebra more concrete (much more on this some other day…). But mostly we will not use this form.</p>
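<p>That \(\text{diag}(1,1,0)\) form can be verified for any unit normal (a sketch; the particular \(\b{n}\) below is made up, and \((\b{u}, \b{v})\) come from a QR factorization):</p>

```python
import numpy as np

# An arbitrary unit normal n, completed to an orthonormal frame (u, v, n).
n = np.array([1.0, 2.0, 2.0]) / 3.0
Q, _ = np.linalg.qr(np.column_stack([n, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))
u, v = Q[:, 1], Q[:, 2]                # orthonormal, both orthogonal to n
B = np.column_stack([u, v, n])

P = np.eye(3) - np.outer(n, n)         # proj_G = I - nn

# In the (u, v, n) basis the projection is diag(1, 1, 0).
P_in_frame = B.T @ P @ B
```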
<hr />
<p>So the condition on points on \(G\) which maximize \(f\) is that</p>
\[\del_G f = \proj_G \del f = 0\]
<p>This doesn’t require that \(\del f = 0\) itself. Yes, \(\del f\)’s components in the \((n-1)\) directions on the surface \(G\) are zero, but there’s one more direction besides those, the direction \(\del g\), along which it can be whatever it wants. Therefore at the solutions \(\del f\) only has to be proportional to that remaining direction, \(\del g\):</p>
\[\del f \propto \del g\]
<p>The proportionality constant \(\lambda\) is an unspecified variable, called the Lagrange Multiplier.</p>
\[\del f = \lambda \del g\]
<p>When we solve the equation we’ll come up with both a point \(\b{x}^*\) and a value of \(\lambda^*\) as the solution. \(\lambda^*\) has an interesting interpretation which we’ll talk about later, but for now just notice that its value doesn’t matter for this to be a solution. It just shows up as a side effect of setting all the other components of \(\del f\) to zero.<sup id="fnref:zero" role="doc-noteref"><a href="#fn:zero" class="footnote" rel="footnote">2</a></sup></p>
<p>There is a nicer way to come up with \(\del f = \lambda \del g\):</p>
<p>We had written \(\proj_{G} \cdot \del f = 0\). Well, the projection operator has a simple <a href="https://en.wikipedia.org/wiki/Generalized_inverse">generalized inverse</a>:<sup id="fnref:gen" role="doc-noteref"><a href="#fn:gen" class="footnote" rel="footnote">3</a></sup> since it takes one dimension, the direction \(\del g\), to \(0\), then the preimage of \(0\) can have any component along \(\del g\). We write this as a free parameter \(\lambda\):</p>
\[\begin{aligned}
\proj_G \del f &= 0 \\
\del f &= \proj_G^{-1}(0) \\
&= \lambda \del g \\
\end{aligned}\]
<p>This is the same as just “noticing it”, except that it treats dividing through by \(\proj_G\) as an explicit algebraic operation. To me that’s a big improvement. I like how that makes the free \(\lambda\) parameter show up through what could be rote algebra instead of any sort of trick. Invert a projection, get a free parameter. Easy. Also it’s easy to see how it generalizes: if, for instance, \(\proj_G\) projected out two dimensions instead, we’d get two free parameters. More on that in a second.</p>
<p>So far \(\del_G f = 0\) was a condition that we expected to be fulfilled at certain points, the maxima \(\b{x}^*\). <em>At</em> those points we’ll write everything with asterisks: \(f^* = f(\b{x}^*)\), \(g^* = g(\b{x}^*) = c\), \(\del f^* = \del f(\b{x}^*)\), \(\del g^* = \del g(\b{x}^*)\). Then the relation at a solution is</p>
\[\del f^* = \lambda^* \del g^*\]
<p>We can just solve for \(\lambda^*\):</p>
\[\lambda^* = \frac{\del f^*}{\del g^*}\]
<p>Yes, that’s division by a vector. It’s an unorthodox notation that I like. The meaning is that \(\b{a}/\b{b} = (\b{a} \cdot \b{b})/\| \b{b} \|^2\), so this really says that:</p>
\[\lambda^* = \frac{\del f^* \cdot \del g^*}{\| \del g^* \|^2}\]
<p>Note that had we included one more factor of \(\del g^*\), it would turn this back into the expected projection:</p>
\[\lambda^* \del g^* = \frac{\del f^* \cdot \del g^*}{\| \del g^* \|^2} \del g^* = \proj_{\del g^*} \del f^*\]
<p>This tends to happen when you extend division to vectors and matrices: \(\frac{\b{b}}{\b{a}} \b{a} = \proj_{\b{a}} \b{b}\). We’ll see a lot more of it in a moment.</p>
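<p>Both facts can be checked with concrete numbers (a sketch on the circle example, with my arbitrary \(R = 2\); at the solution \(\lambda^* = 1/(2R)\)):</p>

```python
import numpy as np

# Circle example at the solution point (R, 0): f = x, g = x^2 + y^2.
R = 2.0
grad_f = np.array([1.0, 0.0])          # del f*
grad_g = np.array([2 * R, 0.0])        # del g*

# "Vector division": lambda* = del f*/del g* = (del f* . del g*)/|del g*|^2.
lam = np.dot(grad_f, grad_g) / np.dot(grad_g, grad_g)

# One more factor of del g* turns it back into the projection of del f*.
proj = np.outer(grad_g, grad_g) @ grad_f / np.dot(grad_g, grad_g)
assert np.allclose(lam * grad_g, proj)
```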
<p>So that’s the value of \(\lambda^*\): it’s the ratio of the derivatives of \(f\) and \(g\). We’ll talk about what it means later. First let’s do this again with more than one constraint because it gets more interesting.</p>
<hr />
<h1 id="2-handling-multiple-constraints-with-the-pseudoinverse">2. Handling Multiple Constraints with the Pseudoinverse</h1>
<p>Now suppose we have \(k\) constraints \(\{\, g_i(\b{x}) = c_i \, \}\) instead of just one, and we’re trying to solve:</p>
<blockquote>
<p>maximize \(f(\b{x})\)<br />
subject to all of the \(g_i(\b{x}) = c_i\) at once.</p>
</blockquote>
<p>Each constraint defines a surface \(g_i^{-1}(c_i)\), and the intersection of all those surfaces is the surface \(G\) that we want \(f\) to live on:</p>
\[G = g_1^{-1}(c_1) \cap g_2^{-1}(c_2) \cap \ldots\]
<p>We’ll need a few assumptions. We assume that the surfaces defined by the constraints actually <em>do</em> all intersect, and that at the points where they intersect, they’re not parallel to each other (this would happen, for instance, with two spheres that touch at a point). Equivalently, the set \(\{ \del g_i \}\) is linearly independent and \(\text{span}(\{ \del g_i \})\) is a \(k\)-dimensional subspace of \(\bb{R}^n\). This second assumption is not critical but it makes the math easier for now.</p>
<p>Each constraint defines an \((n-1)\)-dimensional surface, e.g. the surface of a sphere in \(\bb{R}^3\). In general the intersection of two such surfaces can have various dimensions: \((n-2)\), such as two spheres intersecting in a circle, or \((n-1)\) again, such as the <em>same</em> sphere intersected with itself, or they could only intersect in a point (\(0\)-dimensional?) or not at all (…no idea what to call that). But by our assumptions that they do intersect and they’re not parallel when they do, we can conclude that the intersection of the \(k\) constraints is \((n-k)\)-dimensional.</p>
<p>The stationary points of \(f\) on \(G\) are still given by setting the covariant derivative with respect to this surface \(G\) to zero:</p>
\[\del_G f = \proj_G \del f = 0\]
<p>For exactly the same reason as before: a point at which \(f\) is maximized has no direction on \(G\) in which you can move to increase it, hence its derivative <em>on \(G\)</em> is zero.</p>
<p>Also as before, we can in principle solve this by inverting the projection:</p>
\[\begin{aligned}
\del_G f &= 0 \\
\proj_G \del f &= 0\\
\del f &= \proj_G^{-1}(0)
\end{aligned}\]
<p>And again the solution is going to be a bunch of free parameters, one for each direction that the projection erases. These directions form \(G_{\perp}\), the orthogonal complement to \(G\), which is spanned by the gradients of the constraints: \(G_{\perp} = \text{span}(\{ \del g_i \})\). So \(\del f\) is a linear combination of the vectors in \(G_{\perp} = \text{span}(\del g_1, \del g_2, \ldots)\):</p>
\[\begin{aligned}
\del f &= \proj_G^{-1}(0) \\
&= \lambda_1 \del g_1 + \lambda_2 \del g_2 + \ldots \\
\end{aligned}\]
<p>Where the \(\{ \lambda_i \}\) are our Lagrange multipliers. Easy. I still like the generalized inverse. But last time we came up with an explicit form for \(\lambda^*\): it was \(\lambda^* = \del f^*/\del g^*\). How would we do that here?</p>
<p>First, some more notations. We’ll write the list of constraints and list of constraint values as vectors:</p>
\[\begin{aligned}
G_{\perp} &= (g_1, g_2, \ldots, g_k) \\
G_{\perp}^* &= (g_1^*, g_2^*, \ldots, g_k^*) \\
&= (c_1, c_2, \ldots, c_k)
\end{aligned}\]
<p>And the gradients become an \(n \times k\) matrix (or is it \(k \times n\)? Eh, doesn’t matter):</p>
\[\del G_{\perp} = \{ \del g_1, \del g_2, \ldots \}\]
<p>The reason for writing the list of constraints as \(G_{\perp}\) instead of \(G\) is that \(\del G_{\perp}\) is the list of vectors <em>orthogonal</em> to the surface \(G\), which span the <em>subspace</em> \(G_{\perp}\), so we probably shouldn’t write them as \(\del G\). Also, it kinda makes sense: except at the solution values \(G_{\perp}^*\), the constraints describe points that <em>aren’t</em> on the surface \(G\).</p>
<p>In this notation the solution from before can be written as</p>
\[\begin{aligned}
\del f &= \proj_G^{-1}(0) \\
&= \lambda_1 \del g_1 + \lambda_2 \del g_2 + \ldots \\
&= \vec{\lambda} \cdot \del G_{\perp}
\end{aligned}\]
<p>This makes it clear that it is a linear equation of the form \(A \b{x} = \b{b}\), albeit for non-square \(A\). Here’s another way to write it. Starting from \(\proj_G \del f = 0\), instead of just inverting \(\proj_G\) and reasoning that it ought to be in \(\text{span}(\del G_{\perp})\), we could use \(\proj_G = I - \proj_{\perp G}\) and rearrange the terms:</p>
\[\begin{aligned}
(I - \proj_{\perp G}) \del f &= 0 \\
\del f &= \proj_{\perp G} \del f
\end{aligned}\]
<p>This is another way to write the condition on \(f\) at the solution, which is equivalent to saying that \(\del f \in \proj_G^{-1}(0)\), i.e. that \(\del f = \vec{\lambda} \cdot \del G_{\perp}\) for some \(\vec{\lambda}\).</p>
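<p>Here is a numeric sketch with two constraints (the surfaces are my own choice): a sphere \(x^2+y^2+z^2 = 4\) intersected with the plane \(z = 1\), maximizing \(f = x\). The intersection is the circle \(x^2 + y^2 = 3\) at height \(z = 1\), so the maximum is at \((\sqrt{3}, 0, 1)\).</p>

```python
import numpy as np

# Claimed solution point on the intersection circle.
p = np.array([np.sqrt(3.0), 0.0, 1.0])

grad_f = np.array([1.0, 0.0, 0.0])     # del f for f = x
grad_g1 = 2 * p                        # del g1 = (2x, 2y, 2z) for the sphere
grad_g2 = np.array([0.0, 0.0, 1.0])    # del g2 for the plane z = 1

# Columns span G_perp; "divide" del f by del G_perp via the pseudoinverse.
A = np.column_stack([grad_g1, grad_g2])
lam = np.linalg.pinv(A) @ grad_f       # the vector of Lagrange multipliers

# del f really is this linear combination of the constraint gradients.
assert np.allclose(A @ lam, grad_f)
```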
<hr />
<p>We’d like an algebraic version of both of these: a way of writing \(\proj_{\perp G} \del f\) that explicitly gives the values of \(\vec{\lambda}\) in \(\vec{\lambda} \cdot \del G_{\perp}\). But it is not so obvious how to write down \(\proj_{\perp G}\). Maybe you’d guess something like \((\proj_{\del g_1} + \proj_{\del g_2} + \ldots)\), but no. The problem is that all the constraints may be generally non-orthogonal, so we’d be projecting onto the same components twice if we did that.</p>
<p>Example: consider if there are just two gradients \((\del g_1, \del g_2)\). The value we want for \(\proj_{\perp G}\) is something like “project onto \(\del g_1\), and then add the projection onto \(\del g_2\), but only the part of it that we didn’t already get along \(\del g_1\)”. If \(\del g_1 = \b{x}\) and \(\del g_2 = \b{x} + \b{y}\) then \(\proj_{\perp G}\) should be \(\b{xx} + \b{yy} = I - \b{zz}\). We could write it generically as:<sup id="fnref:gs" role="doc-noteref"><a href="#fn:gs" class="footnote" rel="footnote">4</a></sup> \(\proj_{\perp G} = \proj_{\del g_1} + \proj_{\del g_2 - \proj_{\del g_1} \del g_2}\), but that’s pretty hard to use. What’s the general form?</p>
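<p>The projections in the two-gradient example can be computed concretely (a sketch; the helper name proj_onto is my own): the projection onto the span of the two gradients comes out to \(\b{xx} + \b{yy}\), and its complement, the projection onto \(G\) itself, is \(\b{zz}\).</p>

```python
import numpy as np

g1 = np.array([1.0, 0.0, 0.0])         # del g_1 = x
g2 = np.array([1.0, 1.0, 0.0])         # del g_2 = x + y

def proj_onto(a):
    """Matrix projecting onto the line spanned by a."""
    return np.outer(a, a) / np.dot(a, a)

# Project onto g1, then onto the part of g2 not already covered by g1.
g2_residual = g2 - proj_onto(g1) @ g2
P_perp = proj_onto(g1) + proj_onto(g2_residual)

# The naive guess double-counts the shared x-direction; it isn't even
# a projection (squaring it changes it).
naive = proj_onto(g1) + proj_onto(g2)
```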
<p>The answer is that we need to use something that acts like the inverse of the matrix \(\del G_{\perp}\), called the <a href="https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse">pseudoinverse</a>. I tend to write it as \(\frac{1}{\del G_{\perp}}\) or \((\del G_{\perp})^{-1}\) without making a distinction from the regular inverse, although most people use a different symbol such as \((\del G_{\perp})^+\). If \(\del G_{\perp}\) is \(k \times n\) then \(1/\del G_{\perp}\) is \(n \times k\). Using the pseudoinverse we can simply “divide through” by \(\del G_{\perp}\):<sup id="fnref:indices" role="doc-noteref"><a href="#fn:indices" class="footnote" rel="footnote">5</a></sup></p>
\[\begin{aligned}
\del f &= \vec{\lambda} \cdot \del G_{\perp} \\
&\Ra \\
\vec{\lambda} &= \frac{\del f}{\del G_{\perp}}
\end{aligned}\]
<p>Here is a blurb about it.</p>
<aside id="pseudoinverse">
<p><strong>The Pseudoinverse</strong></p>
<p>We are trying to solve the equation</p>
\[\del f = \vec{\lambda} \cdot \del G_{\perp}\]
<p>for \(\vec{\lambda}\). This has the form</p>
\[A \b{x} = \b{b}\]
<p>Where \(A = \del G_{\perp}\), \(\b{x} = \vec{\lambda}\), and \(\b{b} = \del f\). But \(A\) is not square (it’s \(k \times n\)), much less invertible, so we can’t just solve it by writing \(\b{x} = A^{-1}(\b{b})\). Or can we?</p>
<p>Consider what \(A \b{x} = \b{b}\) means. Suppose \(A = (\b{a}_1, \b{a}_2, \ldots)\) and that all of the \(\b{a}\) are linearly independent. Then we are searching for a set of coefficients \(\b{x} = (b_1, b_2, \ldots)\) such that we can write</p>
\[\b{b} = A \b{x} = b_1 \b{a}_1 + b_2 \b{a}_2 + \ldots \tag{maybe works}\]
<p>This isn’t always possible, because maybe \(\b{b}\) is <em>not</em> a sum of the columns of \(A\) (that is, maybe \(\b{b} \notin \text{col}(A)\), the column space of \(A\)). What we can <em>always</em> do, though, is project \(\b{b}\) onto the column space of \(A\), which we’ll write as:</p>
\[\proj_A \b{b} = b_1 \b{a}_1 + b_2 \b{a}_2 + \ldots \tag{always works}\]
<p>The coefficients \((b_1, b_2, \ldots)\) always exist, and will be given by the pseudoinverse of \(A\):</p>
\[(b_1, b_2, \ldots) = \frac{\b{b}}{A}\]
<p>Normally people do not write the pseudoinverse like this; they like to write something like \(A^+(\b{b})\). They are reluctant to write something that’s not a true inverse as \(A^{-1}\). But I think this is the better meaning of the symbol: it’s much more general and reduces to the usual meaning in simple cases. So I’m going to write \(A^{-1}\) or \(1/A\) anyway.</p>
<p>Once we have those components we can multiply by \(A\) again to reconstruct \(\proj_A \b{b}\):</p>
\[\proj_A \b{b} = \frac{\b{b}}{A} \cdot A = b_1 \b{a}_1 + b_2 \b{a}_2 + \ldots\]
<p>Which, when \(\b{b} \in \text{span}(A)\), is just \(\b{b}\) again. Generally speaking multiplying by \(A^{-1}\) and then \(A\) again becomes the projection onto \(A\). In fact this is the defining quality of the pseudoinverse: that although \(A^{-1} A \neq I\) like a regular inverse, it does have \(A^{-1} A = \proj_A\) and \(A A^{-1} A = A\) again.</p>
<p>(If you’re curious: the actual construction of the pseudoinverse uses the wedge product: \(b_1 = \frac{\b{b} \^ \b{a}_2 \^ \b{a}_3 \ldots}{\b{a}_1 \^ \b{a}_2 \^ \b{a}_3 \ldots}\), \(b_2 = \frac{\b{a}_1 \^ \b{b} \^ \b{a}_2 \^ \ldots}{\b{a}_1 \^ \b{a}_2 \^ \b{a}_3 \ldots}\), etc, and then you recast it as a matrix. When \(k = n\) this becomes the matrix inverse exactly; the numerator is \(\text{adj}(A)\) and the denominator is \(\det(A)\). Another (simpler but less-insightful) way to construct it is as \(A^{-1} = (A^T A)^{-1} A^T\).)</p>
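<p>These identities are easy to check numerically for a random full-column-rank matrix (a sketch; note that in numpy’s column-vector convention the products land on the other side of the ones above: \(A A^{-1}\) acts as \(\proj_A\), while \(A^{-1} A\) is the identity on coefficients):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))        # two independent columns in R^5

A_pinv = np.linalg.pinv(A)

# The (A^T A)^{-1} A^T construction agrees with numpy's pseudoinverse.
assert np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T)

# Inverting and multiplying again reproduces A: A A^{-1} A = A.
assert np.allclose(A @ A_pinv @ A, A)

# And A A^{-1} is the orthogonal projection onto the span of A's columns.
P = A @ A_pinv
assert np.allclose(P @ P, P) and np.allclose(P.T, P)
```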
<p>By the way, another way of looking at \(1/A\) is that it is the matrix form of the <a href="https://en.wikipedia.org/wiki/Dual_basis">dual basis</a> of the columns of \(A\): its columns are “dual basis vectors” \(\{ \tilde{\b{a}}_i \}\) that have \(\tilde{\b{a}}_i \cdot \b{a}^{j} = \delta_i^j\). Then each component is just given by a dot product with a dual basis vector:</p>
\[\begin{aligned}
\tilde{\b{a}}_i \cdot \b{b} &= \tilde{\b{a}}_i \cdot (b_1 \b{a}_1 + b_2 \b{a}_2 + \ldots) \\
&= b_1 (\tilde{\b{a}}_i \cdot \b{a}_1 )+ b_2 (\cancel{\tilde{\b{a}}_i \cdot \b{a}_2}) + \ldots \\
&= b_1 \\
\frac{\b{b}}{A} &= (\tilde{\b{a}}_1 \cdot \b{b}, \tilde{\b{a}}_2 \cdot \b{b}, \ldots) \\
&=(b_1, b_2, \ldots) \\
\end{aligned}\]
<p>Which makes the algebraic behavior nice and clear.</p>
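<p>A concrete check of the dual-basis picture (a sketch; the vectors are made up and deliberately non-orthogonal):</p>

```python
import numpy as np

a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([1.0, 1.0, 0.0])
A = np.column_stack([a1, a2])

# The rows of the pseudoinverse are the dual basis vectors.
a1_dual, a2_dual = np.linalg.pinv(A)

# They satisfy a~_i . a_j = delta_ij.
assert np.isclose(a1_dual @ a1, 1) and np.isclose(a1_dual @ a2, 0)
assert np.isclose(a2_dual @ a1, 0) and np.isclose(a2_dual @ a2, 1)

# So the components of anything in span(A) are dot products with the duals.
b = 3 * a1 + 5 * a2
coeffs = np.array([a1_dual @ b, a2_dual @ b])
```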
<p>Note that all of this only works as written if the vectors in \(A\) are all linearly independent. Otherwise there would be multiple choices for the \(b_i\). For instance if \(A = (\b{a}_1, 2 \b{a}_1)\) then \(\proj_A \b{b} = b_1 \b{a}_1 + 2 b_2 \b{a}_1 = (b_1 + 2 b_2) \b{a}_1\) and there’s more than one way to select \(b_1\) and \(b_2\) to get the same result. If this happened we could still express the answer in terms of the full <em>generalized inverse</em> of \(A\): there would be even more free parameters to tell us <em>which</em> of the equivalent representations to use, something like \((b_1, b_2) = (\lambda, \frac{1 - \lambda}{2})\). But I’d rather not think about that.</p>
<p>The terms “pseudoinverse” and “generalized inverse” are not exactly standardized, but I’ve settled on a way of using them. A pseudoinverse inverts only the parts of \(A\) that are not projections, and produces a single value. A generalized inverse inverts all of \(A\), including the parts that are projections, and therefore produces free parameters. The general relationship between the pseudoinverse and the generalized inverse of a matrix \(A\) is that</p>
\[\begin{aligned}
A_{\text{generalized}}^{-1}(\b{b}) &= A_{\text{pseudo}}^{-1}(\b{b}) + (I - A_{\text{pseudo}}^{-1} A) \cdot \vec{\lambda}\\
&= A_{\text{pseudo}}^{-1}(\b{b}) + \proj_{\perp A} \cdot \vec{\lambda} \\
\end{aligned}\]
<p>For some vector of free parameters \(\vec{\lambda}\). Since \(A_{\text{pseudo}}^{-1} A = \proj_A\), the remainder \(I - A_{\text{pseudo}}^{-1} A = \proj_{\perp A}\) is the projection onto the nullspace of \(A\).</p>
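<p>The relationship can be demonstrated with a deliberately underdetermined system, one equation in three unknowns (the numbers are my own):</p>

```python
import numpy as np

# One equation in three unknowns: A x = b has a 2-dimensional nullspace.
A = np.array([[1.0, 2.0, 2.0]])
b = np.array([9.0])

A_pinv = np.linalg.pinv(A)
x0 = A_pinv @ b                        # the single pseudoinverse solution

# The generalized inverse adds the projection onto the nullspace of A,
# weighted by a vector of free parameters lambda.
N = np.eye(3) - A_pinv @ A             # proj onto null(A)
lam = np.array([4.0, -1.0, 7.0])       # any lambda gives a valid solution
x = x0 + N @ lam

assert np.allclose(A @ x0, b)
assert np.allclose(A @ x, b)
```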
<p>(Hm, it does seem like it would be nice to have different notations for \(A_{\text{generalized}}^{-1}\) and \(A^{-1}_{\text{pseudo}}\). But the only thing I can think of is the \(A^+\) notation for the pseudoinverse, and I dislike that too much to use it.)</p>
<hr />
<p>There are two well-known examples of pseudoinverses which nobody normally calls by that name.</p>
<p>First, there’s the vector projection that I wrote earlier using vector division:</p>
\[\frac{\b{b}}{\b{a}} \b{a} = \frac{\b{b} \cdot \b{a}}{\| \b{a} \|^2} \b{a} = \proj_{\b{a}} \b{b}\]
<p>This is just the \(k=1\) case of the matrix pseudoinverse above:</p>
\[\frac{\b{b}}{A} A = \proj_A \b{b}\]
<p>Second, and this one is weird and dubious, there is the differential notation of calculus.</p>
<p>We learn to write \(\frac{df}{dx}\) as the derivative of a function, but we are always careful to point out that \(\frac{1}{dx}\) is not “actually” a fraction, and that multiplying by \(dx\) does not give \(df\) again. True, but it <em>is</em> a pseudoinverse. Or at least it acts like one. Suppose \(f = f(x,y)\). Then \(df = f_x dx + f_y dy\). We can write:</p>
\[\begin{aligned}
\frac{df}{dx} &= df \cdot \frac{1}{dx} \\
&= (f_x \d x + f_{y} \d y) \cdot \frac{1}{dx} \\
&= f_x \\
\frac{df}{dx} dx &= f_x \d x \\
&= \proj_{dx} df
\end{aligned}\]
<p>Where \(\frac{dx}{dx} = 1\) and we assume that \(\frac{dy}{dx} = 0\). Here I am being weird and treating \(\frac{1}{dx}\) as a vector in the same vector space as \(dx\). (If you know differential forms: it acts exactly like the partial derivative \(\p_x\).)</p>
<p>Also we can “solve” \(\frac{df}{dx} = 0\) by multiplying through again, with a generalized inverse this time:</p>
\[\begin{aligned}
\frac{df}{dx} &= 0 \\
df &= (\frac{1}{dx})^{-1} (0) \\
&= \lambda \d y \\
&= f_y \d y
\end{aligned}\]
<p>\((\frac{1}{dx})^{-1}\) is the inverse of a projection so the free parameter shows up to make it correct. Bit confusing, I know, but you get used to it.</p>
</aside>
<p>To summarize: we can write the projection onto the list of vectors \(\del G_{\perp}\) like this, using the pseudoinverse \(1/\del G_{\perp}\):</p>
\[\proj_{\del G_{\perp}} \del f = \frac{\del f}{\del G_{\perp}} \cdot \del G_{\perp}\]
<p>Which is equivalent to the Lagrange multiplier form, with \(\vec{\lambda} = \frac{\del f}{\del G_{\perp}}\):</p>
\[\proj_{\del G_{\perp}} \del f = \vec{\lambda} \cdot \del G_{\perp}\]
<p>And the condition satisfied by \(f\) at the maximum on \(G\), which was that \(\del_G f = 0\), is equivalent to saying that \(\del f\) is its projection onto \(\del G_{\perp}\):</p>
\[\del f = \vec{\lambda} \cdot \del G_{\perp}\]
<p>When the condition holds we have a stationary point \(\b{x}^*\), and we write</p>
\[\vec{\lambda}^* = \frac{\del f^*}{\del G^*_{\perp}}\]
<p>Which is the multi-constraint equivalent of what \(\lambda^* = \del f^* / \del g^*\) was for a single constraint.</p>
<hr />
<h1 id="3-the-meaning-of-l-part-1">3. The Meaning of \(L\), part 1</h1>
<p>So that is what’s going on with the Lagrange multiplier solution. Now let’s move on to some of the other issues. In particular, what is going on with that bizarre trick where you write \(L = f - \lambda g\) and compute \(\del L = 0\) instead?</p>
<p>Specifically the technique is this:</p>
<p>Instead of maximizing \(f(\b{x})\) on the surface \(g(\b{x}) = g^*\) over all values of \(\b{x}\), construct a new function called a “Lagrangian”:</p>
\[L(\b{x}, \lambda) = f(\b{x}) - \lambda (g(\b{x}) - g^*)\]
<p>or with multiple constraints:</p>
\[L(\b{x}, \vec{\lambda}) = f(\b{x}) - \vec{\lambda} \cdot (G_{\perp}(\b{x}) - G_{\perp}^*)\]
<p>and maximize that instead, with respect to both variables \(\b{x}\) and \(\lambda\). The condition \(\del_{\b{x}, \lambda} L = 0\) becomes:</p>
\[\begin{cases}
\del_{\b{x}} L = \del f - \lambda \del g &= 0 \\
\del_\lambda L = g(\b{x}) - g^* &= 0 \\
\end{cases}\]
<p>The second clause just encodes the original constraint again, while the first is the Lagrange multiplier constraint. Cute? But… why? It just seems like a hack pulled out of thin air. What is going on?<sup id="fnref:confuse" role="doc-noteref"><a href="#fn:confuse" class="footnote" rel="footnote">6</a></sup></p>
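<p>At least the recipe demonstrably works. Here is the circle example run through it symbolically (a sketch using sympy; the symbol names are mine):</p>

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
R = sp.symbols('R', positive=True)

# L = f - lam (g - c) for f = x, g = x^2 + y^2, c = R^2.
L = x - lam * (x**2 + y**2 - R**2)

# Setting all partials to zero: the x, y equations are the multiplier
# condition del f = lam del g; the lam equation restates the constraint.
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sols = sp.solve(eqs, [x, y, lam], dict=True)

# The x-most point (R, 0) shows up, with lam = 1/(2R).
found = any(s[x] == R and s[y] == 0 and sp.simplify(s[lam] - 1 / (2 * R)) == 0
            for s in sols)
assert found
```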
<p>Perhaps we can do better.</p>
<hr />
<p>First let’s adjust the notations a bit. So far I’ve been writing \(\del f\) everywhere. It will be nicer to write these as differentials instead of derivatives:</p>
\[\begin{aligned}
d_G f &= \proj_G d f \\
&= df - d_{\perp G} f \\
&= (I - \proj_{\perp G}) df \\
&= d f - \frac{d f}{d G_{\perp}} d G_{\perp} \\
\end{aligned}\]
<p>As a reminder the differential notation works like this, for \(d \b{x} = (dx, dy, dz)\):</p>
\[df = \del f \cdot d \b{x} = \frac{\p f}{\p \b{x}} \cdot d \b{x} = f_x d x + f_y d y + f_z d z\]
<p>If we had some coordinates \((u,v)\) on \(G\) and \((n)\) on \(G_{\perp}\), then we could write \(d_G f = f_u d u + f_v d v\) and \(d_{\perp G} f = f_n d n\). We are just splitting the differential into a part parallel and perpendicular to \(G\). A generic way to write this for any dimension is</p>
\[df = d_G f + d_{\perp G} f = \frac{\p f}{\p G} dG + \frac{\p f}{\p G_{\perp}} dG_{\perp}\]
<p>Where \(\frac{\p f}{\p G} dG = \frac{\p f}{\p (u, v)} \cdot (d u, d v) = f_u \d u + f_v \d v\) and \(\frac{\p f}{\p G_{\perp}} dG_{\perp} = f_n dn\).<sup id="fnref:frame" role="doc-noteref"><a href="#fn:frame" class="footnote" rel="footnote">7</a></sup></p>
<p>These notations work even if we don’t <em>know</em> the \((u,v)\) coordinate system. This is why differentials are nicer: they make it clear that the derivatives are actually coordinate-independent and could happen in whatever coordinate system you like: they care only about the <em>surface</em> \(G\), not the specific implementation of \(G\) in particular coordinates.</p>
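<p>A quick numerical illustration of that coordinate-independence (an example of my own devising): on the unit circle, \(d_G f\) computed by projecting the gradient onto the tangent agrees with the derivative of \(f\) along the angle \(\theta\), one particular parameterization of \(G\).</p>

```python
import math

# Illustrative check (my construction): on the circle G = {x^2 + y^2 = 1},
# the tangential differential d_G f agrees with the derivative of f
# along any parameterization of G, e.g. the angle theta.
def f(x, y):
    return x + y

def d_G_f(theta):
    # Gradient of f dotted with the tangent of the circle.
    x, y = math.cos(theta), math.sin(theta)
    fx, fy = 1.0, 1.0                  # del f for f = x + y
    tx, ty = -y, x                     # tangent d(x,y)/d(theta)
    return fx * tx + fy * ty

def df_dtheta(theta, h=1e-6):
    # Central finite difference of f restricted to the circle.
    fp = f(math.cos(theta + h), math.sin(theta + h))
    fm = f(math.cos(theta - h), math.sin(theta - h))
    return (fp - fm) / (2.0 * h)

theta = 0.3
print(d_G_f(theta), df_dtheta(theta))  # both ~ cos(0.3) - sin(0.3)
```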
<p>So, \(d_G f\) is the differential version of \(\del_G f\), which we might call the covariant differential. It’s the part of \(f\)’s differential that lives on the surface \(G\).</p>
<p>An interesting thing we can do is attempt to linearly approximate \(f\) in terms of it. We start with the expansion of \(d_G f\):</p>
\[d_G f = df - \frac{d f}{d G_{\perp}} dG_{\perp}\]
<p>But then fix the value of \(\frac{d f}{d G_{\perp}}\) to be a constant \(\vec{\lambda}\). Then approximate \(f\) to first-order around the point \(\b{x} = \b{x}^*\), exactly as if we were approximating \(f(x) = f(x^*) + f'(x^*) \int_{x^*}^x dx = f(x^*) + f'(x^*) (x - x^*)\) in 1d:</p>
\[\begin{aligned}
f(\b{x}) &= f(\b{x}^*) + \int_{\b{x}^*}^{\b{x}} d_G f \\
&= f(\b{x}^*) + \int_{\b{x}^*}^{\b{x}} df - \frac{d f}{d G_{\perp}} dG_{\perp} \\
&\approx f(\b{x}^*) + \int_{\b{x}^*}^{\b{x}} df - \vec{\lambda} \, dG_{\perp} \\
&= f(\b{x}) - \vec{\lambda} (G_{\perp} - G_{\perp}^*) \\
&= L(\b{x}, \vec{\lambda}) \\
\end{aligned}\]
<p>This feels close to the true meaning of \(L\): with the substitution, \(L(\b{x}, \frac{d f}{d G_{\perp}})\) is \(f\) with its change along \(G_{\perp}\) removed. I imagine writing this as \(\proj_G f\), the projection of a function onto a surface rather than a vector field. It’s “the local approximation to \(f\), but only the part on \(G\)”. As if somebody had said:</p>
<blockquote>
<p>instead of solving \(\del_G f = 0\)<br />
define \(L\) such that \(\del L = \del_G f\) around the solution, and set \(\del L = 0\) instead</p>
</blockquote>
<p>Another way of thinking about this: we can approximate a point \(\b{x}\) near \(\b{x}^*\) as</p>
\[\b{x} \approx \b{x}^* + \b{x}_G (G - G^*) + \b{x}_{G_{\perp}} (G_{\perp} - G_{\perp}^*)\]
<p>Where again we’ve written the point in some imaginary \(G\) coordinates (like the \((u,v)\) above) and \(G_{\perp}\) coordinates (which we have, they’re the values of the constraints). The \(\b{x}_G = \frac{d \b{x}}{d G}\) are the derivatives of the coordinate changes (Jacobians, if you want; hate that name). Naturally \(f\) can be expanded in terms of these:</p>
\[\begin{aligned}
f(\b{x})
&\approx f(\b{x}^* + \b{x}_G (G - G^*) + \b{x}_{G_{\perp}} (G_{\perp} - G_{\perp}^*)) \\
&\approx f(\b{x}^*) + (\del f) \frac{d \b{x}}{d G} (G - G^*) + (\del f) \frac{d \b{x}}{d G_{\perp}} (G_{\perp} - G_{\perp}^*) \\
&= f(\b{x}^*) + \frac{d f}{d G} (G - G^*) + \frac{d f}{d G_{\perp}} (G_{\perp} - G_{\perp}^*)
\end{aligned}\]
<p>Where we have used \((\del f) \b{x}_G = \frac{d f}{d \b{x}} \frac{d \b{x}}{d G} = \frac{d f}{d G}\). Then \(L(\b{x}, \frac{d f}{d G_{\perp}})\) is:</p>
\[\begin{aligned}
L &= f(\b{x}) - \frac{df}{dG_{\perp}} (G_{\perp} - G_{\perp}^*) \\
&= f(\b{x}^*) + \frac{df}{dG} (G - G^*)
\end{aligned}\]
<p>It is as if there was a coordinate system \((G, G_{\perp})\) on space, where the \(k\) constraints give coordinates on all the level sets and the other \(G\) variables are unspecified. Then \(L(\b{x}, \frac{df}{dG_{\perp}})\) is \(f\), expanded only in terms of the \(G\) coordinates.</p>
<hr />
<h1 id="4-the-meaning-of-l-part-2-the-envelope-theorem-and-legendre-transforms">4. The meaning of \(L\), part 2: the Envelope Theorem and Legendre Transforms</h1>
<p>But this can’t be the full story, because \(L\) is a function of \(\vec{\lambda}\) in general, and we have to solve for that also: the solution is given by both \(L_{\b{x}} = 0\) and \(L_{\vec{\lambda}} = 0\). Why?</p>
<p>I don’t know that I understand the full reason, but here are some observations.</p>
<p>First, it seems to be meaningful to do calculations directly on \(L\), with \(\vec{\lambda}\) unspecified, due to the weird way that maximizing functions interacts with calculus. There is a thing called the <a href="https://en.m.wikipedia.org/wiki/Envelope_theorem">envelope theorem</a> which is mostly used by economists. It says: suppose you want to know how the optimal value \(f(\b{x}^*)\) depends on some parameter \(\alpha\), where all the parts of the system can be changed by changing \(\alpha\) (which could be vectorial). So we are maximizing \(f(\b{x}, \alpha)\) subject to the constraint \(G_{\perp}(\b{x}, \alpha) = 0\) (we’ll fold the \(G_{\perp}^*\) term into the \(\alpha\)). Then the derivative \(d f^*/d\alpha\) is given by taking a partial derivative of \(L^*\) instead:</p>
\[\frac{df^*}{d \alpha} = \frac{\p L^*}{\p \alpha} = \frac{\p f(\b{x}^*, \alpha)}{\p \alpha} - \vec{\lambda} \cdot \frac{\p G_{\perp}(\b{x}^*, \alpha)}{\p \alpha}\]
<p>The solution point \(\b{x}^*\) is a function of \(\alpha\) also, at least in principle, but the Envelope Theorem’s statement is that this doesn’t matter. The argument goes like this: we’ll compute how \(f^* = f(\b{x}^*(\alpha), \alpha)\) changes as we change \(\alpha\), given that the constraint \(G_{\perp} (\b{x}, \alpha) = 0\) always holds. We find that:</p>
\[\begin{aligned}
\frac{d f^*}{d \alpha} &= \frac{\p f}{\p \alpha} + \frac{\p f}{\p \b{x}} \frac{d \b{x}^*}{d \alpha} \\
&= \frac{\p f}{\p \alpha} + \vec{\lambda} \cdot \frac{\p G_{\perp}}{\p \b{x}} \frac{d \b{x}^*}{d \alpha} \\
&= \frac{\p f}{\p \alpha} + \vec{\lambda} \cdot [\cancel{\frac{d G_{\perp}}{d \alpha}} - \frac{\p G_{\perp}}{\p \alpha}] \\
\frac{\p L^*}{\p \alpha} &= \frac{\p f}{\p \alpha} - \vec{\lambda} \cdot \frac{\p G_{\perp}}{\p \alpha} \\
\end{aligned}\]
<p>(Where we have used the fact that \(\frac{d G_{\perp}}{d \alpha} = \frac{\p G_{\perp}}{\p \b{x}} \frac{d\b{x}^*}{d \alpha} + \frac{\p G_{\perp}}{\p \alpha} = 0\). Also I’ve written \(\frac{\p f}{\p \b{x}}\) for \(\del_{\b{x}}\) for consistency.) The surprising part is that the \(d \b{x} / d \alpha\) terms cancel out, as a result of enforcing that we keep \(G_{\perp}(\b{x}(\alpha), \alpha) = 0\). And also that it doesn’t matter what the value of \(\vec{\lambda}\) is.</p>
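<p>The cancellation is easy to check numerically. A sketch (my own toy problem, not the post’s): maximize \(f(x,y) = x + y\) subject to \(x^2 + y^2 = \alpha\), for which \(f^*(\alpha) = \sqrt{2\alpha}\) in closed form, and compare the finite-difference \(d f^*/d\alpha\) against \(\lambda^* = 1/\sqrt{2\alpha}\):</p>

```python
import math

# Numerical check of the envelope theorem (example of my own devising):
# maximize f(x,y) = x + y subject to x^2 + y^2 = alpha. The optimum is
# x = y = sqrt(alpha/2), so f* = sqrt(2*alpha), and the multiplier is
# lambda* = 1/(2 x*) = 1/sqrt(2*alpha).
def f_star(alpha):
    return math.sqrt(2.0 * alpha)

def lam_star(alpha):
    return 1.0 / math.sqrt(2.0 * alpha)

alpha, h = 1.0, 1e-6
# d f*/d alpha by central finite differences...
df_dalpha = (f_star(alpha + h) - f_star(alpha - h)) / (2.0 * h)
# ...equals dL*/d alpha = lambda*, with no d x*/d alpha term needed.
print(df_dalpha, lam_star(alpha))  # both ~ 0.7071
```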
<p>I <em>think</em> the reason this works is that, even before we know \(\vec{\lambda}\), we know that it’s going to be a fixed value \(\vec{\lambda}^*\) when we find the optimum point. That means that it’s okay to treat it as a fixed value ahead of time, and then we can do calculus on \(L^*\) without worrying about it having its own partial derivatives.</p>
<p>So I guess that is a pretty compelling reason to use \(L\): it’s a version of \(f\) that you can do correct calculus on. By <em>assuming</em> that we can remove the projection onto \(G_{\perp}\) by finding some appropriate value of \(\vec{\lambda}\), we can do calculus, even before we know the value of \(\vec{\lambda}\).</p>
<p>(Incidentally, I find this kind of calculation really confusing. Even after writing it out I’m not sure I really get it. It is like there is a secret rule of calculus for commuting \(\frac{d}{d \alpha}\) and \(\text{max}_{\b{x} \in G_{\perp}(\alpha)^{-1}(0)}\) that I don’t really understand.)</p>
<hr />
<p>Another thought on \(\vec{\lambda}\) as an indeterminate. When we go to solve \(d_G f = 0\) by solving \(d L = 0\), we end up solving a system of equations:</p>
\[\begin{aligned}
\del_{\b{x}} L &= 0 \\
\del_{\vec{\lambda}} L &= 0 \\
\end{aligned}\]
<p>And the solution is given by the intersection of the two solutions. The first equation is all the points that have \(d_G f = 0\) on <em>any</em> level set. The second equation is all the points that solve the constraint with the actual values we were looking for, \(G_{\perp} = G_{\perp}^*\). Well, it is probably quite convenient that the resulting equation is symmetric in \(\b{x}\) and \(\vec{\lambda}\). The solution is the intersection of two surfaces, and that operation doesn’t care which one is a position coordinate and which one is a derivative \(\frac{df}{dG_{\perp}}\).</p>
<p>Earlier we saw that the \(G_{\perp}\) constraints can be regarded as coordinates themselves, which parameterize points in space in terms of the constraints values \(G_{\perp}\) instead of the positions \(\b{x}\), and we imagine writing</p>
\[f(\b{x}) = f(G, G_{\perp})\]
<p>as a way of factoring \(f\) into \(k\) variables that specify which level set we’re on, plus \((n-k)\) variables that say where on it to go. Well, when we write \(L(\b{x}, \vec{\lambda})\), we have in a sense changed the \(G_{\perp}\) variables out for a new set of variables \(\vec{\lambda}\). So in a way we have switched to solving for a new set of \(n\) variables:</p>
<ul>
<li>\(k\) values of \(\vec{\lambda}\) which tell you not which \(G_{\perp}\) to end up on but what the derivatives should be when you get there.</li>
<li>\(n-k\) values of \(\b{x}\) restricted to the surface \(G\) which tell you where on the surface to go.</li>
</ul>
<p>Of course you do not actually know how to turn the \(n\) values of \(\b{x}\) into the \(n-k\) values of the imaginary coordinates on \(G\), but at least in principle there are only \(n-k\) degrees of freedom there.</p>
<p>In this perspective the Lagrangian looks like a <a href="https://en.wikipedia.org/wiki/Legendre_transformation">Legendre Transform</a> in the \(G_{\perp}\) variables. For a function of two variables \(f(x,y)\), doing a Legendre transform in the \(y\) variable looks like</p>
\[\tilde{f}(x, p) = p y - f(x, y) \|_{y = (f_y)^{-1}(p)}\]
<p>Where \(p = \frac{\p f}{\p y}\) is a value of the derivative. The resulting function \(\tilde{f}(x, p)\) is like \(f\) but parameterized by \((x, p) = (x, \frac{\p f }{\p y})\) rather than \((x,y)\).</p>
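<p>A brute-force version of this (my example, not the author’s): approximate the supremum characterization \(\tilde{f}(p) = \sup_y \, (p y - f(y))\) on a grid for \(f(y) = y^2/2\), whose transform should come out to \(p^2/2\).</p>

```python
# A quick numerical Legendre transform (my example): for f(y) = y^2/2
# the transform f~(p) = sup_y (p*y - f(y)) is p^2/2, attained at
# y = (f')^{-1}(p) = p.
def legendre(f, p, ys):
    return max(p * y - f(y) for y in ys)

f = lambda y: 0.5 * y * y
ys = [i * 0.001 for i in range(-4000, 4001)]   # grid on [-4, 4]

for p in (0.5, 1.0, 2.0):
    print(p, legendre(f, p, ys))  # ~ p^2/2: 0.125, 0.5, 2.0
```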
<p>For our optimization problem, we imagine writing the function \(f\) out in the \((G, G_{\perp})\) coordinate system. Then \(L\) is something like the Legendre transform of the \(G_{\perp}\) variables only, although with the sign flipped (which I guess is arbitrary), and in particular the transform happens with regard to the offset variable \(\Delta G_{\perp} = G_{\perp} - G_{\perp}^*\) instead of just \(G_{\perp}\) (which is just to make the notation match up, and shouldn’t make a difference):</p>
\[L(G, \vec{\lambda}) = f(G, G_{\perp}) - \vec{\lambda} \cdot \Delta G_{\perp} \|_{\Delta G_{\perp} = (f_{\Delta G_{\perp}})^{-1}(\vec{\lambda})}\]
<p>Then we switch back to the \(\b{x}\) variables.</p>
\[L(\b{x}, \vec{\lambda}) = f(\b{x}) - \vec{\lambda} \cdot (G_{\perp}(\b{x}) - G_{\perp}^*)\]
<p>This is pretty sketchy (particularly assuming that \(\b{x}\) is effectively \((n-k)\) dimensional), and I’ve glossed over all the boring analytical details of Legendre transforms also. Nor is it really useful for intuition since Legendre transforms are just weird. But I think it is useful to see the connection: that the Lagrangian of Lagrange multipliers is, in a sense, the Legendre transform of the function into some new \(\vec{\lambda}\) variables. Each time you Legendre transform a variable, some new variables show up and some of the old ones become redundant. Hence how the new function \(L\) is in \((\b{x}, \vec{\lambda})\) which looks like \(n + k\) degrees of freedom, but \(k\) of the components in the \(\b{x}\) aren’t actually free to vary.</p>
<p>In particular this might be interesting if you are also looking at the Legendre transforms of analytical mechanics. Mechanics also uses a notion of the Lagrangian which is <em>more or less</em> the same thing as the one from Lagrange multipliers. Constraints on motion enter as literal Lagrange multiplier terms, where the multiplier is the value of the normal force that holds you to the constraint. Meanwhile the exchange of energy between potentials and velocity becomes another constraint that gives the \(L = T - V\) form. (…I think. That’s for another day.)</p>
<p>Anyway, in mechanics one takes the Legendre transform in all of the velocity variables in \(L\) to produce the <a href="https://en.wikipedia.org/wiki/Hamiltonian_mechanics">Hamiltonian</a> (“energy function”) \(H = L_{\dot{x}} \dot{x} - L\) (or in some of the variables, to produce the <a href="https://en.wikipedia.org/wiki/Routhian_mechanics">Routhian</a>, yikes).<sup id="fnref:control" role="doc-noteref"><a href="#fn:control" class="footnote" rel="footnote">8</a></sup> There is a <a href="https://physics.stackexchange.com/questions/790569/derivation-of-hamiltonian-by-constraining-lq-v-t-with-v-dotq">way</a> of looking at the Hamiltonian as what happens when you enforce the constraint that \(\dot{x} = v\): the momentum \(p\) becomes a Lagrange multiplier which tells you how \(L\) changes as you change the value of \(v\) and then require that \(\dot{x} - v = 0\) is true.</p>
<p>I find this perspective helpful: that maybe all the Lagrangian <em>ever</em> was was the Legendre transform of something else. There was in the first place an optimization problem, which in physics was “maximize action”, that is, maximize \(dS = -m \d \tau = - (T \d t - p \cdot d \b{x}) = - T \d t + 2 T \d t = T \d t\). But we had constraints, so we add them in, and each one takes the form of a Legendre transform term. \(F = ma\) as a constraint has multiplier \(\frac{dS}{dF} = d \b{x}\) giving \(- V_x d \b{x}\) which we write as \(- V\). A constraint \(g(\b{x}) = 0\) gives \(- F \cdot g\). And \(\dot{x} = v\) rewrites the whole thing in terms of \(p = L_{\dot{x}}\).</p>
<p>Something like that. I guess there’s a lot more to figure out. Oh well.</p>
<p>In summary:</p>
<ol>
<li>\(L\) is a meaningful function, not just a bookkeeping trick. In particular it is like a linearization of \(f\) but only on \(G\), such that it is forced to obey the constraint.</li>
<li>\(L\) acts like a Legendre transform of \(f\) with respect to the \(G_{\perp}\) variables. We can think of space as being parameterized by \((G, G_{\perp})\) and then using \(L\) we write \(f\) as a function of \((G, \frac{df}{dG_{\perp}})\) instead.</li>
</ol>
<hr />
<h1 id="5-the-meaning-of-veclambda">5. The Meaning of \(\vec{\lambda}\)</h1>
<p>In practice, the values of \(\lambda^* = \frac{df^*}{dG_{\perp}^*}\) are meaningful and interesting. After all, these constraint problems are about imposing some condition on a system and then trying to optimize it. Well, the derivatives \(\frac{df}{dG_{\perp}}\) tell you how the system’s value changes when you vary the constraint values, and the optimum value \(\lambda^* = \frac{df^*}{dG_{\perp}^*}\) tells you how it will happen at the maximum.</p>
<p>Here are some examples:</p>
<p>In economics, if the function being maximized is something like “profit” and the constraint is some condition on the business that forces \(g = g^*\), then the multiplier \(\lambda^*\) tells you how much more profit you will get if you can change that condition, and the highest multiplier tells you which change gives the highest marginal returns. (Wikipedia tells me that \(\lambda\) is called the <a href="https://en.wikipedia.org/wiki/Marginal_cost">marginal cost</a> or <a href="https://en.wikipedia.org/wiki/Shadow_price">shadow price</a>.)</p>
<p>In physics, if a system is constrained to follow a certain surface in space given by \(g = g^*\), then the Lagrange multiplier for that system is the normal force which holds the system to that surface. Also there is a sense, alluded to above but which I’ll have to work out some other day, in which \(L = T - V\) is the Lagrange multiplier form of \(F = ma\): you can move in the potential, but there’s a cost, and the multiplier is a force \(F\).</p>
<p>Also in physics, if two systems are in thermodynamic equilibrium, then they are modeled as maximizing entropy subject to their total energy being constant. You find that the Lagrange multiplier is \(\frac{dS}{dE} = \frac{1}{T}\), that is, inverse temperature is the multiplier for energy. Many of the other variables which characterize various types of exchanges of entropy between two systems can be modeled as multipliers: pressure \(P\) is the multiplier for changes in volume \(V\), chemical potential is the multiplier for changes in particle number; presumably there are many others.</p>
<p>In each case the multiplier acts like a “force” or “pressure” or “marginal utility”. It tells you how the maximum value changes when you change some parameter of a system. There’s a very general way of looking at it:</p>
<ol>
<li>Suppose there is a system that optimizes some quantity. For thermodynamics, it’s \(S\), the entropy.</li>
<li>If you split the system in two parts that exchange some conserved quantity \(E = E_1 + E_2\), then the individual systems still maximize \(S = S_1 + S_2\). They do <em>not</em> each maximize their own \(S_i\), because a subsystem should give away some \(S\) if doing so makes the other system’s \(S\) go up by more.</li>
<li>Therefore at their maximum, there’s a balance between the two values of \(\frac{dS}{dE}\), and the two systems end up maximizing \(S_i - \frac{dS_i}{dE_i} (E_i)\) instead. The quantity \(\lambda = \frac{dS_i}{dE_i}\) is the Lagrange multiplier. For thermodynamics it turns out to be the inverse temperature, \(\frac{1}{T}\), so the maximized value for each subsystem is \(S - \frac{1}{T} E\) or \(TS - E\) instead (the negative <a href="https://en.wikipedia.org/wiki/Helmholtz_free_energy">Helmholtz free energy</a>).</li>
<li>When the multiplier \(\lambda\) is treated as a free variable, then you are leaving the rate of exchange with other systems open to vary: for instance, you solve for a system which is maximizing \(S\) while held to some temperature \(T\). Probably it is connected to a much-larger reservoir that can take or give it energy if its value of \(T\) ever changes. As far as the system is concerned, its job is to maximize \(TS - E\) for fixed \(T\) and variable \(E\).</li>
<li>Something like this holds for <em>any</em> number of systems that can exchange <em>any</em> quantity while optimizing for something else. In thermo there are versions of this for every choice of quantities that you might fix versus vary; for instance <a href="https://en.wikipedia.org/wiki/Gibbs_free_energy">Gibbs Free Energy</a> is the maximized quantity if temperature and pressure are fixed but volume and energy are allowed to vary.</li>
</ol>
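<p>The equilibration in step 3 can be seen in a toy model (all the specific numbers and entropy functions here are my own assumptions, chosen only for illustration): give two subsystems entropies \(S_i(E_i) = C_i \ln E_i\), fix the total energy, and maximize; the multipliers \(dS_i/dE_i\) come out equal.</p>

```python
import math

# Toy model (assumptions mine): two subsystems with entropies
# S_i(E_i) = C_i * ln(E_i) share a fixed total energy E. Maximizing
# total entropy over the split should equalize the multipliers
# dS_i/dE_i (the inverse temperatures 1/T_i).
C1, C2, E = 2.0, 3.0, 10.0

def total_S(E1):
    return C1 * math.log(E1) + C2 * math.log(E - E1)

# Crude grid search for the entropy-maximizing split.
E1 = max((i * 0.0001 for i in range(1, 100000)), key=total_S)

beta1 = C1 / E1          # dS1/dE1
beta2 = C2 / (E - E1)    # dS2/dE2
print(E1, beta1, beta2)  # E1 ~ 4, and the two multipliers agree (~0.5)
```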
<p>So for any coupled set of systems that optimize a variable \(S\) there is some quantity \(\lambda\) which measures the “marginal value of giving up a unit of \(S\) to someone else”. If another system can get more optimum out of a quantity such as a unit of energy, the system we’re looking at would rather give up a unit to the other system than use it itself. The only way the whole system ends up at a maximum is if everyone’s values of \(\lambda\) equilibrate: otherwise, there is still an improvement to be made.</p>
<p>This argument is <em>super</em> general. Even in the abstract optimization problem that we’ve been considering, it’s there: \(\lambda\) tells us how the system <em>could</em> change, were we allowed to change the value of \(G_{\perp}^*\). Give me another system to get values of \(G_{\perp}^*\) from and I can use it to get more value out of \(f\). That is why we care about the value of \(\lambda\) even though it gets cancelled out in \(L\): because in every situation in practice, there <em>are</em> externalities and constraints that might be changed, and \(\lambda\) tells us how it works.</p>
<p>Suppose you are a strangely calculating, rational person, and you’re trying to make a decision in your life, like picking a job and place to live that optimizes your happiness \(H\). Maybe your job could be improved and your housing could be improved. Well, there is a multiplier: if taking a worse house makes your job better by <em>more</em> than that, you should switch houses to get the better job. Or vice-versa. Or, since you know the value of \(\lambda = \frac{dH}{dC}\), the return on happiness from making those changes, you should look for other places to make changes that get a better return: change your relationships, get a new hobby, start a rebellion, whatever it takes. Maybe you get a certain negative return on happiness from even <em>thinking</em> about the problem for a time \(t\) due to some rate of stress \(S = \frac{dH}{dt}\). Now you shouldn’t be thinking about optimizing \(H\) at all, but rather \(H - S t\): make your decisions, but don’t spend your whole life stressing about them either because you’ll eat into the reward that way.</p>
<p>That’s right. Lagrange Multipliers are everywhere. They’re basically a moral philosophy, hiding under a not-very-compelling pile of poorly-understood calculus hacks. Sheesh.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:covariant" role="doc-endnote">
<p>There are various versions of the covariant derivative at different levels of abstraction. This is the simple one from classical differential geometry. The more well-known one is from Riemannian geometry, which is used when the surface isn’t floating in a larger space like this. It’s used for general relativity. There’s another one for tensor bundles which is used when you’ve attached some other function \(A\) to every point in space and want to ask how \(A\) is shaped without actually being able to see it. That one is widely used in quantum mechanics. They are both a lot more difficult to think about than this one. <a href="#fnref:covariant" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:zero" role="doc-endnote">
<p>Note that if \(\del g\) happened to be \(0\), meaning that \(g(\b{x})\) is locally constant, then \(\del f\) has to equal \(0\) as well: it once again has \(n\) constraints instead of \(n-1\), and there’s no free parameter; it has to be an actual stationary point of \(\del f\) on its own. We’re not dealing with this case, as we’ve assumed that \(\del g \neq 0\). But it’s worth thinking about. If it was zero anyway then \(G\) would switch from being \((n-1)\)-dimensional to \((n)\)-dimensional, so a volume instead of a plane in \(\bb{R}^3\). <a href="#fnref:zero" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gen" role="doc-endnote">
<p>I use “generalized inverse” to refer to the preimage of an operation, but written as an algebraic object that includes free parameters as necessary to be correct. For instance the generalized inverse of \(0 \times a = 0\) is \(a = \lambda\), meaning any real number. See also <a href="/2023/09/25/inverses.html">this other post</a>. <a href="#fnref:gen" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:gs" role="doc-endnote">
<p>Incidentally the projections \((\del g_1, \del g_2 - \proj_{\del g_1} \del g_2, \ldots)\) here would, after you normalize them, form a <a href="https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process">Gram-Schmidt basis</a> for \((\del g_1, \del g_2)\). But that’s not the best way to do it. <a href="#fnref:gs" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:indices" role="doc-endnote">
<p>To be precise we’d want to notate which indices of the matrices are contracted with each other, but let’s not. Anyway there is really only one sensible way to do it. <a href="#fnref:indices" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:confuse" role="doc-endnote">
<p>I am far from the <a href="https://math.stackexchange.com/questions/1392280/lagrange-multiplier-method-why-is-the-langragian-function-defined-as-fx-y-l">only one</a> confused by this, and as often happens, all the repliers on that question are confused by the fact that the questioner is confused about it. But I agree with the questioner: “noticing” you can write \(L = f - \lambda g\) and optimize \(\del L = 0\) instead is just weird. There needs to be some sort of elegant reason for why that works. <a href="#fnref:confuse" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:frame" role="doc-endnote">
<p>The matrices \(d G\) and \(d G_{\perp}\) can be thought of as <a href="https://en.wikipedia.org/wiki/Frame_of_a_vector_space">frames</a>, which are like bags of arbitrary numbers of vectors, and their pseudoinverses \(\frac{1}{d G}\) and \(\frac{1}{d G_{\perp}}\) are their dual frames. <a href="#fnref:frame" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:control" role="doc-endnote">
<p>There is also a whole separate version of this in <a href="https://en.wikipedia.org/wiki/Hamiltonian_(control_theory)">control theory</a>. <a href="#fnref:control" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Delta Functions via Inverse Differentials2024-03-12T00:00:00+00:00https://alexkritchevsky.com/2024/03/12/indicators<p>Here’s a dubious idea I had while playing with delta functions as a way to perform surface integrals. Also includes a bunch of cool tricks with delta functions, plus some exterior algebra tricks that I’m about 70% sure about. Please do not expect anything approaching rigor.</p>
<!--more-->
<hr />
<p>But first, some notations.</p>
<h3 id="notations">Notations</h3>
<p>Our equations will involve lots of distributions, particularly \(\delta\) (the Dirac delta function), \(\theta\) (the Heaviside step function), and \(I\) (the indicator function). These get quite verbose if you write them all out with their arguments like \(\delta(x-a)\), so I will be using some shorthands to make things easier to read:</p>
<p>We will omit the arguments for functions and distributions when the meaning is clear from context: \(f\) will be written instead of \(f(x)\) when it is obvious that the argument is the variable \(x\).</p>
\[f(x) \equiv f\]
<p>(I will be using \(\equiv\) to mean that something is defined as being equal to something else, as opposed to \(=\) which means they are algebraically equal.)</p>
<p>We’d often like to also be able to omit an argument when it’s of the form \((x-a)\) as well, since this case is very common for distributions. We do this by moving the \(a\) into a subscript, writing \(\delta_a\) for \(\delta(x-a)\). In practice we’ll sometimes write this as \(\delta_{(a)}\) instead, to make it clear that \(a\) refers to a point.</p>
\[\begin{aligned}
\delta(x-a) &\equiv \delta_a \equiv \delta_{(a)} \\
\theta(x-a) &\equiv \theta_a \equiv \theta_{(a)} \\
\end{aligned}\]
<p>An indicator function \(1_{P}(x)\) is equal to \(1\) anywhere that the predicate \(P(x)\) is true, even if we omit the \((x)\). For instance \(1_{x = a}\) is \(1\) if \(x=a\) and \(0\) otherwise. We’ll also omit the \(x\) in the predicate and just write this as \(1_a\) to mean the indicator for the point \(x=a\). We also generalize this and allow the subscript to be other types of surfaces: an interval \((a,b)\), or a generic surface \(A\).</p>
\[\begin{aligned}
1_{(a,b)} &\equiv 1_{x \in (a,b)} \\
1_a &\equiv 1_{(a)} \equiv 1_{x = a} \\
1_A &\equiv 1_{x \in A}
\end{aligned}\]
<p>The generic surface is written with a capital letter to distinguish it from a point. We’ll also write integrals over an interval or a surface with capital letter when the details of the surface don’t matter, like \(\int_I \d f\). Then the result of the integral is to evaluate \(f\) on the boundary of \(I\), which is written \(\p I\). If \(I\) is the interval \((a,b)\) then \(\p I\) is the pair of points \((b) - (a)\).</p>
\[\int_I \d f = f \|_{\p I}\]
<p>When we use the interval notation \(1_{(a,b)} \equiv 1_{x \in (a,b)}\), it is convenient to define \(1_{(b,a)} = -1_{(a,b)}\). Therefore we would really like a sort of “oriented indicator function” instead which can take the value \(\pm 1\) depending on whether \(a<b\) or \(a > b\). Bolding this because it’s important: <strong>in this article an indicator function over a range is understood to be an oriented indicator function</strong>, rather than the usual indicator of mathematics which always has the value \(1\).</p>
\[1_{(a,b)} \equiv
\begin{cases}
+1_{x \in (a,b)} & a < b \\
-1_{x \in (b,a)} & a > b
\end{cases}\]
<p>We can take the subscript notation further by also allowing a subscript to contain a linear combination of surfaces, like \(1_{A + B} = 1_A + 1_B\). In every case the linear combination simply distributes over the function, so \(\delta_{A+B-C} = \delta_A + \delta_B - \delta_C\), etc. When we write linear combinations of points, we write the points with parentheses, as \((a) + (b)\), so that it doesn’t look like we’re adding the points as vectors \(a + b\).</p>
\[\begin{aligned}
1_{(a) + (b)} &\equiv 1_a + 1_b \\
1_{(b) - (a)} &\equiv 1_b - 1_a \\
\delta_{(b) - (a)} &\equiv \delta_b - \delta_a \\
\theta_{(b) - (a)} &\equiv \theta_b - \theta_a \\
\end{aligned}\]
<p>It’s occasionally nice to also permit this notation for integrals. If \(A\) and \(B\) are two surfaces, then</p>
\[\int_{A - B} df \equiv \int_A \d f - \int_B \d f\]
<p>Finally, know that I am going to be using the words “function” and “distribution” interchangeably without concern for their usual technical differences.</p>
<hr />
<h1 id="1-integrating-with-distributions">1. Integrating with Distributions</h1>
<p>Presumably you are aware that a delta function can be “used” to evaluate a function at a point:</p>
\[\int_{\bb{R}} \delta_a f \d x \equiv \int_{\bb{R}} \delta(x-a) f(x) \d x = f(a)\]
<p>It turns out that a lot of other operations can be converted into integrals against distributions. For instance, an integral over a finite range \((a,b)\) can be written as integration against the indicator function for that range:</p>
\[\int_a^b f' \d x = \int_{\bb{R}} 1_{(a,b)} f' \d x = f(b) - f(a)\]
<p>And we can write an (oriented) indicator function over a range as the sum of two step functions, \(1_{(a,b)} = \theta_a - \theta_b\). Integrating against either form gives the same result:</p>
\[\begin{aligned}
\int_{\bb{R}} 1_{(a,b)} f' \d x &= \int (\theta_a - \theta_b) f' \d x \\
&= \int_a^\infty f' \d x - \int_b^\infty f' \d x \\
&= [\int_a^b f' \d x + \cancel{\int_b^\infty f' \d x}] - \cancel{\int_b^\infty f' \d x}\\
&= \int_a^b f' \d x \\
&= f(b) - f(a)
\end{aligned}\]
<p>(This was the reason for using oriented indicator functions: defining \(1_{(a,b)} = -1_{(b,a)}\) makes it consistent with \(\theta_a - \theta_b = -(\theta_b - \theta_a)\).)</p>
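<p>This is easy to verify with a crude Riemann sum (the specific \(f\) and endpoints are my own): integrating \(f' = \cos\) against \(\theta_a - \theta_b\) over the whole line recovers \(f(b) - f(a)\).</p>

```python
import math

# Riemann-sum sanity check (numbers are mine): integrating f' against
# the oriented indicator 1_(a,b), written as theta_a - theta_b,
# recovers f(b) - f(a).
a, b = 0.5, 2.0
f, fprime = math.sin, math.cos

N, lo, hi = 200000, -10.0, 10.0
dx = (hi - lo) / N
total = 0.0
for i in range(N):
    x = lo + (i + 0.5) * dx                      # midpoint rule
    step = (1.0 if x > a else 0.0) - (1.0 if x > b else 0.0)  # theta_a - theta_b
    total += step * fprime(x) * dx

print(total, f(b) - f(a))  # both ~ sin(2) - sin(0.5) = 0.4299...
```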
<p>Another way of getting the result is by moving the derivative over onto the indicator function with integration-by-parts. Recall that \(\int_I u v' \d x = (u v) \|_{\p I} - \int_I u' v \d x\). Here \(I = \bb{R}\), so \(\p I = (+\infty) - (-\infty)\), and the boundary terms vanish because \(1_{(a,b)}\) is \(0\) at \(\pm \infty\).</p>
\[\begin{aligned}
\int_{\bb{R}} 1_{(a,b)} [\p_x f] \d x &= \cancel{1_{(a,b)} f \|_{-\infty}^{+\infty}} - \int_{\bb{R}} [\p_x 1_{(a,b)}] f \d x \\
&=\int_{\bb{R}} [(-\p_x) (\theta_a - \theta_b)] f \d x \\
&= \int_{\bb{R}} (\delta_b - \delta_a) f \d x \\
&= f(b) - f(a) \\
\end{aligned}\]
<p>The \(f' = \p_x f\) passes its derivative over to \((-\p_x) 1_{(a,b)}\). The result is a pair of delta functions defined on the boundary of the underlying oriented surface \(\p(a,b) = (b) - (a)\). Note that the order of \(b\) and \(a\) switch because of the negative sign from the integration-by-parts: \((-\p)(\theta_a - \theta_b) = \delta_b - \delta_a\).<sup id="fnref:boundary" role="doc-noteref"><a href="#fn:boundary" class="footnote" rel="footnote">1</a></sup></p>
<p>This is all pretty neat to me. It’s intuitive if you think about it, but I didn’t notice it for a long time after I learned about \(\delta\). It turns out that a lot of properties of integration can be moved into the integrand itself by writing them as distributions. We’ll be doing a lot more of that in a minute.</p>
<p>First, though, consider how the resulting object \(\delta_{(b) - (a)}\) looks like a description of the boundary \((b) - (a)\), which makes \((-\p_x )\) look like an expression of the boundary operator, implemented on the distribution representation of \((a,b)\) that is given by \(1_{(a,b)}\). But it’s not <em>quite</em> right. The appropriate description of the boundary \(\p(a,b) = (b) - (a)\) should probably also be an indicator like \(1_{(b) - (a)}\), which is <em>not</em> the same thing as \(\delta_{(b) - (a)}\). Indicators have value \(1\) at their nonzero points, while \(\delta_{(b) - (a)}\) has value \(\infty\)-ish (delta functions do not really “have” a nonzero value but you know what I mean). Meanwhile, \(\delta_{(b) - (a)}\) is something you can actually integrate against, while \(1_{(b) - (a)}\) has measure zero, so integrating against it would give zero:</p>
\[\int 1_{(b) - (a)} f \d x \stackrel{!}{=} 0\]
<p>So what’s going on? What <em>is</em> the relationship between these two objects?</p>
\[\begin{aligned}
1_{\p(a,b)} &= 1_{(b) - (a)} \\
&\stackrel{?}{\equiv} \\
(-\p) 1_{(a,b)} &= \delta_b - \delta_a
\end{aligned}\]
<p>Or even these two?</p>
\[1_{(a)} \stackrel{?}{\equiv} \delta_a\]
<p>What is the difference between an indicator for a point and a delta function for a point?</p>
<hr />
<h1 id="2-integration-with-inverse-differentials">2. Integration with Inverse Differentials</h1>
<p>I think the most intuitive answer is that the delta function may be regarded as an indicator function divided by the absolute value of a differential:</p>
\[\boxed{\delta_a \? \frac{1_a}{\| d x \|}}\]
<p>The \(1_{(a)}\) is a normal indicator function. The object \(\| dx \|\) is basically “the magnitude of \(dx\)”. Unlike \(dx\), it is always positive when evaluated on a tangent vector. And for some strange reason it is in a denominator, which we will have to get used to.</p>
<p>(I first heard this idea in a book called “Burn Math Class” by Jason Wilkes, although his version omits the absolute value which I am pretty sure has to be there. Then I forgot about it for a while, before reinventing it and then thinking I had come up with it myself. Oops. Anyway, maybe it’s a bad idea for a reason I haven’t thought of? Or maybe it is already a thing somewhere and I just haven’t come across it? Dunno. But the reason I’ve written an article about it is that, the more I play with it, the more sense it keeps making to me.)</p>
<p>The basic idea is that splitting up a delta function into two pieces like this allows those pieces to be used in algebra with some very natural-looking rules. But it takes some getting used to. This is how it works in an integral:</p>
\[\begin{aligned}
\int_I \delta_a f(x) \d x &= \int_I \frac{1_a}{\| d x \|} f(x) \d x \\
&= \int_I 1_a f(x) \frac{d x}{\| d x \|} \\
&= f(a) \sgn(I) \\
\end{aligned}\]
<p>It is convenient to define</p>
\[\widehat{dx} \equiv \frac{dx}{\| dx \|}\]
<p>Which is meant to be the “unit vector” version of \(dx\), akin to \(\hat{x} = \frac{\vec{x}}{\| \vec{x} \|}\) for regular vectors. If you are used to thinking of differential forms like \(\d x\) as functions from vectors to \(\bb{R}\), then its behavior is \(\widehat{dx}(\b{v}) = \frac{dx(\b{v})}{\| dx(\b{v}) \|}\). If you think of \(\d x\) in the Riemann-integral sense, as an infinitesimal interval \(x_{i+1} - x_i\), then \(\widehat{dx}\) is the <em>sign</em> of that interval \(\sgn(x_{i+1} - x_i)\) without its magnitude.</p>
<p>So we have</p>
\[\begin{aligned}
\int_I \delta_a f(x) \d x &= \int_I 1_a f(x) \widehat{dx} \\
&= f(a) \sgn(I) \\
\end{aligned}\]
<p>The idea is that \(\frac{dx}{\| dx \|}\) cancels out the magnitude of \(dx\), leaving only a “unit differential” \(\widehat{dx}\). We claim, because it seems to make sense, that the integral of a unit differential is trivial: it is simply \(\pm 1\) depending on the orientation of the range of integration. The resulting \(\sgn(I)\) is determined by whether the range of integration \(I\) was over a positively-oriented range, such as \((-\infty, \infty)\), versus a negatively oriented range like \((\infty, -\infty)\) (we assume that \(a \in I\) though).<sup id="fnref:sign" role="doc-noteref"><a href="#fn:sign" class="footnote" rel="footnote">2</a></sup> Typically we just assume that all 1d integrals are over positively-oriented ranges unless otherwise specified, in which case we could simply omit the sign and write \(f(a)\), but I’m trying to be careful now because it will matter more in higher dimensions.</p>
<p>The unit differential is necessary for this object to act like a delta function. Because an integral against a delta function gives \(\int_I \delta(x) \d x = \pm 1\) depending on the orientation of \(I\), we cannot fully cancel out the value of \(dx\); we have to keep its sign. So we needed to invent something which cancels out its magnitude but leaves the direction.</p>
<p>The actual integration step is supposed to be easy once the integrand is proportional to \(1_a \widehat{dx}\). The \(1_a\) reduces the integral to a single point, while the \(\widehat{dx}\) integrates out to give the sign of the integration range at that point. I guess we just trust that that is how it works:</p>
\[\int_{\bb{R}} 1_a f(x) \widehat{dx} = f(a)\]
<p>…but here’s some pseudo-theoretical justification anyway.</p>
<p>Often we implement integration as the limit of a Riemann sum, which decomposes the integration range into a bunch of oriented cells, each of which is described by a tangent vector \(\b{v}_i\) (which in 1d is often simplified to \(x_{i+1} - x_i\)). Then we evaluate \(f \d x\) on each of those tangent vectors and add up the result. In the limit this converges (for some well-behaved class of functions) to the definite value for the integral. We write this as \(\int_I f \d x = \lim \sum_{i \in I} f(x) \d x (\b{v}_i)\), where the limit takes the number of partitions to infinity.</p>
<p>In our scheme \(\| dx \|\) is an object that has \(\| d x \| (\b{v}) = \| d x(\b{v}) \|\) (similar to the integration measure in an arc-length integral), and \(\widehat{dx}\) is the object that has \(\widehat{dx} (\b{v}) =\frac{d \b{x} (\b{v})}{\| d \b{x} (\b{v}) \|}\), which in \(\bb{R}^1\) is simply \(\sgn (dx(\b{v}))\). In higher dimensions it will include a direction, but in \(\bb{R}^1\) there are only two possible directions, corresponding to \(\pm 1\).</p>
<p>Normally what allows the summation’s limit to converge to the integral value is that \(dx(\b{v}_i) \propto \| \b{v}_i \|\), so as the integration partitions’ size goes to zero with their total magnitude bounded by the length of the range, the sum of \(dx(\b{v}_i)\) is held proportional to that length. When using \(\widehat{dx}\) the value is \(\pm 1\), so obviously we can’t add up a bunch of these. Instead the only reason the integral “converges” is that the indicator \(1_a\) has limited the range to a single point, or a sum of a finite number of points, instead.</p>
<p>…probably, assuming I haven’t missed anything. In any case I find it intuitive: each point in the indicator \(1_{(a)}\) selects a point at which the integrand is evaluated, and then at that point the resulting contribution to the integral is \(\widehat{dx}\) times the orientation of the range at that point, giving \(f(a)\).</p>
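<p>This oriented-Riemann-sum picture is easy to check numerically. Here’s a sketch (none of it from the text above; a narrow Gaussian stands in for the delta function): summing over a partition traversed left-to-right gives \(f(a)\), and traversing the same partition right-to-left flips the sign, which is the \(\sgn(I)\) factor.</p>

```python
import numpy as np

def delta_eps(u, eps=0.01):
    # narrow normalized Gaussian standing in for delta(u)
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

a = 1.0

def riemann(x):
    # oriented midpoint Riemann sum of delta_eps(x - a) * cos(x) dx over the
    # partition x, which may run left-to-right or right-to-left
    mid = 0.5 * (x[1:] + x[:-1])
    dx = np.diff(x)
    return np.sum(delta_eps(mid - a) * np.cos(mid) * dx)

forward = riemann(np.linspace(-5, 5, 200001))    # positively-oriented range
backward = riemann(np.linspace(5, -5, 200001))   # same range, opposite orientation

assert abs(forward - np.cos(a)) < 1e-4           # f(a) * sgn(I), sgn = +1
assert abs(backward + np.cos(a)) < 1e-4          # f(a) * sgn(I), sgn = -1
```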
<hr />
<p>This construction is nice because it makes some of the common disclaimers that normally have to be made about \(\delta(x)\) really trivial:</p>
<ol>
<li>You can’t evaluate \(\delta(x)\) outside of an integral for the exact same reason that you can’t evaluate \(f(x) \d x\) outside of an integral: because it uses the symbol \(d x\) whose value comes from the integral. Yet you can do algebra with it, as long as you keep track of the \(dx\)s and \(\| dx \|\)s appropriately.</li>
<li>\(\delta(x)\) doesn’t have a value at \(x=0\) because it depends on an invisible variable, \(1/\| dx \|\). The value is not exactly infinite: it’s “whatever is required to cancel out a \(dx\) and leave only its sign”.</li>
<li>You can’t multiply two delta functions in the same variable by each other, like \(\delta(x) \delta(x) = \frac{1_{x=0} 1_{x=0}}{\| dx \|^2 }\), because the two copies of \(\| dx \|\) aren’t going to cancel out a single \(dx\) in the numerator and will leave an overall factor of \(1/\| dx \|\) that you have no way to integrate.</li>
</ol>
<p>Also, compare this construction to a typical “nascent delta function” construction. Delta functions are often defined as the limit of a sequence of smooth functions whose integrals converge, in the limit, to the behavior of a delta function. Usually the smooth functions are Gaussians, square cutoffs, or some other \(\frac{1}{\e} \eta(x/\e)\) for an integrable \(\eta\) that has \(\int \eta \d x = 1\). But these, I think, are trying to express exactly the idea of \(\frac{1_a}{\| dx \|}\). They want to make something whose (1) integral, in the limit, converges to being nonzero at exactly a single point, and which (2) perfectly cancels out the value of \(dx\) at that point, except for its sign, integrating to \(\pm 1\). Well, why not just write that directly? (Granted, it does not address the main reason you might be using nascent delta constructions, which is that you are demanding things be rigorously constructed out of classical functions for some reason. But I’m not concerned about that.)</p>
<p>Also, it makes \(\delta\)’s change-of-variable rules obvious. For instance \(\delta(-(x-a)) = \delta(x-a)\) is given by</p>
\[\delta(-(x-a)) = \frac{1_{-x=-a}}{\|{-dx} \|} = \frac{1_{x=a}}{\| dx \|} = \frac{1_{x=a}}{\| d(x-a) \|} = \delta(x-a)\]
<p>And \(\delta(ax) = \delta(x)/\|a \|\) is given by</p>
\[\delta(ax) = \frac{1_{ax = 0}}{\| a \d x \|} = \frac{1_{x=0}}{\|a \| \| d x \|} = \frac{\delta(x)}{\| a \|}\]
<p>And in general:</p>
\[\begin{aligned}
\delta(g(x)) &= \frac{1_{g(x) = 0}}{\|d g(x) \|} \\
&= \sum_{x_0 \in g^{-1}(0)} \frac{1_{x_0}}{\| g'(x_0) \d x \| } \\
&= \sum_{x_0 \in g^{-1}(0)} \frac{1_{x_0}}{\| g'(x_0) \| \, \|\d x \|} \\
&= \sum_{x_0 \in g^{-1}(0)} \frac{\delta(x-x_0)}{\| g'(x_0) \| }
\end{aligned}\]
<p>So that’s neat.</p>
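<p>None of these identities require the new notation to verify, of course. Here’s a numerical spot-check of the general change-of-variables formula, using a narrow Gaussian as a stand-in nascent delta (the particular \(g\), \(f\), and grid are my arbitrary choices):</p>

```python
import numpy as np

def delta_eps(u, eps=1e-2):
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

g = lambda x: x**2 - 4        # zeros at x = +/- 2, with |g'(+/- 2)| = 4
f = lambda x: x**2 + x

# Midpoint sum on a grid fine enough to resolve the two narrow peaks
edges = np.linspace(-3, 3, 120001)
mids = 0.5 * (edges[1:] + edges[:-1])
lhs = np.sum(delta_eps(g(mids)) * f(mids)) * (edges[1] - edges[0])

# The identity predicts the sum of f(x0) / |g'(x0)| over the zeros x0 = +/- 2:
rhs = f(2) / 4 + f(-2) / 4    # = 6/4 + 2/4 = 2

assert abs(lhs - rhs) < 1e-3
```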
<p>Anyway, I don’t find the use of an extra \(dx\) in an integrand <em>that</em> strange. Here’s why:</p>
<p>We are very used to integrating integrands of the form \(dF = f(x) \d x\). But in full philosophical generality, an integrand could be written as \(dF = f(x, dx) = F(x + dx) - F(x)\). That’s an object that <em>perfectly</em> expresses the derivative of \(F\), rather than approximates it. It just so happens that in most cases we care about this can be written as a linear function in \(dx\), \(F(x + dx) - F(x) = f(x) \d x\), and then we can do calculus the normal way. But in some cases, such as when dealing with the derivative of a step function \(\theta(x)\), the value of \(F(x + dx) - F(x)\) depends not <em>linearly</em> on \(dx\), but on some other condition, such as whether \(0 \in (x, x + dx)\). In that case you end up with an integrand that is not proportional to \(dx\) but depends on it in some other way, which is how you get identities like \(\theta' = \delta\).</p>
<p>Well, extending that argument: for the general case of \(dF = f(x,dx) = F(x + dx) - F(x)\), there is nothing preventing it from having any kind of weird functional dependence on \(dx\). So why not \(\frac{1}{\| dx \|}\) or something else? Sure, it might be hard to figure out how to integrate something like \(dF = a \d x ^2 + b \d x + c\)… but it is still a reasonable object to think about. And in this case, we do have a way of integrating it; just, it’s an unfamiliar way. Fine with me!</p>
<hr />
<h1 id="3-the-multivariable-case">3. The Multivariable Case</h1>
<p>In higher dimensions this notation gives a lot of results for free, but there is a very important and weird caveat.</p>
<p>At first it seems like a product of two delta functions, which are each an inverse differential, should turn into a product of two inverse differentials:</p>
\[\delta(x) \delta(y) \? \frac{1_{x=0}}{\| dx \|} \frac{1_{y = 0}}{\| dy \|}\]
<p>But this doesn’t work! The problem is, what if we have a product of two delta functions that overlap in direction, like this?</p>
\[\int \delta(x) \delta(x+y) f(x,y) \d x \d y\]
<p>In an integral this should evaluate at the point that satisfies \(x=0\) and \(x+y=0\), meaning that \(x=y=0\) and the result is \(f(0, 0)\). But because \(\| d(x+y) \| = \sqrt{2}\), in the indicator notation we would get \(f(0, 0)/\sqrt{2}\) if we naively divide through by \(\| dx \| \, \| d(x+y) \|\). That doesn’t work. The problem is that the denominator of \(\delta(x) \delta(x + y)\) should cancel out the magnitude of a \(dx \^ d(x+y) = dx \^ dy\) in the numerator. So it is very important that the denominator in this new notation combines as a <em>wedge product</em> of all the terms in the delta functions:</p>
\[\delta(x) \delta(x + y) \stackrel{!}{=} \frac{1_{x=0} 1_{y=0}}{\| dx \^ dy \|}\]
<p>Which means that its behavior in an integral is this:</p>
\[\begin{aligned}
\int_{\bb{R}^2} \delta(x) \delta(x + y) f(x,y) dx \^ dy &= \int_{\bb{R}^2} 1_{x=x+y=0} f(x,y) \, \frac{dx \^ dy}{\| dx \^ dy \|} \\
&=\int_{\bb{R}^2} 1_{x=y=0} f(x,y) \widehat{dx \^ dy} \\
&= f(0, 0) \\
&\neq \int_{\bb{R}^2} 1_{x=y=0} f(x,y) \, \widehat{dx} \^ \widehat{dy} \; \; \text{ (wrong!)}
\end{aligned}\]
<p>Weird, but as far as I can tell necessary? Basically, \(\delta(x) \delta(x+y)\) needs to cancel out the magnitude of \(dx \^ d(x+y) = dx \^ dy\). Since the numerator combines with a wedge product, the denominator has to also. In general, since \(\int \delta(f) \delta(g) \d f \^ d g\) ought to equal \(\pm 1\), the delta functions need to be proportional to \(\frac{1}{\| df \^ dg \|}\), even if \(df\) and \(dg\) are not orthogonal (although they cannot be parallel or we’d end up dividing by zero).</p>
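<p>Numerically this is easy to confirm: a product of two nascent Gaussian deltas in \(x\) and \(x+y\) really does integrate \(f\) to \(f(0,0)\), with no stray \(1/\sqrt{2}\). A sketch (midpoint grid sum; all parameter choices are mine):</p>

```python
import numpy as np

def delta_eps(u, eps=0.05):
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

# Midpoint grid over [-1/2, 1/2]^2; both nascent deltas vanish near the edges
n = 1200
edges = np.linspace(-0.5, 0.5, n + 1)
mids = 0.5 * (edges[1:] + edges[:-1])
X, Y = np.meshgrid(mids, mids, indexing='ij')

f = np.cos(X) * np.cos(Y)
val = np.sum(delta_eps(X) * delta_eps(X + Y) * f) / n**2   # times dx * dy

# The answer is f(0,0) = 1, not f(0,0)/sqrt(2): the denominator of
# delta(x) delta(x+y) is ||dx ^ d(x+y)|| = ||dx ^ dy|| = 1,
# not the naive product ||dx|| ||d(x+y)|| = sqrt(2).
assert abs(val - 1.0) < 0.01
```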
<p>This will take some getting used to. Evidently the denominators are not just scalars: they are actually something like “differential forms” as well. Maybe they are “negative-grade absolute differential forms”? Or maybe the object \(\delta(x) \delta(y)\) should be regarded as \(\delta^2(x,y)\) and therefore its denominator is a compound object \(\| d^2(x,y) \|\) from the start, and factoring it into \(\delta(x) \delta(y)\) only “works” when those terms are orthogonal directions? Or maybe delta functions really act like measures and it’s even more not okay to regard them as functions? Not sure. I really don’t know the best way to explain it.</p>
<p>In case you need more convincing, note that it is well-known (although somewhat hard to find) that the change-of-variables formula for a multivariable delta function with argument \(\b{u}(\b{x}): \bb{R}^n \ra \bb{R}^n\) is</p>
\[\delta(\b{u}(\b{x})) = \frac{\delta(\b{x} - \b{u}^{-1}(0))}{\| \det (\p\b{u} / \p\b{x}) \|}\]
<p>That is, the denominator is the determinant of the Jacobian (hate that name) of \(\b{u}\), \(\p\b{u} / \p\b{x}\), and a determinant is <em>not</em> the product of all the individual magnitudes. That is basically what we’re dealing with here as well, only we’ve factored \(\delta(x, x+y)\) as \(\delta(x) \delta(x+y)\), which makes this combining-with-\(\^\) behavior look more strange.</p>
<p>Anyway, we will have to live with this.</p>
<p>(Hopefully it goes without saying that I’m rather unsure of all this. But whatever, let’s see what happens.)</p>
<hr />
<p>Here’s what happens in an integral:</p>
\[\begin{aligned}
\int_V \delta^3(\b{x} - \b{a}) f(\b{x}) \d^3 \b{x} &= \int_V 1_{\b{a}} f(\b{x}) \frac{d^3 \b{x}}{\| d^3 \b{x} \|} \\
&= \int_V 1_{\b{a}} f(\b{x}) \, \widehat{d^3 \b{x}} \\
&= \sgn(V) f(\b{a})
\end{aligned}\]
<p>The \(\sgn(V)\) comes from whether the integration is performed over a positively- or negatively-oriented volume. (Note that \(d^3 \b{x}\) is just a shorthand for \(dx \^ dy \^ dz\). I prefer to not write this as \(dV\) because it can be useful to reserve \(V\) as the label of a <em>specific</em> volume, like we’ve done here, rather than all of space, since \(V\) may in general be oriented differently than \(d^3 \b{x}\) is.)</p>
<p>We can also integrate a 2d delta function in \(\bb{R}^3\). These turn some, but not all, of the terms in the differential into a unit differential.</p>
\[\begin{aligned}
\int_V \delta(x) \delta(y) f(\b{x}) \d^3 \b{x} &= \int_V 1_{x=y=0} f(\b{x}) \, \frac{dx \^ dy \^ dz}{\| dx \^ dy \|} \\
&= \int_V 1_{x=y=0} f(\b{x}) \, \widehat{dx \^ dy} \^ dz \\
&= \sgn(V_{xy}) \int_{V_z} f(0, 0, z) \, dz\\
\end{aligned}\]
<p>The sign is strange. There’s not really a canonical way to choose it. We need the overall integral when the \(z\) coordinate is completed to have the right sign, but really we could <em>either</em> take out a factor of \(\sgn(V)\) <em>or</em> change the orientation of the \(z\) integral. Consider the simplest case, where \(V\) is the product of three ranges, like \(V = [-\infty, \infty]^{3}\). Then we imagine “factoring” it into two parts, as \(V = V_{xy} \times V_z\), and we imagine that this factorization preserves its orientation. Then it is clear that we can either extract the overall sign of \(V\) in the first integral, or we can extract whatever sign we want for the \(V_{xy}\) integral so long as \(\sgn(V_{xy}) \times \sgn(V_z) = \sgn(V)\). Above I’ve allowed myself to assume that \(V_z\) is positively oriented afterwards, so all of the sign of \(V\) is captured in the \(V_{xy}\), but I admit that this is all pretty sketchy. And of course this will be weird when \(V\) is not a cuboid (that is, a rectangular prism). But it’s a decent mental model anyway.</p>
<p>And here’s a single delta:</p>
\[\begin{aligned}
\int_V \delta(x) f(\b{x}) \d^3 \b{x} &= \int_V 1_{x=0} f(\b{x}) \, \frac{dx \^ dy \^ dz}{\| dx \|} \\
&= \int_V 1_{x=0} f(\b{x}) \, \widehat{dx} \^ dy \^ dz \\
&= \sgn(V_x) \int_{ V_{yz} } f(0, y, z) \d y \^ dz\\
\end{aligned}\]
<p>Same deal with the signs again. There’s not a canonical way to do it; we have to pick the integration bounds of the result such that the overall orientation of \(V_x \times V_{yz}\) matches \(V\).</p>
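<p>For a quick sanity check of this dimension-reducing behavior, SymPy’s <code>DiracDelta</code> can do the \(x\) integral directly (a sketch; SymPy knows nothing about the orientation bookkeeping, so everything here is positively oriented, and the integrand and bounds are my arbitrary choices):</p>

```python
import sympy as sp

x, y, z = sp.symbols('x y z', real=True)
f = sp.cos(x + y) * z

# Integrating out x against delta(x) pins x = 0 and leaves a 2d integrand...
inner = sp.integrate(sp.DiracDelta(x) * f, (x, -sp.oo, sp.oo))
assert inner == sp.cos(y) * z

# ...so the triple integral reduces to the double integral of f(0, y, z):
lhs = sp.integrate(inner, (y, 0, 1), (z, 0, 1))
rhs = sp.integrate(f.subs(x, 0), (y, 0, 1), (z, 0, 1))
assert sp.simplify(lhs - rhs) == 0
```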
<hr />
<h1 id="4-implicit-surfaces">4. Implicit Surfaces</h1>
<p>This gets more interesting when we deal with delta functions of generic surfaces.</p>
<p>A single delta composed with a function, \(\delta(g(\b{x}))\), turns a volume integral into an integral over a 2d implicit surface, the level set \(g(\b{x}) = 0\). We assume that \(g\) defines a regular surface, so \(\| dg \| \neq 0\) anywhere.</p>
\[\begin{aligned}
\int_V \delta(g(\b{x})) f(\b{x}) \d^3 \b{x} &= \int_V \frac{1_{g(\b{x}) = 0}}{\| d g(\b{x}) \|} f(\b{x}) \d^3 \b{x} \\
\end{aligned}\]
<p>The easiest way to solve this is going to be if we can write the numerator as \(d^3 \b{x} = dg \^ d^2 \b{w}\), where \(\b{w} = (w_1, w_2)\) is a pair of coordinates on the level set \(g^{-1}(0)\). But in general we don’t have these coordinates. What can we do?</p>
<p>Well, we can cheat a bit. We know from exterior algebra that</p>
\[\star dg = dg \cdot d^3 \b{x}\]
<p>And, defining \(\Vert dg \Vert = \| \del g \|\) as the <em>actual</em> magnitude of a differential (that is, the scalar value, not a weird type of differential form):</p>
\[dg \^ \star dg = \Vert dg \Vert^2 d^3 \b{x} = \| \del g \|^2 d^3 \b{x}\]
<p>Examples of these: \(\star dx = dy \^ dz\), so \(dx \^ \star dx = d^3 \b{x}\) and \((a \d x) \^ \star (a \d x) = a^2 d^3 \b{x}\).</p>
<p>So we can write</p>
\[\begin{aligned}
\int_V \delta(g(\b{x})) f(\b{x}) \d^3 \b{x} &= \int_V 1_{g = 0} f \, \frac{dg \^ \star dg}{\| \del g \|^2 \| dg \|} \\
&= \int_V 1_{g = 0} f \, \frac{\widehat{dg} \^ \star \widehat{dg}}{\| \del g \|} \\
&= \sgn(V_g) \int_{g^{-1}(0)} f \frac{\star \widehat{dg}}{\| \del g\|}
\end{aligned}\]
<p>Where \(\star \widehat{dg}\) is the two-form which is the Hodge dual of \(\widehat{dg}\).</p>
<p>I have no idea how to do that integral in general, but we can try it out on an easy surface that we know the parameterization for. \(\delta(r-R)\) describes the surface of a sphere in \(\bb{R}^3\). Then \(dr\) is the differential for that surface, and \(\star dr = d \Omega = r^2 \sin \theta \d\theta \^ d \phi\), because \(dr \^ d \Omega = d^3 \b{x}\). Helpfully, \(\Vert dr \Vert = \| \del r \| = 1\) (I had to double-check). Therefore:</p>
\[\begin{aligned}
\int_V \delta(r-R) f \d^3 \b{x} &= \int_V \frac{1_{r=R}}{\| dr \|} f \d r \^ d\Omega \\
&= \int_V 1_{r=R} f \, \widehat{dr} \^ d\Omega \\
&= \sgn(V) \int_{r=R} f(R, \theta, \phi) \d\Omega
\end{aligned}\]
<p>Since the \(\Omega\) coordinates are always oriented in a standard way, I’ve let the overall sign of \(V\) get handled by this one integral. This calculation also works out if we use a different implicit function for the sphere, e.g. \(\delta(r^2 - R^2)\) or \(\delta(\sqrt{r^2 - R^2})\), although keep in mind that \(\delta(r^2 - R^2) = \delta(r - \pm R)/(2 R)\) if you work it out.</p>
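<p>As a spot-check, taking \(f = 1\) should produce the sphere’s surface area \(4\pi R^2\). In spherical coordinates the whole thing collapses to a 1d radial integral, which is easy to verify with a nascent Gaussian delta (a sketch; the grid and \(\e\) are my choices):</p>

```python
import numpy as np

def delta_eps(u, eps=1e-3):
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

R = 2.0
# With f = 1, spherical coordinates collapse the integral over R^3 to 1d:
#   int delta(r - R) d^3x = 4 pi int delta(r - R) r^2 dr = 4 pi R^2
edges = np.linspace(0.0, 4.0, 400001)
r = 0.5 * (edges[1:] + edges[:-1])
area = np.sum(delta_eps(r - R) * 4 * np.pi * r**2) * (edges[1] - edges[0])

assert abs(area - 4 * np.pi * R**2) < 1e-3
```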
<p>We could also have written \(\delta(r-R)\) out in rectilinear coordinates, \(\delta(\sqrt{x^2 + y^2 + z^2} - R)\), with \(dr = (x \d x + y \d y + z \d z)/r\). Then we get the same answer, after a tedious but perhaps useful calculation:</p>
\[\begin{aligned}
\iiint_V \delta(r-R) f \d^3 \b{x} &= \iiint_V 1_{r=R} f \frac{d^3 \b{x}}{\| x \d x + y \d y + z \d z \|/r} \\
&= \iiint_V 1_{r=R} f \frac{dx \^ dy \^ dz}{\| x \d x + y \d y + z \d z \|/r} \\
&= \iiint_V 1_{r=R} f \frac{[x \d x + y \d y + z \d z] \^ [x \d y \^ dz + y \d z \^ dx + z \d x \^ dy]}{r \| x \d x + y \d y + z \d z \|} \\
&= \iiint_V 1_{r=R} f [\widehat{x \d x + y \d y + z \d z}] \^ \frac{x \d y \^ dz + y \d z \^ dx + z \d x \^ dy}{r} \\
&= \sgn(V) \oiint_{r=R} f \; \frac{x \d y \^ dz + y \d z \^ dx + z \d x \^ dy}{R} \\
&= \sgn(V) \oiint_{r=R} f \d \Omega
\end{aligned}\]
<p>(It <a href="https://math.stackexchange.com/questions/3843421/spheres-surface-area-element-using-differential-forms">turns out</a> that \((x \d y \^ dz + y \d z \^ dx + z \d x \^ dy) / R\) does equal \(d \Omega\). I had no idea.)</p>
<hr />
<p>There’s a simple objection to all this, which is: why bother? All of this works without any special formulas for delta functions. When you have an integral \(\int \delta(g(\b{x})) f \d^3 \b{x}\), it was always possible to factor it as \(\int \delta(g(\b{x})) f \frac{dg \^ \star \d g}{\| \del g \|^2} = \int_{g =0} f [\star dg]/\|\del g \|^2\), or to apply a delta identity to \(\delta(g(\b{x}))\) to factor it first.</p>
<p>And, yeah, I suppose that works. I guess I prefer the new version because it boils the somewhat ad-hoc calculus of delta functions down into simpler objects, which better capture “what’s really going on”. But eh, if you don’t like it, that’s fine too. I am just enjoying seeing how it works (although I would be concerned if it led to any false conclusions—but I haven’t found any yet).</p>
<hr />
<p>Okay, what about products of more than one implicit function? Call them \(g_1\) and \(g_2\), to avoid a collision with the integrand \(f\):</p>
\[\begin{aligned}
\int_V \delta(g_1(\b{x})) \delta(g_2(\b{x})) f \d^3 \b{x} &= \int \frac{1_{g_1 = g_2 = 0}}{\| dg_1 \^ dg_2 \|} f \d^3 \b{x} \\
&= \int 1_{g_1 = g_2 = 0} \, f \, \frac{\widehat{dg_1 \^ dg_2} \^ \star(dg_1 \^ dg_2)}{\| \del g_1 \^ \del g_2 \|^2} \\
&= \sgn(V) \int_{g_1 = g_2 = 0} f \frac{\star(dg_1 \^ dg_2)}{\| \del g_1 \^ \del g_2 \|^2}
\end{aligned}\]
<p>The result is over the intersection of the zero level sets of \(g_1\) and \(g_2\), assuming that \(dg_1 \^ dg_2 \neq 0\) everywhere. (Once again I have used the fact that \((dg_1 \^ dg_2) \^ \star(dg_1 \^ dg_2) = \Vert dg_1 \^ dg_2 \Vert^2 d^3 \b{x} = \| \del g_1 \^ \del g_2 \|^2 \d^3 \b{x}\).) The sign term \(\sgn(V)\) assumes that the resulting 1-integral is chosen to be over a positively oriented range.</p>
<p>Well, it is easy enough to produce a differential for the surface (via \(\star (dg_1 \^ dg_2)\) times a normalization factor). But as usual I have no idea how you would actually use it, because in general you will not have any sort of coordinates available for the surface.</p>
<p>The one case where it is easy(ish) to use is when you have enough implicit equations that their intersection is a \(0\)-surface, i.e. a point-set.<sup id="fnref:pm" role="doc-noteref"><a href="#fn:pm" class="footnote" rel="footnote">3</a></sup> In that case you can find the \(0\)-set of the functions \(\{g_1(\b{x}), g_2(\b{x}), \ldots \}\) by whatever algebraic method you like, and then compute the integral that way. <a href="https://math.stackexchange.com/questions/619083/dirac-delta-function-of-non-linear-multivariable-arguments">Here</a> is an example problem (albeit in 2d) that I found on StackExchange:</p>
\[\begin{aligned}
& \int_{\bb{R}^2} \delta(x^2 + y^2 - 4) \delta((x-1)^2 + y^2 - 4) f(x,y) \d x \d y \\
&=\int \frac{1_{x^2 + y^2 - 4 = 0} 1_{(x-1)^2 + y^2 - 4 = 0}}{\| (2 x \d x + 2 y \d y) \^ ((2 x - 2) \d x + 2 y \d y) \|} f(x,y) \d x \^ d y \\
&= \int \frac{1_{(x,y) = (\frac{1}{2}, \pm \frac{\sqrt{15}}{2})}}{\| 4 y \d x \^ \d y \|} f(x,y) \d x \^ d y \\
&= \int \frac{1_{(x,y) = (\frac{1}{2}, \pm \frac{\sqrt{15}}{2})}}{\| 4 y\|} f(x,y) \widehat{\d x \^ d y} \\
&= \frac{1}{2\sqrt{15}} \big( f(\frac{1}{2},\frac{\sqrt{15}}{2}) + f(\frac{1}{2}, -\frac{\sqrt{15}}{2}) \big)
\end{aligned}\]
<p>Which is the right answer. Of course this is not much different from using the well-known delta function identity \(\delta(g(\b{x})) = \sum_{x_0 \in g^{-1}(0)} \frac{\delta(x-x_0)}{\| \del g(x_0) \|}\). But IMO it is at least easier to think about?</p>
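<p>The wedge-product denominator here is exactly the Jacobian determinant of the pair of constraints, which makes the recipe easy to check symbolically. Here’s a sketch in SymPy (the names <code>g1</code>, <code>g2</code> and the test integrand are mine, not from the original problem):</p>

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
g1 = x**2 + y**2 - 4
g2 = (x - 1)**2 + y**2 - 4

# dg1 ^ dg2 = det(Jacobian) dx ^ dy; this is the |4y| in the denominator above
det = sp.Matrix([g1, g2]).jacobian([x, y]).det()
assert sp.expand(det) == 4 * y

# Find the common zeros, then sum f / |det| over them, per the recipe
points = sp.solve([g1, g2], [x, y])   # the two intersection points
f = x + y**2                          # an arbitrary test integrand
result = sum(f.subs({x: px, y: py}) / sp.Abs(det.subs({x: px, y: py}))
             for px, py in points)

# f(1/2, +/- sqrt(15)/2) = 17/4 at both points, so the total is 17/(4*sqrt(15))
assert sp.simplify(result - 17 / (4 * sp.sqrt(15))) == 0
```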
<p>I suppose that the general problem of “finding the solution to systems of arbitrary equations” is a prerequisite to parameterizing them and integrating over them, and that is basically the field of algebraic geometry. So I’ll have to stop there and stick with just wondering about it for now.</p>
<hr />
<h1 id="5-stokes-theorem">5. Stokes’ Theorem</h1>
<p>We can also do Stokes’ Theorem. We’ll do the Divergence Theorem version of Stokes first because it is easiest to think about.</p>
<p>Suppose \(g(\b{x})\) is a well-behaved implicit function which is positive on the interior of a closed region \(V\). Write \(\b{n} = - \frac{\del g}{\| \del g \|}\) for the outward-pointing normal vector of \(V\). We can describe the \(3\)-surface \(V\) by a step function \(\theta(g(\b{x}))\)</p>
\[\theta(g(\b{x})) = \begin{cases} 1 & \b{x} \in V \\
0 & \text{ otherwise}\end{cases}\]
<p>And we can describe the \(2\)-surface \(\p V\) by its negative derivative</p>
\[(-\del) \theta(g(\b{x})) = - (\del g) \delta(g(\b{x})) = \| \del g \| \b{n} \, \delta(g(\b{x}))\]
<p>The divergence theorem says</p>
\[\int_{V} \del \cdot \b{F} \d V = \oint_{\p V} (\b{F} \cdot \b{n}) \d A\]
<p>Where \(\b{F}\) here is a vector field. Its divergence is \(d \b{F} = (\p_x F_{x} + \p_y F_{y} + \p_z F_{z}) d^3 \b{x} = (\del \cdot \b{F}) d^3 \b{x}\).</p>
<p>Then</p>
\[\begin{aligned}
\int_{g > 0} (\del \cdot \b{F}) \d V &= \int_{\bb{R}^3} \theta(g(\b{x})) (\del \cdot \b{F}) \d^3 \b{x} \\
&= \cancel{\int_{\bb{R}^3} \del \cdot (\theta \b{F}) \d^3 \b{x}} - \int [\del \theta(g(\b{x}))] \cdot \b{F} \d^3 \b{x} \\
&= \int \delta(g(\b{x})) [-\del g \cdot \b{F}] \frac{dg \^ \star dg}{\| \del g \|^2} \\
&= \int 1_{g=0} (\b{n} \cdot \b{F}) \frac{dg \^ \star dg}{\| \del g \| \, \| dg \|} \\
&= \int 1_{g=0} (\b{n} \cdot \b{F}) \; \widehat{dg} \^ \star \widehat{dg}\\
&= \oint_{g=0} (\b{n} \cdot \b{F} ) \; {\star \widehat{dg}} \\
\end{aligned}\]
<p>Which (as far as I can tell? this stuff is tricky) should be the correct area element on the surface. As always, not very helpful but I thought it was cool that it works. (I used the fact that integration by parts works with a scalar function times a vector field: \(\int G (\del \cdot \b{F}) \d V = -\int (\del G) \cdot \b{F} \d V\) so long as \(G \b{F}\) is zero at infinity, which it is because \(\theta(g) = 0\) outside of \(V\).)</p>
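<p>Here’s a numerical spot-check of the whole chain on a sphere, with \(g = R - r\) (so \(\b{n} = \hat{r}\) and \(\| \del g \| = 1\)) and \(\b{F} = (x, y, z)\). Both sides should come to \(4 \pi R^3\). (A sketch with a nascent Gaussian delta; the radius and grid are my choices.)</p>

```python
import numpy as np

def delta_eps(u, eps=1e-3):
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

R = 1.5
# Volume side: div F = 3, so the integral over the ball r < R is 4 pi R^3.
volume = 4 * np.pi * R**3

# Surface side: n . F = r and ||grad g|| = 1, so in spherical coordinates
# the boundary integral reduces to 4 pi * int delta(R - r) * r * r^2 dr.
edges = np.linspace(0.0, 3.0, 300001)
r = 0.5 * (edges[1:] + edges[:-1])
surface = np.sum(delta_eps(R - r) * r * 4 * np.pi * r**2) * (edges[1] - edges[0])

assert abs(surface - volume) < 1e-3
```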
<p>That’s the classical version. The exterior calculus version is somewhat more elegant. In this, we treat \(F\) as a bivector field rather than a vector field, and we’re trying to get</p>
\[\int_{g > 0} dF = \int_{g =0} F\]
<p>We can imagine expanding \(F\) in a fictitious \((g, u, v)\) coordinate system that parameterizes the \(g> 0\) region, and regard \(F\) as a bivector field \(F = F_{uv} d u \^ dv + F_{vg} d v \^ dg + F_{gu} d g \^ du\). (If starting from a vector field, this is \(\star F\).<sup id="fnref:spherical" role="doc-noteref"><a href="#fn:spherical" class="footnote" rel="footnote">4</a></sup>) So the divergence is:</p>
\[dF = (\p_g F_{uv} + \p_u F_{vg} + \p_v F_{gu}) (dg \^ du \^ dv)\]
<p>The volume element \(d^3(g, u, v)\) is not necessarily of magnitude \(1\) in the ambient coordinates. Keeping track of all the “types” like this tells us exactly how to change coordinates if we need to.</p>
<p>When we integrate by parts in these coordinates, both the \(\p_u\) and \(\p_v\) derivatives will vanish because \(\theta = \theta(g)\) only. Also, there’s no extra \(\del g\) term because \(\p_g \theta(g) = \delta(g)\). It looks like this:</p>
\[\begin{aligned}
\int_{g > 0} dF &= \int \theta(g) [\p_g F_{uv} + \cancel{\p_u F_{vg} + \p_v F_{gu}}] \d g \^ du \^ dv \\
&= \int (-\p_g) \theta(g) F_{uv} \d g \^ du \^ dv \\
&= - \int \delta(g) F_{uv} \d g \^ du \^ dv \\
&= - \int 1_{g = 0} \, F_{uv} \widehat{\d g} \^ du \^ dv \\
&= \oint_{g = 0} F_{uv} \d u \^ dv \\
&= \oint_{g =0} F
\end{aligned}\]
<p>(Where’d the negative sign go? Well, \(dg\) points into the surface, not out of it, so I removed it when integrating over \(\widehat{dg}\) for consistency with the assumed orientation of \(du \^ dv\).)</p>
<p>Although to be honest I get really lost in some of these exterior calculus computations so I wouldn’t vouch too heavily for this. But I do think this trick of “inventing coordinates for a surface, then writing down delta and step functions for it” is suspiciously powerful.</p>
<p>Incidentally, this type of integration is discussed on the Wikipedia page <a href="https://en.wikipedia.org/wiki/Laplacian_of_the_indicator">Laplacian of the Indicator</a>. It turns out that in some contexts it’s useful to take further derivatives of \(\delta(g)\) to produce \(\delta'(g)\) functions on surfaces.</p>
<hr />
<p>The same basic derivation should work for the other types of Stokes’ theorem, such as \(\int \del \times F \d A = \oint F \d \ell\) and \(\int_C \del F d \ell = \int_{\p C} F\). But I’m running out of steam so I’ll leave that for a later article.</p>
<hr />
<h1 id="6-summary">6. Summary</h1>
<p>My goal was to justify the funny-looking formula \(\delta_a = \frac{1_a}{\| dx \|}\), but I ended up getting somewhat sidetracked playing around with using it to manipulate integrals in 3d. I guess the point is to just show that everywhere I’ve tried to use that notation, it has proven rather natural and intuitive, so long as you remember that funny rule: that the differentials in the denominator combine with the wedge product, and are used to turn differentials in the numerator into “unit” differentials like \(\widehat{dg}\).</p>
<p>No idea if there’s any rigorous basis for any of it, of course. But I’m just glad to know how to produce some of the delta function identities more quickly now.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:boundary" role="doc-endnote">
<p>It seems like the distribution for the range \((a,b)\) is \(\theta_a - \theta_b\), and the distribution for the boundary is \(\delta_b - \delta_a\), created by the \(-\p\) operator, rather than \(+\p\) as you might guess. Why? I think it’s because, for a function like \(\delta_a\) or \(\theta_a\), the point \(a\) actually enters with a negative sign, in \(\theta(x-a)\). So if you wanted to take a derivative “with regard to the point \(a\)”, you would really want the object \(\p_a \theta_a\). It just happens that \(\p_a \theta_a = (-\p_x) \theta_a\), so the negative derivative \(-\p_x\) does the same thing as the positive derivative \(+\p_a\). <a href="#fnref:boundary" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:sign" role="doc-endnote">
<p>A slightly more sophisticated object would be something like \(\sgn_a(I)\) which measures “the sign of \(a\) in \(I\)” (which I have also <a href="/2019/02/23/exterior-6.html">seen written</a> as \(a \diamond I\)). The difference is that this would be \(0\) if \(a \notin I\). But I figure it’s probably not necessary to include that additional complexity here. <a href="#fnref:sign" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:pm" role="doc-endnote">
<p>Aside: it is good to think of the \(\pm\) symbol (or any discrete index, such as those created by \(\sqrt[n]{x}\)) as referring to coordinates on a \(0\)-surface. \(x^2 = 4\) is a “one-dimensional constraint in one dimension”; the resulting surface is zero-dimensional and consists of two points. <a href="#fnref:pm" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spherical" role="doc-endnote">
<p>It is somewhat non-trivial to see how this corresponds to the usual definition of divergence in curved coordinates. The important bit is to note that by writing the vector field as a bivector field we’ve already picked up some extra factors. For instance, in spherical coordinates, we have \(d^3 \b{x} = r^2 \sin \theta \d r \^ d \theta \^ d\phi\), and so \(F_{\theta \phi}\) is given by \(\star F_r (\b{r}) = (r^2 \sin \theta) F_r\). The total radial term ends up being \(\frac{1}{r^2 \sin \theta} \p_r [r^2 \sin \theta F_r] = \frac{1}{r^2} \p_r (r^2 F_r)\). <a href="#fnref:spherical" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
The Case Against Geometric Algebra2024-02-28T00:00:00+00:00https://alexkritchevsky.com/2024/02/28/geometric-algebra<p>Normally I haunt comment sections and leave replies on discussions about Geometric Algebra: “Wait, wait, although GA is clearly onto something it’s not as good as those people say, there’s something wrong with it; what you probably want is the wedge product on its own!” Which is not especially productive and probably a bit unhinged. So, today I’m going to actually make those points in one central place that I can link to instead.</p>
<p>To be clear, I’m not opposed to GA per se. What I (and, I think, a lot of other people) have a problem with is that the subject is pretty clearly flawed… and that the culture around it does not seem to realize this or be interested in addressing those flaws.</p>
<p>In particular, <strong>Hestenes’ Geometric Product is not an intuitive operation that we should be basing all of geometry on</strong>. And GA’s tendency to do so and then say “yes, this is <em>the</em> way that geometry should be done” (with a sort of religious zeal) is problematic and offputting. It’s also just ineffective: treating certain models as if they are somehow canonical and obvious and perfect is wrong, mathematically and socially, and it puts people off right from the start. There probably is a place for the geometric/Clifford product in a grand theory of geometry, but it’s not “front-and-center” like GA treats it today, and as a result the theory is a lot less compelling than it could be.</p>
<!--more-->
<p>If something like GA is to succeed, it will need to be improved. It will need to fix the problems with establishment mathematics better than it does now, in a way that everyone can get behind. Today it helps sometimes, but often misses the mark, and people who can see that are alienated by the lack of self-awareness about this. As a result the movement ends up accumulating a particular type of… zealous person… and loses almost everyone else. Its relationship to mainstream mathematics is correspondingly tenuous: it is considered a kooky, crackpotty sideshow. And it <em>really</em> doesn’t help that a lot of GA proponents write about it in a pseudoreligious way, as if the reason GA is not mainstream is that they are being oppressed by close-minded traditionalism (when in fact GA is just not very compelling at the moment).</p>
<p>Yet GA does find more enthusiasts every year, and it will take those new people a while to realize the things a lot of us have realized, and in the meantime they will go on selling other people on GA and repeating the cycle. As a result GA is, I think, stuck in a sort of perpetual mediocrity. My purpose in writing this is to push it to improve and address those problems.</p>
<p>The rest of this article will be substantiating these points. It’s very long. That’s because… I don’t know, nobody else seems to be complaining about GA even though a lot of people have the same basic reaction to it, so I wanted to sort of provide all of the counterarguments in one place? And also because I got carried away. But I want to emphasize that, although this is my own long and opinionated rant, I am far from the only person who feels this way. Most of the people who roll their eyes at GA do so and then move on and ignore it. I have decided to air my grievances in hopes that it might help with something: because I really <em>do</em> believe in the philosophical project that GA is attempting, just not in how it’s going about it right now.</p>
<hr />
<p>Disclaimer: I’m not a professional mathematician, and I think about math in a different/probably worse way than mathematicians do. This is <em>not</em> going to be the case that a serious mathematician would make, which is probably something like “GA doesn’t say anything new so who cares?”. (It seems like the goals of research mathematics, that is, proving things, are rather divorced from the goals of people who use mathematics for practical purposes.) Also, since I do more-or-less subscribe to the underlying program of GA, I am at least slightly on the crank side of the fence as well. Take me seriously at your own risk.</p>
<hr />
<h1 id="1-a-lot-of-background-on-ga">1. A lot of background on GA</h1>
<p>It will be useful to understand what GA is, relative to the rest of math and physics, and how it got to be the way that it is, in order to establish exactly what it is that we’re disagreeing with here. So first I am going to lay out the rough narrative that I have in my head. It is mostly just stuff I’ve pieced together over the years, and I admit that I don’t really know how to check it against reality; I’d be happy to be corrected on any of this.</p>
<h3 id="what-is-geometric-algebra">What is Geometric Algebra?</h3>
<p><a href="https://en.wikipedia.org/wiki/Geometric_algebra">Geometric Algebra</a> is both a social movement and a branch of mathematics.</p>
<p>As a social movement, it’s a group of people who believe that (a) mathematics writing and pedagogy ought to be reformulated to be more useful, especially to its users who are not research mathematicians, and (b) this reformulation ought to be done in terms of a particular set of primitives. This would mean rewriting multivariable calculus, linear algebra, differential geometry, etc into a new language which is (somewhat) an alternative to vector and matrix algebra. The argument for doing this is that it would make a lot of math simpler, easier to understand, and easier to use for practical purposes.</p>
<p>As a branch of mathematics, it is a recasting of a subject called <a href="https://en.wikipedia.org/wiki/Clifford_algebra">Clifford Algebra</a> (CA), which is a somewhat-obscure descendant of the subject of <a href="https://en.wikipedia.org/wiki/Exterior_algebra">Exterior Algebra</a> (EA), which is sort of like chapter two of linear algebra if undergraduate linear algebra is chapter one. EA is basically well-known to mathematicians and physicists but not commonly taught at the undergraduate level; my impression is that it is gradually becoming ubiquitous at the graduate level. Clifford Algebra is sort of like a more advanced version of EA; it is much less widely known but it is well-established in certain subfields of math and physics. Geometric Algebra attempts to take the ideas of Clifford Algebra and Exterior Algebra and spread them much more broadly, rephrasing other aspects of math in terms of those new concepts and operations. Exactly which ideas and operations those are depends on the author, but everyone pretty much agrees that operations from CA and EA are useful and ought to be more widely used.</p>
<p>Here is how the various algebras relate to each other:</p>
<p>Exterior Algebra is based on the ideas of the “exterior” (or “wedge”) product \(\b{a} \^ \b{b}\), the notion of multivectors in a vector space, and some other natural operations that those imply (the Hodge star \(\star\) and an interior product, which generalizes the dot product to multivectors). It comes up naturally in abstract algebra (the exterior algebra is a basic quotient of the tensor algebra, which is a basic construction on vector spaces), and in various fields downstream of abstract algebra, such as algebraic topology. It also seems to exist in combinatorics literature with a somewhat different set of notations. More generally, EA provides the only actually good way of looking at vector algebra concepts like determinants, matrix minors, and cross products, by modeling them as multivectors in an explicit vector space \(\^^k \bb{R}^n\); this perspective seems to be gradually infiltrating the literature that touches those subjects, but it mostly has not made its way into the undergraduate curriculum yet.</p>
<p>Clifford Algebra is, roughly, an extension of EA which allows it to generalize the complex and quaternion number systems. Clifford algebras are formally constructed as quotients of a tensor algebra, but the procedure is very intuitive. To multiply two multivectors like \((\b{xxy})(\b{xy})\), you are allowed to exchange elements using \(\b{xy} = -\b{yx}\) and to cancel out elements using a bilinear form \(Q\), such that \(\b{xx} = Q(\b{x}, \b{x})\). \(Q(\b{x}, \b{x})\) is almost always defined to be \(1\), \(0\), or \(-1\). The resulting algebra is labeled \(Cl_{p,q}\) or \(Cl_{p,q,r}\) where \(p\) is the number of elements that square to \(1\), \(q\) the number that square to \(-1\), and \(r\) the number that square to \(0\). As a result you get an associative algebra where most elements are invertible (so you can talk about the multiplicative inverse of a vector \(\b{a}^{-1}\)). This allows you to take a bunch of objects, expressed as sums of multivectors, and basically do polynomial algebra on them.</p>
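<p>To make that multiplication procedure concrete, here is a minimal Python sketch (my own throwaway illustration, not any standard GA library’s API) that multiplies basis blades of \(Cl_{p,q,r}\) using exactly the two rules above: swap adjacent factors with a sign flip, and cancel repeated factors against their square.</p>

```python
# A sketch (assumed names, not a standard library) of the Clifford product on
# basis blades of Cl(p,q,r).  A blade is a tuple of basis-vector indices, e.g.
# (0, 0, 1) for xxy; squares[i] is what basis vector i squares to: +1, -1, or 0.

def blade_product(a, b, squares):
    """Multiply basis blades a and b; return (sign, canonical_blade)."""
    factors = list(a + b)  # juxtapose the factors, then normalize
    sign = 1
    i = 0
    while i < len(factors) - 1:
        if factors[i] > factors[i + 1]:
            # Swap out-of-order factors, flipping the sign (xy = -yx).
            factors[i], factors[i + 1] = factors[i + 1], factors[i]
            sign = -sign
            i = max(i - 1, 0)  # re-check the factor that just moved left
        elif factors[i] == factors[i + 1]:
            # Cancel a repeated factor using xx = Q(x, x).
            sign *= squares[factors[i]]
            del factors[i:i + 2]
            i = max(i - 1, 0)
        else:
            i += 1
    return sign, tuple(factors)
```

<p>For instance, in \(Cl_{3,0,0}\) this reduces \((\b{xxy})(\b{xy})\) to \(-\b{x}\): cancel the leading \(\b{xx}\), then \(\b{yxy} = -\b{xyy} = -\b{x}\).</p>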
<p>Physicists probably know EA specifically from its ubiquitous use in General Relativity, where it shows up as the “exterior calculus of differential forms”, which are used to express differential geometry in a coordinate-free way. (For some reason, for the latter half of the 20th century, physics mostly treated EA as <em>just</em> a thing you do with differential forms, instead of a general theory of vector algebra.) Physicists have also usually used Clifford Algebras, because they’re the algebra of the Pauli and Gamma matrices from quantum mechanics, although at least at the undergraduate level it’s rare that physics actually refers to them by name (my quantum course did not even mention that the Pauli matrices were quaternions). Clifford Algebras are also apparently heavily connected with the theoretical frameworks that underlie spinors and more generally representation theory in a way that I don’t quite understand yet.</p>
<p>Point is, GA does not have any claim to the wedge product or even the Clifford/geometric product per se. GA’s main difference is that it attempts to go back to lower-level math—vector algebra, calculus, basic physics like mechanics and electromagnetism—and rephrase them in terms of the Clifford/Geometric product, plus the other concepts of EA. In practice GA refers mostly to <strong>the particular platform and social movement which descends from the work of David Hestenes in the 1960s</strong>. Specifically, it does <em>not</em> refer to the underlying material of Clifford Algebras, but rather how they are used, conceptualized, and taught. Sometimes GA people will defend GA by saying “it’s just Clifford Algebra which is really important in math, how can you have a problem with it?”, and that misses the point. GA is not the material itself; it’s the material plus the ideology and framework that is draped over the material.<sup id="fnref:ga" role="doc-noteref"><a href="#fn:ga" class="footnote" rel="footnote">1</a></sup></p>
<p>Probably both GA and EA could be considered as members of the larger subject of “multilinear algebra” which would include tensor analysis and all of linear algebra as well. There’s an argument to be made that they are really just “the rest of linear algebra”, the big part of it that isn’t included in introductory texts and hasn’t entered the mainstream yet. Perhaps in 100 years they will be fully folded into the standard curriculum.</p>
<hr />
<h3 id="an-approximate-history-of-ga">An Approximate History of GA</h3>
<p>GA is mostly less well-known in mathematics and physics than EA. Yet it has a strangely large number of people advocating for it, particularly online: in communities, articles, videos, and conference talks. Why? Well, here’s a basic history that sort-of explains why it was unknown in the first place, and why it is increasingly not.</p>
<p>GA was more-or-less invented in the late 1950s by <a href="https://en.wikipedia.org/wiki/David_Hestenes">David Hestenes</a>, who made it his project to popularize it for the next fifty years. The underlying ideas date back to Clifford around 1878, who himself was extending Grassmann’s “Extensive Algebra” from the 1840s. (I’m told that Clifford coined the term “geometric product” for the operation.) The story is that in 1959 Hestenes, unsure what original research to do in graduate school, randomly came across some published lecture notes of Marcel Riesz in the UCLA library on the subject of Clifford Algebras, immediately realized the similarity to the gamma matrices of quantum mechanics, and decided that this was the right way to reformulate all of vector algebra.<sup id="fnref:story" role="doc-noteref"><a href="#fn:story" class="footnote" rel="footnote">2</a></sup> Half of his thesis became about the subject, and most of the content was later published in a 1966 book called “Space-Time Algebra” which gradually popularized his ideas quite a bit. Hestenes’ plan was to transform Clifford Algebra, otherwise a fairly niche part of mathematics, into the main language in which all of physics was expressed: that everything should be reformulated in terms of multivectors, wedge products, and especially the geometric product. IMO that switch is what changes the subject of “Clifford Algebra” into the subject of “Geometric Algebra”.</p>
<p>Hestenes slowly popularized his ideas and published more papers on the stuff for a few decades,<sup id="fnref:traction" role="doc-noteref"><a href="#fn:traction" class="footnote" rel="footnote">3</a></sup> but it sounds like things really started to pick up with a conference at the University of Kent which was started by Roy Chisholm in 1985, as well as with the various works of Pertti Lounesto.<sup id="fnref:lounesto" role="doc-noteref"><a href="#fn:lounesto" class="footnote" rel="footnote">4</a></sup> Momentum picked up when a group out of Cambridge (<a href="http://geometry.mrao.cam.ac.uk/">website</a>) got interested around 1988, in particular Anthony Lasenby, Joan Lasenby, Steve Gull, and Chris Doran. The Cambridge group began putting out papers in the 90s with names like “Imaginary Numbers are not Real” and “A unified mathematical language for physics and engineering in the 21st century” which appealed a lot to people who had, for instance, lingering reservations about the use of complex numbers in physics, or about the philosophical interpretations of Pauli matrices and spinors.</p>
<p>Over the years Hestenes and especially the Cambridge group seem to have published papers which reformulate every part of physics in terms of GA. They were refreshing when compared to mainstream physics, which seemed dogmatic and bizarrely willing to accept things that should have still been up for debate. (Of course, that is how fringe ideas always view mainstream ideas.) So the appeal, and the reason it keeps catching more people’s interest, is that it is at least trying to solve things that are legitimately bothersome (to a certain type of person).</p>
<p>By the time I was learning about differential forms in college around ~2010, knowledge of GA was ambiently floating around online and I came across it at some point, and a bunch of fairly accessible books had been published (I’m thinking of Doran/Lasenby’s <em>Geometric Algebra for Physicists</em> in 2003, Fontijne/Mann/Dorst’s <em>Geometric Algebra for Computer Science</em> in 2007, and Macdonald’s <em>Linear and Geometric Algebra</em> in 2010). The ideas of GA tended to show up if you went googling for intuitive explanations about forms, spinors, quaternions, or gamma matrices.</p>
<p>There is a type of person (of which I am one) who will never be happy with the way that complex numbers are used in quantum mechanics,<sup id="fnref:complex" role="doc-noteref"><a href="#fn:complex" class="footnote" rel="footnote">5</a></sup> while everybody else goes on not worrying about it because the theories do, at least, get the right answers. Such people also tend to be displeased with the arbitrariness of the Pauli and Gamma matrices, and all of the index-juggling manipulations of general relativity, and the use of commutators in quantum mechanics and analytic mechanics, and other such “non-geometrical” operations. Most frustrating, though, is the fact that most of the physics literature seems to not regard these things as problems, and that you are just supposed to learn and use them and so you don’t find a lot of sympathy when you want to make sense of them. (Pauli matrices, at least, could be recast in terms of the quaternions \(\bb{H}\), but that doesn’t help, given that quaternions are just a higher-dimensional version of \(\bb{C}\).)</p>
<p>Unfortunately, while the GA folks were clearly onto <em>something</em>, the response has not been all that enthusiastic and the uptake is slow. Why? Well, I think it’s because in practice their reformulations are not all that useful for actually doing or understanding physics. (My opinion, but seriously, go read them.) It turns out that writing everything in terms of the geometric product does not make it easier to understand. The valuable parts, I think, were the parts that were using the <em>wedge product</em> more liberally than physics had before; the usage of the geometric/Clifford product was always quite a bit more suspect (but that’s the point I’m going to make in detail later). As a result a lot of physicists are probably aware of geometric algebra but not a lot of people are publishing papers in that language, and very few people are teaching, like, undergraduate classes about it.</p>
<hr />
<p>Okay, that’s the physics angle. But there is another part of the story in more applied fields: computer graphics and robotics.</p>
<p>Around the same time period (1990s-2010s), a lot of new people were writing code that had to handle rigid motion in space, and invariably they tended to encounter quaternions as well. By the 90s it had been understood for a long time that quaternions were the better way of modeling rotations in 3d space, compared to e.g. Euler Angles, because they treat every axis of rotation equally and allow for smooth interpolation between any two points (and also they avoid “gimbal lock”, as nobody ever neglects to mention). In hindsight, it turns out that when you want to model an unfamiliar noncommutative group like \(SO(3)\), it is very important to use an actual algebraic model of it instead of a poor approximation. But quaternions are a pedagogical nightmare (go look how many Youtube videos there are explaining them!). So when the GA people came along and started talking about bivectors and rotations in a way that actually made some sense, a lot of people were interested.</p>
<p>And these fields involve many computations more complex than simple rotations. Rigid motions of objects, for instance, involve moving a lot of lines, planes, tangent vectors, etc. in space, plus interpolating between their positions smoothly, plus doing all kinds of intersection and sidedness tests, e.g. for culling objects which are occluded or offscreen, plus all of this has to be projectively transformed according to the position of a camera or sensor. Quite a bit of literature on GA has come out of translating these operations into GA terms, and there are a number of vocal proponents of doing it this way: a good chunk of the GA literature you come across is in graphics or robotics papers and conference proceedings.</p>
<p>These applications tend to focus on particular choices of Clifford algebras which are suited to different types of geometric problems. It was already widely understood that projective geometry allowed one to represent rotations and translations in \(\bb{R}^3\) with a single linear operator on \(\bb{R}^4\). Geometric algebra extends this by starting from \(\bb{R}^3\) and adding some number of additional basis vectors that allow modeling various kinds of objects with Clifford-Algebraic operations. These theories attempt to replace the existing quaternion-like formulations of rigid geometry, such as <a href="https://en.wikipedia.org/wiki/Screw_theory">screw theory</a> or <a href="https://en.wikipedia.org/wiki/Dual_quaternion">dual quaternions</a>.</p>
<p>I believe the first of these was Projective Geometric Algebra (PGA), from a paper by Jonathan Selig in 1999 (in a robotics journal). (I’m not sure which ideas are attributable to him, though: he cites Ian Porteous’s 1969 book <em>Topological Geometry</em>, which has a bunch of Clifford-Algebra-based geometric algorithms, although in general it’s written in the much more technical “Clifford Algebra” style instead of the “Geometric Algebra” style; it also mentions that Clifford himself studied biquaternions to implement rigid body motions.) Selig’s PGA uses \(Cl_{0,3,1}\), but it seems like later authors use \(Cl_{3,0,1}\) instead. Conformal Geometric Algebra (CGA) followed shortly after and uses \(Cl_{4,1,0}\) to model the objects, but now also includes circles and rotations as multivectorial objects with the help of an additional basis vector. I don’t know a lot about these, except that the texts on them are strangely difficult to read because they use very unorthodox representations for basic objects like planes and points: for instance in CGA a point \(\b{x}\) is modeled as \(\b{x} + \frac{1}{2} \b{x}^2 (e_{-} + e_{+})\)? Seems weird to me but some people seem to get very excited about it.</p>
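<p>For what it’s worth, there is a standard story behind that weird-looking point representation, at least in the convention I’ve seen (which may differ from the texts in question): the conformal embedding is rigged so that every Euclidean point becomes a <em>null</em> vector, and dot products of embedded points encode squared distances.</p>

```latex
% One common CGA convention (the cited texts may use a different one).
% Extra basis vectors: e_+^2 = +1, e_-^2 = -1, combined into null vectors
%     e_\infty = e_- + e_+, \qquad e_o = \tfrac{1}{2}(e_- - e_+),
% so that e_\infty^2 = e_o^2 = 0 and e_\infty \cdot e_o = -1.
% A point \b{x} \in \bb{R}^3 embeds as
%     P = \b{x} + \tfrac{1}{2} \b{x}^2 e_\infty + e_o.
% Then, since \b{x}, e_\infty, e_o are mutually orthogonal except for
% e_\infty \cdot e_o = -1:
%     P^2 = \b{x}^2 + 2 \cdot \tfrac{1}{2}\b{x}^2 (e_\infty \cdot e_o)
%         = \b{x}^2 - \b{x}^2 = 0,
% and for two embedded points,
%     P \cdot Q = \b{x} \cdot \b{y} - \tfrac{1}{2}\b{x}^2 - \tfrac{1}{2}\b{y}^2
%               = -\tfrac{1}{2} \| \b{x} - \b{y} \|^2.
```

<p>So the odd coefficients are there to make the metric of the embedding space do distance computations for you; whether that is worth the notational cost is exactly the kind of thing this article is skeptical about.</p>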
<hr />
<p>As for pure math—it seems like research mathematics readily talks about and uses “Clifford Algebra”, but is uninterested in or specifically avoids the terms and concepts that are specific to Hestenes’ “Geometric Algebra”. I can speculate as to why: even by the 90s/00s, GA had gotten a bad reputation because of its tendency to attract bad mathematicians and full-on crackpots (Hestenes honestly sounds like one a lot of the time, and I’m not really sure whether he is or isn’t). It makes sense, really. There are a lot of people who found it appealing for the reason I did: because the existing models of vector algebra and quaternion rotation were deeply unsatisfying. But it turns out that those reasons <em>disproportionately</em> attract people who are not actually capable of rigorous mathematics, or are slightly prone to conspiratorial thinking, or are otherwise slightly deranged (also like me? TBD).</p>
<p>I guess there are more people who can tell when math is weird or unsatisfying or bad… than can do good math themselves. So GA ended up appealing to a lot of fringes: people who only had undergraduate degrees, people who had dropped out of PhDs, people with PhDs from unrigorous programs, people who had been good at math but were perhaps going a bit senile, random passersby from engineering or computer programming, run-of-the-mill circle-squarers, people who had a bone to pick with establishment mathematics and felt like all dissenting views were being unfairly suppressed… And these were the people who started publishing a lot of stuff about GA, often dressed up to look like more serious research than it was. Indeed, if you look around for papers that explicitly talk about GA, they <em>very</em> disproportionately (a) are non-theoretical, (b) are poorly-written, (c) are trivial, i.e. restating widely-known results as if they’re novel, (d) only cite other GA papers, and of course (e) are just plain crackpotty.<sup id="fnref:vixra" role="doc-noteref"><a href="#fn:vixra" class="footnote" rel="footnote">6</a></sup></p>
<p>It didn’t help that a lot of the texts by the <em>actually</em>-competent GA people, like the Cambridge group, tended to say things that sounded and still sound kind of crackpotty as well. Like they would constantly say things like “this new theory is going to fix everything”, which is exactly what the crackpots also say, and for the same reasons (the validity of a statement like that is completely conditional on a person’s ability to actually distinguish truth from fiction). Or they were just filled with unnecessary ostentatiousness, such as (quoting here) “we have now reached the part which is liable to cause the greatest intellectual shock”. Or acting like results in GA are new and novel when they’re clearly just using wedge products the same way that physicists had regularly done for decades. Or acting aggrieved that the rest of mathematics is ignoring them. Or, worst of all, claiming that all the new GA equations are simpler than the old ones, while referring to equations which were <em>clearly not simpler than the old ones</em>.</p>
<p>So I suspect that what has happened is that competent mathematicians have tended to distance themselves from the term Geometric Algebra due to its dubious reputation (a sort of adverse selection). Which of course leaves GA with an even higher ratio of cranks, because most of the non-cranks left. In fact I suspect that mathematicians sometimes publish papers about “Clifford Algebra” when they want to talk about the <em>exact same material</em>, not even the super-theoretical version, but without the negative associations. And some of the serious GA-adjacent research on ArXiv is just under the name Clifford Algebra instead.</p>
<p>To be clear, I don’t think it has been wrong to disassociate from the name “Geometric Algebra”. GA’s dubious reputation among mathematicians is <em>well-deserved</em>. I’m doing it too—that’s why all of my posts on related subjects are about exterior algebra instead (well, that and I am not trying to talk about the Clifford/geometric product).</p>
<p>But that doesn’t mean GA isn’t <em>also</em> onto something. It just means that there’s a lot of low-quality stuff under the same label, which has made that label questionable, and if you want to sift through it you have to be ready to filter for quality yourself.</p>
<p>Also, as a result of its popular appeal and fringe status, there are a lot of online discussions dedicated to GA. Actually a shocking number, if you go look for them. Off the top of my head, there’s a website called <a href="https://bivector.net/">Bivector.net</a> which has forums and a Discord that (as of this moment) has 200+ people online, which I guess feels like a lot for a <em>community about a fringe mathematical theory</em>. Plus a few other forums. Plus people show up talking about GA in the comment sections on every other math-related forum if anyone asks any questions about quaternions or bivectors. Plus there are the countless Youtube videos, conference talks, expository PDFs, standalone websites, etc. And then there are whole offshoots of GA, like Conformal Geometric Algebra and Plane-Based Geometric Algebra, that have their own enthusiasts and sometimes their own websites as well. Etc.</p>
<p>This is not really a bad thing either. If anything what it shows is <em>how many</em> people are passionate to see math reformulated in a way that makes more sense—so many that they’ll convene and talk about it on every one of the bizarrely-inadequate social networks we have in 2024. And that’s part of what is motivating me to write this article (which is getting very long now…). GA has got something of the right idea and people recognize that and latch onto it. I happen to think that it is almost certainly right that modern mathematics needs a more intuitive foundation. Research math knows a <em>lot</em> about geometry, but although most of the knowledge required to do all the things people actually want to do with geometry is out there <em>somewhere</em>, it’s not accessible or intuitive and the details are only really available to specialists.<sup id="fnref:reddit" role="doc-noteref"><a href="#fn:reddit" class="footnote" rel="footnote">7</a></sup> At some level GA is trying to “democratize” geometry.</p>
<p>So basically I <em>do</em> agree with them: GA is onto something, which is that geometry deserves more intuitive foundations, and multivectors and the like are probably a big part of it. The problem is that… GA isn’t quite it.</p>
<hr />
<h1 id="2-the-actual-case-against-ga">2. The Actual Case Against GA</h1>
<p>As I wrote above: <em>Exterior Algebra</em> is clearly valuable and widely used already in graduate-level math and physics. <em>Clifford Algebra</em> is clearly widely used in theoretical mathematics and anything that has to do with spinors. So <em>Geometric Algebra</em> ought to be evaluated on what it adds on top of those.</p>
<p>So what does GA specifically say?</p>
<p>As I see it, GA is not so much a subject as an ideological position, consisting of basically two ideological claims about the world:</p>
<ol>
<li><strong>Claim 1</strong>: That the concepts of EA (so, wedge products, multivectors, duality, contraction) are incredibly powerful and ought to be used everywhere, starting at a much lower level of math pedagogy—basically rewriting classical linear algebra and vector calculus.</li>
<li><strong>Claim 2</strong>: That the Geometric Product (henceforth: GP) should be added to that list as the most “fundamental” operation, where by “fundamental” I mean that they would have all of the other operations constructed in terms of it and generally state theorems in terms of it.</li>
</ol>
<p>Claim (1), I believe, is completely correct, and is responsible for much of the reason GA <em>has</em> gotten so popular. Exterior algebra and the general idea of doing geometry with multivectors is incredibly powerful and intuitive and it ought to be widely used and taught to everybody, and we should all be reading and writing new textbooks that incorporate it. It’s so obviously true that I’m not even going to talk about it after this paragraph. <em>Of course</em> \(n\)-vectors make more sense than determinants. <em>Of course</em> differential forms make more sense than nested integrals and mysterious Jacobians. <em>Of course</em> wedge products make more sense than cross products. <em>Of course</em> bivectors make more sense than \(\bb{C}\). <em>Of course</em> we should use multivectors instead of “pseudovectors” and “pseudoscalars”. Why are we even talking about it? Just go rewrite all the books, the theory is (mostly) there.<sup id="fnref:communication" role="doc-noteref"><a href="#fn:communication" class="footnote" rel="footnote">8</a></sup></p>
<p>Claim (2), I believe, is nonsense.</p>
<p>And with that, it’s time to talk about the geometric product.</p>
<hr />
<h3 id="the-geometric-product">The Geometric Product</h3>
<p>The geometric product of two vectors gives a mixed-grade object consisting of a scalar part (their dot product) and a bivector part (their wedge product). GA likes to write this product as juxtaposition:</p>
\[\b{a} \b{b} = \b{a} \cdot \b{b} + \b{a} \^ \b{b}\]
<p>(The general geometric product between two mixed-grade multivectors follows by writing them all out as sums of products of vectors like the above, then cancelling everything out according to \(\b{xx} = 1\) and \(\b{xy} = -\b{yx}\) for all choices of \(\b{x} \neq \b{y}\).)</p>
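<p>This cancellation procedure is mechanical enough to sketch in a few lines of Python (a toy implementation, assuming an all-\(+1\) signature; blades are stored as strictly increasing tuples of basis indices, and the names are mine):</p>

```python
from itertools import product

def blade_mul(a, b):
    """Multiply two basis blades, given as strictly increasing tuples of
    basis-vector indices, using e_i e_i = +1 and e_i e_j = -e_j e_i.
    Returns (sign, blade)."""
    seq, sign, i = list(a) + list(b), 1, 0
    while i < len(seq) - 1:
        if seq[i] > seq[i + 1]:          # anticommute: pick up a minus sign
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
            sign, i = -sign, max(i - 1, 0)
        elif seq[i] == seq[i + 1]:       # e_i e_i = +1: cancel the pair
            del seq[i:i + 2]
            i = max(i - 1, 0)
        else:
            i += 1
    return sign, tuple(seq)

def gp(A, B):
    """Geometric product of multivectors stored as {blade: coefficient}."""
    out = {}
    for (ba, ca), (bb, cb) in product(A.items(), B.items()):
        s, blade = blade_mul(ba, bb)
        out[blade] = out.get(blade, 0) + s * ca * cb
    return {k: v for k, v in out.items() if v != 0}

x, y = {(1,): 1}, {(2,): 1}
assert gp(x, x) == {(): 1}               # x x = 1
assert gp(x, y) == {(1, 2): 1}           # x y = x ^ y
assert gp(y, x) == {(1, 2): -1}          # y x = -x y
```

<p>Note that the mixed grades appear immediately: the product of \(\b{x} + \b{y}\) with \(\b{x}\) already comes out as a scalar plus a bivector.</p>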
<p>Right away we’re confronted by the first problem. What does it even mean to have a “mixed-grade” multivector? The product of two vectors has a scalar part and a bivector part. Why?</p>
<p>I assume that the actual reason it happened, historically, is that it’s roughly what complex numbers and quaternions do already. Complex multiplication seems to involve two objects with different types:</p>
\[\begin{aligned}
(a + bi) (c + di) = (ac - bd) + (ad + bc)i
\end{aligned}\]
<p>And quaternion multiplication of e.g. two vectors seems to produce objects of mixed grades:</p>
\[\begin{aligned}
(a_1 \b{i} + a_2 \b{j} + a_3 \b{k}) (b_1 \b{i} + b_2 \b{j} + b_3 \b{k}) &= a_1 b_1 \b{i}^2 + a_2 b_2 \b{j}^2 + a_3 b_3 \b{k}^2 \\
&+ (a_1 b_2 - a_2 b_1) \b{ij} + (a_2 b_3 - a_3 b_2) \b{jk} + (a_3 b_1 - a_1 b_3) \b{ki} \\
&= - \b{a} \cdot \b{b} + \b{a} \times \b{b}
\end{aligned}\]
<p>Where the first part is a scalar (don’t mind the minus sign, that’s quaternions being weird<sup id="fnref:quaternion" role="doc-noteref"><a href="#fn:quaternion" class="footnote" rel="footnote">9</a></sup>) and the second part is a vector written in the basis \((\b{i}, \b{j}, \b{k}) = (\b{jk}, \b{ki}, \b{ij})\).</p>
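<p>That identity is easy to check numerically (a sketch; the \((w, x, y, z)\) Hamilton-product convention and the function name are assumptions of the example):</p>

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, v1 = p[0], np.asarray(p[1:])
    w2, v2 = q[0], np.asarray(q[1:])
    return np.array([w1 * w2 - v1 @ v2,
                     *(w1 * v2 + w2 * v1 + np.cross(v1, v2))])

a, b = np.array([2., 3., 5.]), np.array([7., 11., 13.])
prod = quat_mul([0., *a], [0., *b])          # product of two "pure" quaternions
assert np.isclose(prod[0], -(a @ b))         # scalar part: -a.b
assert np.allclose(prod[1:], np.cross(a, b)) # vector part: a x b
```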
<p>Still, you have to explain what your geometric algebra is doing with mixed-grade objects. Do they… mean something? What is the scalar part? What would it mean to have a sum of a scalar, vector, bivector, and pseudoscalar? Or are they just formal linear combinations of things with no meaning? What is going on?</p>
<p>Not only that, you need the mixed-grade objects to actually be <em>better</em> than they were before you wrote them that way. For instance you <em>can</em> write the electromagnetic field as \(\b{F} = \b{E} + \b{I} \b{B}\). But should you? Probably not. \(\b{E}\) is better understood as being a \(\b{x} \^ \b{t}\) bivector while \(\b{B}\) is an \(\b{x} \^ \b{y}\) bivector, so they’re both bivectors. The mixed-grade interpretation only makes sense if you confine yourself to \(\bb{R}^3\) for some reason. There are <a href="https://math.stackexchange.com/questions/3805595/are-there-any-geometrically-meaningful-useful-mixed-grade-objects-in-geometric-a">other examples</a> of rewriting things as mixed-grade objects, but, notably, none of them… seem good? Writing equations in terms of mixed-grade multivectors in general doesn’t <em>tell you anything useful</em>. You can’t “think in them”. Or at least, I can’t.</p>
<p>The approximate answer to “why is GA using mixed-grade objects and multiplying them?” is that it tends to end up expressing a lot of <em>operations</em> on multivectors as multivectors themselves. For example it will regard a unit vector as a reflection operator, or a scalar + bivector as a rotation operator. In this scheme, the product of two multivectors is generically interpreted as the composition of these operators. That is fine!</p>
<p>However, GA is not very forthright about the fact that it is doing this, and will happily go on talking about mixed-grade multivectors that refer to geometry primitives, often by just saying “multiply these with the GP, then take only the grade-2 part to get their area” and things like that. As far as I know there is no reason to do this except that they really like the GP! And it is very offputting when it is used this way: if you wanted their wedge product, just write their wedge product; don’t tell me to apply the GP to produce something that’s partly meaningless and then extract the meaningful part from that. (And anyway, if your goal is to pre-multiply vectors in a generic way and then extract useful components out of the result… you should be using the tensor product, not the geometric product, right?)</p>
<p>So that’s a problem: <strong>there is no good general interpretation or usage for the geometric product or mixed-grade multivectors</strong>. There are usages and interpretations in special cases, but the generic operation is not meaningful. Yet it is used everywhere as the fundamental object of the theory. (For instance <a href="https://math.stackexchange.com/questions/1535878/visualizing-the-geometric-product">here</a> are some people struggling to find a general interpretation of the GP.) It is very awkward that the basic geometric operation in the geometric algebra that people espouse because they’re trying to make everything geometrically intuitive… is not very geometrically meaningful on its own.</p>
<p>Incidentally, you would not want to actually use the geometric product to do these calculations, like, numerically. If you want to calculate dot products, wedge products, rotations, reflections, etc, or especially if you want to program them into a computer, the last thing you want to do is implement them as arbitrary products of mixed-grade multivectors and then project out certain terms at the end that you care about. Because of course you don’t: you really want to just implement the actual operation you were trying to use; doing it the other way would be both tedious and a giant waste of memory and computational power. The reason you would use the GP is when all your objects are geometric operations that are already expressed as mixed-grade multivectors, so you can commute and anti-commute the terms in their components to compose them. In that case, go for it. But it is not like you want to be actually using the GP on a computer to perform operations that GA defines in terms of it, such as dot or wedge products. Nor would you want to use it to perform basic operations by hand. Basically the GP is useful for algebraic manipulations, not numeric ones.</p>
<hr />
<h3 id="rotations-and-reflections">Rotations and Reflections</h3>
<p>As I said above, the main place that the GP’s behavior makes some sense is when the multivectors are being regarded as operators on geometric objects, rather than the geometric objects themselves. For instance:</p>
<p>(1) A basic rotation as implemented by exponentiating a bivector:</p>
\[e^{\theta (\b{xy})} = \cos \theta + (\b{xy}) \sin \theta\]
<p>Which operates on vectors like so:</p>
\[e^{\theta (\b{xy})}(\b{x}) = \b{x} \cos \theta - \b{y} \sin \theta\]
<p>(2) Or a better type of rotation is implemented by sandwiching an object between two “rotors”, which are half-angle rotations (which is necessary to produce the correct <a href="https://en.wikipedia.org/wiki/Rodrigues%27_rotation_formula">Rodrigues formula</a> for rotations in \(>2\) dimensions<sup id="fnref:Rodrigues" role="doc-noteref"><a href="#fn:Rodrigues" class="footnote" rel="footnote">10</a></sup>):</p>
\[\b{v} \mapsto e^{\theta \b{B}/2} \b{v} e^{-\theta \b{B}/2}\]
<p>The intermediate object in this case is \(e^{\theta \b{B}/2} \b{v} = \cos(\theta/2) \b{v} + \sin(\theta/2) \b{B} (\b{v}_{\parallel} + \b{v}_{\perp})\). If \(\b{v}_{\perp}\) is perpendicular to the plane of rotation then \(\b{B} \b{v}_{\perp}\) becomes a trivector temporarily before being turned back into a vector by the second copy of \(\b{B}\).</p>
<p>(3) The reflection operator \(-\b{n} \b{v} \b{n} = \b{v}_{\perp n} - \b{v}_{\parallel n}\) which reflects a vector along the unit vector \(\b{n}\).</p>
<p>In each case we are using multivectors, and constructing intermediate mixed-grade multivectors, in order to transform a vector in some way (and there are extensions to multivectors). What seems to happen is that the scalar term that shows up in the geometric product in each case is responsible for performing the “identity” part of the operation, which leaves its argument unchanged, while the bivector (or whatever) term is responsible for the part that gets transformed. Then the GP implements “composition” of these operators.</p>
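<p>The half-angle sandwich in (2) is exactly how quaternion rotation works, which makes for a quick numerical sketch (the \((w, x, y, z)\) Hamilton-product convention and the function names are mine):</p>

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    w1, v1 = p[0], np.asarray(p[1:])
    w2, v2 = q[0], np.asarray(q[1:])
    return np.array([w1 * w2 - v1 @ v2,
                     *(w1 * v2 + w2 * v1 + np.cross(v1, v2))])

def rotate(v, axis, theta):
    """Rotate v by theta about the unit axis via the half-angle
    sandwich r v r^{-1}, with r = cos(theta/2) + sin(theta/2) * axis."""
    r = np.array([np.cos(theta / 2), *(np.sin(theta / 2) * np.asarray(axis))])
    r_inv = r * np.array([1, -1, -1, -1])   # conjugate = inverse for unit r
    return quat_mul(quat_mul(r, [0., *v]), r_inv)[1:]

# rotating x by 90 degrees about z gives y
assert np.allclose(rotate([1., 0., 0.], [0., 0., 1.], np.pi / 2),
                   [0., 1., 0.])
```

<p>The full angle \(\theta\) only reappears after the second half of the sandwich; the intermediate object \(r\b{v}\) is one of those mixed-grade things with no geometric meaning of its own.</p>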
<p>This becomes more clear in some of the more “exotic” geometric algebras out there: in practice GA people like to add more basis vectors, creating Clifford algebras like \(Cl_{3,1}\) which has three basis vectors that square to \(+1\) and one that squares to \(-1\), or \(Cl_{3,0,1}\) that has three \(+1\)s and one that squares to \(0\). Each of these produces a different sort of “algebra of operations”, and in each case the geometric product is used to compose them. Versions of this produce geometric algebras that include as primitives things like translations, Lorentz transformations, or screw-motions, and then their geometric product composes those operations.</p>
<p>The <em>fact that you can do this</em> is certainly cool and neat, and profitable if you need to compose a lot of those operations. <strong>But GA tends to act like this algebra which it has constructed to perform operations on a geometry… “is” the “right” way to do geometry</strong>. Really it’s just an implementation detail. If GA replaces all the vectors with planar reflections… well, vectors are still a thing, as is their wedge product. The fact that you built operators out of a quirky reinterpretation doesn’t make the old things go away. GA’s tendency to act like it is “better” than other approaches is very alienating: they can all do the same things, and GA has just picked a few things and turned them into primitives, at the cost of making other things more complex. The GA in use is <em>not</em> the canonical algebra of basic vectorial objects, but the algebra of a certain class of vectorial transformations on those objects that were chosen for the problem at hand.</p>
<p>I strongly believe that if GA would make this distinction they would lose a lot fewer people. It is a completely interesting and useful thing to talk about “a representation of a particular class of operations that makes composition and inversion easy”, and completely offputting when you blur the distinction between operators and geometric objects themselves, and write every operation in terms of the geometric product when only a few of them are really compositions of operators.</p>
<hr />
<p>It’s worth considering what things would look like in a different model. The normal non-GA way to model a rotation operation, for instance, is with the <a href="https://en.wikipedia.org/wiki/Exponential_map">exponential map</a> of a generator \(R_{xy}\):</p>
\[\begin{aligned}
e^{\theta R_{xy}} (\b{x}) &= (I \cos \theta + R_{xy} \sin \theta) \b{x} \\
&= I(\b{x}) \cos \theta + R_{xy}(\b{x}) \sin \theta \\
&= \b{x} \cos \theta + \b{y} \sin \theta \\
\end{aligned}\]
<p>Where \(I\) is the identity operator. (Don’t mind the sign change compared to GA’s version, it’s basically a choice of convention.) \(R_{xy}\) is the generator of rotation that simply performs the basic operation, while the exponential map “smears it out” and applies it over and over in infinitesimal amounts. \(R_{xy}\) may be written as a matrix:</p>
\[\begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0\end{pmatrix}\]
<p>but it’s fine and perhaps better, to leave it as a symbol: the matrix is just a representation of it in a particular basis.</p>
<p>The operator version of the exponential map produces an object whose two components have the same type: both are “operators that map vectors to vectors”. Whereas the GA version produces two objects with different types: a scalar and a bivector, which both happen to give a vector when multiplying a vector. In the operator version, the first term happens to be the identity operator, which you <em>could</em> write as a scalar \(1\)… but it seems more natural to me that both \(e^{\theta R_{xy}}\) and its expansion in terms of \(\cos \theta\) and \(\sin \theta\) are of the “same type” throughout. And although the identity operator \(I\) could be written as \(1\), it is just as good to regard it as a tensor product \(\b{x} \o \b{x} + \b{y} \o \b{y} + \b{z} \o \b{z}\). Either way, GA’s trick of “removing the vector part, then putting it back” is just… weird, I guess?</p>
<p>More to the point, these objects have the same algebra. If you write out your rotations as operators or geometric-products of mixed-grade multivectors, they do the same thing. The choice of representation is there for its utility, not for its underlying mathematical truth, and <em>pretending</em> like it is mathematical truth is disingenuous and offputting.</p>
<p>Also, I happen to find the operator version a lot more appealing. Sure, it is interesting that GA’s version works, but since the intermediate objects aren’t interpretable as actual geometric primitives (like: a sum of a scalar and a bivector is not a thing in the world of “vectorial directions, areas, and volumes”—only in the world of “operators”), it is unsatisfying. Operators are a slightly different thing than multivectors, and the distinction is important. They have different “types”. Conceptually, vectors are not rotations or reflections or translations on their own; multivectors are not rotations on their own. But they can be put in <em>correspondence</em> with rotations or reflections or translations, yes, for instance we use unit bivectors for the purposes of defining the planes that rotations happen in. But I think it is a mistake to <em>identify</em> them with rotations and other operators, and everything else goes awry as a result.</p>
<p>This happens in \(\bb{C}\) as well, by the way. We learn to regard \(a + bi\) as an operation on other complex numbers \(r e^{i \theta}\), which rotates and scales them, but that is actually… pretty weird? Most of the time we think of complex numbers as vectors in \(\bb{R}^2\) or as rotation+scaling operators, but rarely do we actually want them in both roles at the same time. So it is not very natural to equate the two objects, as opposed to finding a correspondence between them.</p>
<p>Well, GA would phrase this as the vector interpretation being \(a \b{x} + b \b{y}\) and the operator interpretation as \(a + b I\). But I would argue that even the bivectors and scalars should not be interpreted as operators either. Bivectors are not operators: they’re elements of a vector space that models units of area in planes. If the plane is created by two geometric rays then the unit of area is a vectorial representation of a patch of area. If the plane is created by two operations on vectors, then the unit of area is a vectorial representation of some sort of antisymmetrized product of those operators. That’s all fine! They’re just <em>different spaces</em> that have similar algebras. Rotations can be <em>implemented</em> with them, yes, because rotations take place in planes, but they are not the same thing: bivectors-as-vectorial-areas only become rotation operators <em>when you contract with one of their indexes</em>, which is a separate step that you would perform on purpose.</p>
<p>So GA ends up being very stuck because it equates “vectorial objects” and “operators that act on vectorial objects”. It would be better to express all the geometric objects you care about in their most natural forms, and then find isomorphisms between them when it’s necessary to do so. Otherwise all the meanings get blurred together and it’s very confusing. So that’s another problem with geometric algebra: <strong>eliding the distinction between vectors and operators is undesirable, confusing, and disingenuous</strong>. The GP is only geometrically meaningful, to my knowledge, in the context of “representations of certain classes of geometric operators as implemented in particular Clifford Algebras”, and treating it like it is some general-case thing turns a lot of people away from the start.</p>
<hr />
<h3 id="weird-formulas">Weird Formulas</h3>
<p>A related problem is that even when you <em>do</em> treat multivectors as operators, the interpretations are… kinda weird? Consider the reflection operation:</p>
\[P_{\b{n}}: \b{v} \mapsto - \b{n} \b{v} \b{n}\]
<p>Where \(\b{n}\) is a unit vector that we’re reflecting along the axis of. This works because if you decompose \(\b{v} = \b{v}_{\parallel n} + \b{v}_{\perp n}\) you can see that it flips the parallel part but not the perpendicular part (recall that parallel vectors have zero wedge product while perpendicular vectors have zero dot product, or in GA terms, parallel vectors commute while orthogonal vectors anticommute):</p>
\[\begin{aligned}
P_{\b{n}}(\b{v}) &= - \b{n} \b{v} \b{n} \\
&= - \b{n}( \b{v}_{\parallel n} + \b{v}_{\perp n}) \b{n} \\
&= - \b{n} \b{n} \b{v}_{\parallel n} + \b{n} \b{n} \b{v}_{\perp n} \\
&= \b{v}_{\perp n} - \b{v}_{\parallel n}
\end{aligned}\]
<p>It’s neat that that works. But is it a good formula; does it make any sense? Not… really? Why would you reflect a vector by sandwiching it with a unit vector and adding in a minus sign? I doubt you could have guessed that formula without already knowing that it works, or by fiddling around with the geometric product for a while. And knowing it doesn’t really teach you how to write down any other formulas. The operator version is something you can build out of primitives that you know (I mean, if we were developing geometric algebra with operators we would have already defined the projection \(\b{v}_{\parallel n}\) and rejection \(\b{v}_{\perp n}\) operators at this point.)</p>
\[P_{\b{n}}(\b{v}) = \b{v}_{\perp n} - \b{v}_{\parallel n}\]
<p>A bit kludgy, but the meaning is clear. The GA representation is just that: a <em>representation</em>, in a particular algebra, that happens to work. But it is not a “natural” way to express the operation for most people’s purposes.</p>
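<p>The operator version is a one-liner in plain linear algebra (a sketch; the function name is mine):</p>

```python
import numpy as np

def reflect_along(v, n):
    """v -> v_perp - v_par: flip the component of v along the unit
    vector n, keep the rest. The same map GA writes as -n v n."""
    v = np.asarray(v, dtype=float)
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    v_par = (v @ n) * n          # projection of v onto n
    return v - 2 * v_par         # (v - v_par) - v_par = v_perp - v_par

assert np.allclose(reflect_along([3., 4., 5.], [1., 0., 0.]),
                   [-3., 4., 5.])
```

<p>Every step has an obvious geometric reading, which is exactly what the sandwich formula lacks.</p>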
<p>So that’s another complaint: <strong>Geometric Algebra’s sleek formulas, when it has them, don’t provide much useful geometric intuition</strong>. They’re just things you memorize. <em>Maybe</em> there’s a way to intuit the reflection formula if you think of all unit vectors as being reflection operations, but why bother? You’ll get more intuition out of operators.</p>
<hr />
<h3 id="ga-in-physics-pauli-and-gamma-matrices">GA in Physics: Pauli and Gamma Matrices</h3>
<p>In the standard GA over \(\bb{R}^3\), once you have defined the weird GP on vectors, the next step is to define the regular useful operations of EA in terms of it:</p>
\[\b{a} \cdot \b{b} = \frac{1}{2}(\b{a}\b{b} + \b{b}\b{a})\]
\[\b{a} \^ \b{b} = \frac{1}{2}(\b{a}\b{b} - \b{b}\b{a})\]
<p>This construction is appealing to people who came from physics because their first exposure to exterior algebra was probably in the form of the <a href="https://en.wikipedia.org/wiki/Pauli_matrices">Pauli matrices</a> (which show up in the quantum mechanics of a non-relativistic electron) and the <a href="https://en.wikipedia.org/wiki/Gamma_matrices">Gamma matrices</a> (which show up in the <a href="https://en.wikipedia.org/wiki/Dirac_equation">Dirac Equation</a> for relativistic electrons and positrons).</p>
<p>The gamma matrices, famously, have their symmetric product equal to the (Minkowski) metric \(\eta^{\mu \nu} = \text{diag}(1, -1, -1, -1)\):</p>
\[\{ \gamma^\mu, \gamma^\nu \} = \gamma^\mu \gamma^\nu + \gamma^\nu \gamma^\mu = 2 \eta^{\mu \nu}\]
<p>Which, after some unpacking of the notation, says that \(\gamma^0 = \b{t}\) and \(\gamma^i = \b{x}^i\) and that \(\b{t} \cdot \b{t} = +1\) while \(\b{x} \cdot \b{x} = \b{y} \cdot \b{y} = \b{z} \cdot \b{z} = -1\). Essentially these are the objects you need if you want to “square root” the wave operator \(\p_t^2 - \p_x^2 - \p_y^2 - \p_z^2 = (\gamma^0 \p_t + \gamma^1 \p_x + \gamma^2 \p_y + \gamma^3 \p_z)^2\). Somehow it works but good luck figuring out what it means! It turns out that the total set of Gamma matrices has 16 elements which correspond to the 16 elements of \(\^^4 \bb{R}^{1, 3}\): one scalar (the identity), four vectors (the regular Gamma matrices), six bivectors (the commutators of Gamma matrices), four trivectors (etc), and one pseudoscalar (the object usually written \(\gamma^5\)). And their multiplication rule is exactly the Clifford algebra on those objects with metric signature \((+1, -1, -1, -1)\). That is actually cool and interesting: apparently the Gamma matrices <em>do</em> implement the Clifford algebra \(\text{Cl}_{1,3}\), which GA calls the “spacetime algebra”. Evidently whatever a bispinor “is”, vectors act on them by multiplication using the Gamma matrices as their representation.</p>
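<p>The anticommutation relation is easy to verify numerically in the standard Dirac representation (a sketch; the helper name is mine):</p>

```python
import numpy as np

# Pauli matrices
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]], dtype=complex)
Z = np.zeros((2, 2), dtype=complex)

def block(a, b, c, d):
    """Assemble a 4x4 matrix from four 2x2 blocks."""
    return np.block([[a, b], [c, d]])

# gamma matrices in the Dirac representation
gamma = [block(np.eye(2), Z, Z, -np.eye(2))] + \
        [block(Z, s, -s, Z) for s in (sx, sy, sz)]
eta = np.diag([1., -1., -1., -1.])

# check {gamma^mu, gamma^nu} = 2 eta^{mu nu} I for all pairs
for mu in range(4):
    for nu in range(4):
        anti = gamma[mu] @ gamma[nu] + gamma[nu] @ gamma[mu]
        assert np.allclose(anti, 2 * eta[mu, nu] * np.eye(4))
```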
<p>So if you were coming from knowing about Pauli matrices and Gamma matrices, finding out that there’s a way to interpret them all as basic elements of an intuitive algebra is very appealing. This, I think, is the reason Hestenes, Doran, Lasenby, etc were very interested in the geometric product as a generic tool for building vector algebra in the first place; their early papers are very targeted at physicists who are frustrated with spinors and bispinors not making much sense.</p>
<p>I agree that rewriting the objects in terms of their Clifford Algebra is a good idea. But I don’t agree that this means you should rephrase all of geometry in terms of the Clifford Product / Geometric Product. The GP is provisionally useful for complex numbers, quaternions/Pauli matrices, and Gamma matrices, and fantastically useful in general for spinors (apparently!)… but that doesn’t apply anywhere else. So why would you go rewrite all of vector algebra in terms of it? Anyway I’m counting it as another problem with GA: <strong>The fact that the GP shows up for mysterious reasons in the physics of spinors is no reason to use it for the rest of geometry.</strong> For purposes other than spinor-related algebra, an operator-first formulation does everything you want without a magic bizarre product, and it’s not like the spinor algebra makes a lot of intuitive sense anyway. Writing everything the way it appears in spinor algebra does not provide intuition for anything.</p>
<p>(Incidentally, the fact that the Gamma matrices which convert between spinors and real numbers also obey a Clifford algebra is… really weird, isn’t it? I have trouble thinking of any kind of possible explanation that would lead to that. Each gamma matrix corresponds to a cardinal direction, their antisymmetric product gives a bivector, but also their symmetric product gives the identity—what could that possibly mean? It feels like it’s closely related to the, um, Divine Understanding of Spinors, the interpretation that Atiyah was talking about not having when he said “No one fully understands spinors.” Whatever they are, their symmetrization becomes the identity operator. It’s so <em>weird</em>.)</p>
<hr />
<h3 id="a-proliferation-of-operations">A Proliferation of Operations</h3>
<p>Not only are the GP and mixed-grade multivectors weird, GA has to invent a bunch of other weird operators just to <em>undo</em> their awkwardness. Such as the grade projection operator</p>
\[\<A\>_k = (\text{grade-}k \text{ component of } A)\]
<p>Or the “even” and “odd” grade projections:</p>
\[\<A\>^+ = \< A \>_0 + \< A \>_2 + \< A \>_4 + \ldots \\
\<A\>^- = \< A \>_1 + \< A \>_3 + \< A \>_5 + \ldots \\\]
<p>Or some really awkward definitions of every other kind of product:</p>
\[\begin{aligned}
A \, \lfloor \, B &= \sum_{r,s} \< \<A\>_r \<B\>_s \>_{s-r} \\
A \, \rfloor \, B &= \sum_{r,s} \< \<A\>_r \<B\>_s \>_{r-s} \\
A \ast B &= \sum_{r,s} \< \<A\>_r \<B\>_s \>_{0} \\
A \bullet B &= \sum_{r,s} \< \<A\>_r \<B\>_s \>_{\| s - r \|}
\end{aligned}\]
<p>None of this is necessary if you don’t use mixed-grade multivectors and the GP in the first place. And absolutely nobody wants to learn any identities involving these things. So that’s another complaint: <strong>there are way too many confusing definitions required when you base everything on the geometric product</strong>. I have to imagine that every single person who has gone to learn GA has been taken aback by this, and by the fact that the people writing about it don’t seem to have much of a problem with it. And actually I have only shown the tip of the iceberg here. Take a look, for instance, at the list of operations on the website <a href="https://projectivegeometricalgebra.org/">Projective Geometric Algebra.org</a>.</p>
<p>To be fair, the proliferation of operations is somewhat a problem in EA also: the <a href="/2019/01/27/exterior-4.html">Interior Product</a>, for instance, is fairly awkward to use, and like \(\lfloor\) and \(\rfloor\) above, there kinda need to be two versions of it if you want to apply it from either the left or right. But at least it has a fairly elementary interpretation in terms of simpler operations in the algebra: it is the adjoint of the wedge product under the inner product. And there are other operations that show up, like the Meet \(\vee\) which is dual to the wedge product. But GA has all of these <em>plus</em> its extra unnecessary stuff.</p>
<p>More generally I think it is better to construct all of these operations directly from the tensor algebra. I suspect that the “right way” (inasmuch as that phrase means anything) to think about vector algebra is to think of the two fundamental operations as being (1) the tensor product \(\b{a} \o \b{b}\) and (2) the dot product/trace \(\b{a} \cdot \b{b}\). Everything else is really constructed from these, and despite <a href="https://youtu.be/htYh-Tq7ZBI?t=1411">what people will say</a>, it is actually rather intuitive that the product of \((a_x \b{x} + a_y \b{y})(b_x \b{x} + b_y \b{y})\) would be \(a_x b_x \b{xx} + a_x b_y \b{xy} + a_y b_x \b{yx} + a_y b_y \b{yy}\). Or at least, it is certainly more intuitive than the geometric product. It is weird <em>at first</em> that multiplying two vectors would make a rank-\(2\) tensor, but it is not really different from the fact that multiplying two scalars with one unit each gives a scalar with two units: \(5 \text{m} \cdot 3 \text{s} = 15 \text{ m}\cdot\text{s}\).<sup id="fnref:units" role="doc-noteref"><a href="#fn:units" class="footnote" rel="footnote">11</a></sup> True, it is not invertible, but it shouldn’t be: it’s a very generic operation whose inverse is not a single value. If you produce a version of vector multiplication that <em>is</em> invertible, you have definitely erased information somewhere to make that possible, so it is certainly not the “true” meaning of vector multiplication.</p>
<p>Anyway, if you start with the tensor product, then your pedagogical task is to explain why anybody would then go and invent the exterior product from that, but that isn’t too bad: in some sense the dot product asks “what happens if you multiply two vectors and ignore the terms that aren’t parallel?” and then the wedge product asks “What happens if you multiply two vectors and ignore the terms that are parallel?”. Those are at least philosophical constructions, and although they’re not completely satisfying, they do pretty well.</p>
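<p>For two vectors this is concrete enough to compute directly: the dot product is the trace of \(\b{a} \o \b{b}\) (the “parallel” diagonal terms) and the wedge coefficient is its antisymmetric part (a sketch):</p>

```python
import numpy as np

a = np.array([2., 3.])
b = np.array([5., 7.])

T = np.outer(a, b)             # a (x) b: the generic rank-2 product
dot = np.trace(T)              # keep only the "parallel" diagonal terms
wedge_xy = T[0, 1] - T[1, 0]   # antisymmetric part: the a ^ b coefficient

assert np.isclose(dot, a @ b)                          # 2*5 + 3*7 = 31
assert np.isclose(wedge_xy, a[0]*b[1] - a[1]*b[0])     # 14 - 15 = -1
```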
<hr />
<h3 id="weird-associativity">Weird Associativity</h3>
<p>Here’s an objection I’ve never seen elsewhere: <strong>geometric product’s associativity is actually really awkward for doing basic linear algebra</strong>.</p>
<p>A geometric algebra is what you get when you take the tensor algebra on basis vectors \(\{\b{x}, \b{y}, \ldots \}\) and assert that \(\b{xx} = 1\) and \(\b{xy} = -\b{yx}\) everywhere (so, quotienting by relations of those forms)<sup id="fnref:definition" role="doc-noteref"><a href="#fn:definition" class="footnote" rel="footnote">12</a></sup>. (Or in other metric signatures, that \(\b{xx} = Q(x,x)\) or whatever.) The geometric product itself is what the tensor product \(\o\) becomes under this mapping. Naturally it is associative because \(\o\) is:</p>
\[(\b{ab})(\b{c}) = \b{a}(\b{bc}) = \b{abc}\]
<p>But this definition is actually really awkward. Look what it does to the “squares” of the multivectors:</p>
\[\begin{aligned}
(1)(1) &= 1 \\
(\b{x})(\b{x}) &= 1 \\
(\b{xy})(\b{xy}) &= -1 \\
(\b{xyz})(\b{xyz}) &= -1 \\
(\b{wxyz})(\b{wxyz}) &= 1
\end{aligned}\]
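<p>For what it’s worth, the sign is \((-1)^{k(k-1)/2}\): squaring a grade-\(k\) blade reverses \(k\) vectors past each other, which costs \(k(k-1)/2\) swaps before the \(\b{xx} = 1\) cancellations kick in. A quick sketch:</p>

```python
# square of a grade-k basis blade under the geometric product
# (all-positive signature): reversing k vectors costs k(k-1)/2 swaps
squares = [(-1) ** (k * (k - 1) // 2) for k in range(5)]
assert squares == [1, 1, -1, -1, 1]   # 1, x, xy, xyz, wxyz
```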
<p>What is going on? Well, every time you swap two adjacent basis vectors, you get a minus sign, and… the pattern is weird. You know what would work <em>way better</em> than that? If the product operated left-to-right, like tensor contraction i.e. the dot product already does:</p>
\[\begin{aligned}
(1) \cdot (1) &= 1 \\
(\b{x}) \cdot (\b{x}) &= 1 \\
(\b{xy}) \cdot (\b{xy}) &= 1 \\
(\b{xyz}) \cdot (\b{xyz}) &= 1 \\
(\b{wxyz}) \cdot (\b{wxyz}) &= 1
\end{aligned}\]
<p>But if you made the geometric product operate left-to-right, it doesn’t work, because the wedge product part actually <em>does</em> care about the ordering. That’s how it works in regular exterior algebra already: the dot product and wedge product are different operations that associate differently; the dot product contracts left-to-right while the wedge product is associative. So you end up making a choice between two conventions for your algebra:</p>
<ol>
<li>Make every basis multivector have norm \(1\), so that they all work according to basic linear algebra in intuitive ways, for instance projecting components out of a bivector with \(\b{B} \cdot (\b{x} \^ \b{y}) = B_{xy}\). Then add in some operators to implement the things that have to square to \(-1\), such as rotation operators \(R_{xy}^2 = -I\). Or,</li>
<li>Let basis multivectors square to \(\pm 1\) using the weirdly-associating geometric product, and then add more operations to the algebra that recover the idea of multivectors squaring to \(+1\) so you can do basic things like projecting bivectors onto their basis vectors \(\b{B} \cdot (\b{x} \^ \b{y}) \? -B_{xy}\) again.</li>
</ol>
<p>To me (1) sounds way better. Let linear algebra work like it should and just add in the parts you need to make operators that compose differently. Don’t break linear algebra just to make it look more like complex numbers!</p>
<p>And really, some of GA’s definitions of things in terms of the GP work <em>better</em> if you use left-to-right contraction. For example the weird minus sign in the rotation \(e^{(\b{x \^ y} )\theta} (\b{x}) = \b{x} \cos \theta - \b{y} \sin \theta\) up above goes away if \(\b{xy}(\b{x}) = \b{y}\). The definition of the cross product in terms of the wedge product becomes \(\b{a} \times \b{b} = I \cdot (\b{a} \^ \b{b})\) instead of \(-I (\b{a} \^ \b{b})\).</p>
<p>The way GA recovers the standard multivector inner product is with the “reversion operator”, which looks like</p>
\[A^{\dagger} = \< A \>_0 + \<A\>_1 - \<A\>_2 - \<A\>_3 + \<A\>_4 + \ldots\]
<p>Which just means that it reverses the order of vectors in a product:</p>
\[(\b{xyz})^{\dagger} = \b{zyx} = - \b{xyz}\]
<p>Such that the “standard” dot product of two multivectors (the one that returns \(1\) if they are the basis element) is implemented as</p>
\[(\b{xyz}) \cdot (\b{xyz}) = (\b{xyz})^{\dagger} (\b{xyz}) = 1\]
<p>It seems to me that they defined their operation to associate in the wrong way and then have had to construct this operation to undo the mistake.</p>
<p>Incidentally, reversion is basically a generalization of complex conjugation. GA likes the way that the GP associates because it preserves the “square root of \(-1\)” behavior of complex numbers and quaternions: \(i^2 = j^2 = k^2 = -1\). Then reversion is used to construct the vector norm again, which for complex numbers and quaternions is implemented with complex conjugation: \(\| a \|^2 = \bar{a} a = (a_x - a_y i) (a_x + a_y i) = a_x^2 + a_y^2\). I find it strange. It’s hard to just say that complex conjugation is not an important operation, but it’s also hard to say why it’s so important—that is, I can’t see a great philosophical argument for it. It is not an operation we really “want” to be using if we’re trying to make geometry simple and intuitive.<sup id="fnref:conjugation" role="doc-noteref"><a href="#fn:conjugation" class="footnote" rel="footnote">13</a></sup></p>
<p>I think once again the problem is the conflation of “vectors” and “operators on vectors”. Vectors themselves, or any multivectors, ought to have normal norms that square to \(1\). Operators on vectors, such as rotations and reflections, can square to whatever they need to square to; naturally \(R_{xy}^2 = -I\), which, fine, write as \(-1\) if you want to use a Clifford Algebra to represent it, and implement a version of complex-conjugation to extend this to your mixed-grade operators. But don’t go around telling people that for some reason bivectors that represent units of surface area <em>also</em> square to \(-1\), because that’s crazy.<sup id="fnref:partial" role="doc-noteref"><a href="#fn:partial" class="footnote" rel="footnote">14</a></sup></p>
<hr />
<h2 id="vector-division">Vector Division</h2>
<p>One other thing that GA emphasizes from early on is the fact that most of the time you can divide by vectors and (usually) multivectors:</p>
\[\b{v}^{-1} = \frac{\b{v}}{\|\b{v} \|^2}\]
<p>This is another one of the properties of complex numbers and quaternions that it attempts to extend to all vectorial objects. It almost makes sense: <em>if</em> you are treating all multivectors as operators on other multivectors under multiplication, then naturally they have an inverse (if they are not an implementation of a projection) which is given by something that looks like division. I’m fine with that part. My objection is just that blurring the distinction between the multivectors and operators in the first place is weird, so inverting them is weird also. If you describe this as “inverting a vector”, it is mysterious and weird. If you describe it as “inverting an operator (which is implemented as a vector in this particular algebra)” it is completely intuitive. So just do that!</p>
<p>For instance a rotation operator \(R_{xy}: \b{v} \mapsto \b{v} \cdot (\b{x} \^ \b{y})\) has inverse \(R^{-1}_{xy}\), which is of course a rotation in the same plane with the opposite orientation, hence implemented as \(R^{-1}_{xy} = R_{yx} = R_{-xy}\). Meanwhile the inverse of \(\b{x} \^ \b{y}\) under the dot product is the object \((\b{x} \^ \b{y})/\|\b{x} \^ \b{y}\|^2\). Etc. This perspective seems a lot more orderly and sensible to me, and it makes it completely clear how each inverse object should work with no magic.</p>
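<p>To make the operator picture concrete, here is a minimal numerical sketch (plain matrices, nothing GA-specific; the function name is mine): the rotation \(R_{xy}\) by an angle \(\theta\), written as an ordinary matrix on \(\bb{R}^3\), has inverse \(R_{yx} = R_{-xy}\), the same rotation with the opposite orientation.</p>

```python
import numpy as np

def rotation_xy(theta):
    """Rotation by theta in the xy-plane of R^3, as a plain matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

theta = 0.7
R = rotation_xy(theta)

# the inverse operator is just the oppositely-oriented rotation R_{yx} = R_{-xy}
R_inv = rotation_xy(-theta)

assert np.allclose(R @ R_inv, np.eye(3))
assert np.allclose(np.linalg.inv(R), R_inv)
```

<p>Note also that \(R(\alpha) R(\beta) = R(\alpha + \beta)\): composition of operators, which is the kind of thing the geometric product is actually implementing.</p>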
<p>I will say that at least one type of vector division shows up all over math. It is basically part of the operation of “projection”:</p>
\[(\b{v} \cdot \b{a}^{-1}) \b{a} = (\frac{\b{v} \cdot \b{a}}{\| \b{a} \|^2}) \b{a} = \text{proj}_{\b{a}}(\b{v})\]
<p>It might seem weird to call this “division” since it does not exactly invert a particular multiplication operation. But I think it is a good <em>generalization</em> of division. In particular it has the correct behavior if the vectors are parallel, because \((a \b{x}) \cdot (b \b{x})^{-1} = a/b\), and in other cases its behavior is fairly easy to interpret: basically it divides the parallel parts and drops the non-parallel parts. (In other metrics there are also some concerns about zero-divisors, but whatever, just don’t try to invert those.)</p>
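<p>A small numpy sketch of this (the helper name <code>vinv</code> and the specific vectors are mine): the “inverse” \(\b{a}^{-1} = \b{a}/\|\b{a}\|^2\) turns the dot product into the projection, and reduces to ordinary scalar division for parallel vectors.</p>

```python
import numpy as np

def vinv(a):
    """'Inverse' of a vector under the dot product: a / ||a||^2."""
    return a / np.dot(a, a)

a = np.array([1.0, 2.0, 2.0])   # ||a||^2 = 9
v = np.array([3.0, 0.0, 3.0])

# (v . a^{-1}) a is exactly the projection of v onto a
proj = np.dot(v, vinv(a)) * a
assert np.allclose(proj, np.dot(v, a) / 9 * a)

# for parallel vectors it reduces to scalar division: (3a) . (2a)^{-1} = 3/2
assert np.isclose(np.dot(3 * a, vinv(2 * a)), 1.5)
```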
<p>Of course anything you might do with the \(\b{a}^{-1}\) notation you can also do without it, but I think it’s rather elegant how it takes care of handling the factors of \(\| \b{a} \|\) for you. For instance it gives a neat way to factor a vector \(\b{v}\) into components along an orthogonal set of vectors \(\{\b{a}, \b{b}, \b{c}\}\) even if they are not unit vectors:</p>
\[\b{v} = (\b{v} \cdot \b{a}^{-1}) \b{a} + (\b{v} \cdot \b{b}^{-1}) \b{b} + (\b{v} \cdot \b{c}^{-1}) \b{c}\]
<p>Which may as well be written as</p>
\[\b{v} = v_a \b{a} + v_b \b{b} + v_c \b{c}\]
<p>By using this “vector division” instead of just a regular dot product, we cancel out the magnitude of the \(\{ \b{a}, \b{b}, \b{c} \}\) elegantly. In a way this is treating \(\{ \b{a}, \b{b}, \b{c} \}\) as a matrix \(\text{diag}(a,b,c)\) in a certain basis, then inverting it to get \(\text{diag}(a^{-1}, b^{-1}, c^{-1})\) in the same basis. (Of course, this only works due to the fact that the three vectors are orthogonal; otherwise you would get cross terms.)</p>
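<p>Here is the decomposition as a few lines of numpy (the orthogonal-but-not-orthonormal basis here is my own example):</p>

```python
import numpy as np

def vinv(a):
    """'Inverse' of a vector under the dot product: a / ||a||^2."""
    return a / np.dot(a, a)

# an orthogonal (but not orthonormal) basis of R^3
a = np.array([1.0,  1.0, 0.0])
b = np.array([1.0, -1.0, 0.0])
c = np.array([0.0,  0.0, 2.0])

v = np.array([3.0, 1.0, 4.0])

# v = (v . a^{-1}) a + (v . b^{-1}) b + (v . c^{-1}) c
v_a, v_b, v_c = (np.dot(v, vinv(e)) for e in (a, b, c))
recombined = v_a * a + v_b * b + v_c * c
assert np.allclose(recombined, v)  # the division handled the magnitudes for us
```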
<p>Anyway, I am just trying to make the case that GA’s notion of vector division is not on its own necessarily a bad idea. The basic construction shows up a lot in vector algebra, and if you are writing your operators as multivectors of course it is meaningful to invert them. But it gets confusing when you start conflating operators and primitives and doing a bunch of algebra on the primitives. “Inverting a vector” is basically not meaningful, while “inverting a translation” is, and pretending like they are the same is pedagogically and philosophically unsound. But it seems to me like a perfectly sound interpretation is waiting just a few steps away.</p>
<hr />
<h1 id="3-summary">3. Summary</h1>
<p>I have given a lot of reasons why I think GA is problematic: the Geometric Product is a bad operation for most purposes. It really implements operator composition and is not a very fundamental or intuitive thing. Using a Clifford Algebra to implement geometry is an implementation detail, appropriate for some problems but not for general understandings of vector algebra and all of geometry. Giving it first-class status and then bizarrely acting like <em>that is not weird</em> is weird and alienating to people who can see through this trick.</p>
<p>Nor should we be trying to make everything look more like complex numbers and quaternions. Those are already weird and confusing; we should be moving away from them! Don’t call the geometric product “the” way to multiply vectors. Stop fixating on the geometric product or on some particular \(Cl_{p,q,r}\) that solves everything with a bunch of funky formulas for basic stuff. Just teach wedge products and operators and keep it simple; stick to the good parts! Treat the Clifford Algebras as what they are: implementations of the compositions of particular operations in a particular notation. Not a replacement for the rest of geometry.</p>
<p>So for the time being I have to reject GA as a thing that I identify with. Fortunately, it is a philosophy, not a mathematical theory, so it’s easy to reject. That’s why when I write blog posts about the same basic ideas, and which align with the same basic philosophy of recasting mathematics in a more geometric and multivectorial form, I use the phrase “Exterior Algebra” instead.</p>
<p>That said, I really do think there’s a lot more to discover here. I’m convinced that there’s some unifying theory of vector algebra that will tie this all together with a bow, and I’m hoping someone finds it, preferably soon. Among other things it will explain exactly why the geometric product <em>does</em> work, when it does, and also why so many other formulas end up looking suggestive and interesting and imply that for instance we can sometimes divide and multiply vectors like they’re numbers in a bunch of cases. When it does come along maybe we can call it “Geometric Algebra” again; it’s a good name. Or maybe “Geometric Algebra 2.0”, or “New Geometric Algebra”, or “Geometrical Algebra”. Or maybe we drag the name “Clifford Algebra” down out of the clouds and make it accessible to everyone. Whatever you want! But in the meantime, I’m not interested in using the name GA, because I think that name is attached to a bunch of bad ideas. The geometric product, its associated ontology, and the culture around it… are just mucking everything up. Thanks.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:ga" role="doc-endnote">
<p>Some people disagree about this, but it is <em>definitely</em> how the term “GA” is used in colloquial parlance, I swear. This is despite the fact that Clifford himself called the operation the “geometric product”. Quoting Hestenes: “Do not confuse Geometric Algebra (GA) with Clifford Algebra (CA)!” That said, this is all shifting over time, especially as the GA movement and the Clifford Algebra research world do more cross-pollination, and as more people learn about GA without interacting with the movement itself (particularly in computer graphics). <a href="#fnref:ga" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:story" role="doc-endnote">
<p>The story is from “Grassmann’s legacy” from the 2011 book “From Past to Future: Grassmann’s Work in Context”. Riesz’s lecture notes are available most readily in a 1993 book “Clifford Numbers and Spinors”. <a href="#fnref:story" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:traction" role="doc-endnote">
<p>Many details are available in Hestenes’ essay “The Genesis of Geometric Algebra: A Personal Retrospective”, although dang do I wish he would stop acting like all his work is the greatest stuff on earth. <a href="#fnref:traction" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:lounesto" role="doc-endnote">
<p>Lounesto is kind of hilarious. He’s apparently a very blunt Finnish guy who went around finding errors in everyone else’s publications about Clifford Algebras and collecting them all on <a href="https://users.aalto.fi/~ppuska/mirror/Lounesto/counterexamples.htm">his website</a>. <a href="#fnref:lounesto" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:complex" role="doc-endnote">
<p>The answer, at minimum, is to at least try to name what space they’re supposed to be rotations in, even if just to give it a name like \(X\), and then write \(R_{X}\) instead of \(i\). Are all the \(i\)s of QM experimentally proven to be in the same space? I’m not sure anybody knows. Is it the \(U(1)\) gauge field of E&M? Your intro quantum book doesn’t mention it; it treats them as axiomatic. <a href="#fnref:complex" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:vixra" role="doc-endnote">
<p>For instance try searching <a href="https://vixra.org/">viXra</a>, that is, crank ArXiv, for the phrase “geometric algebra”. (Aside: if I ever have a beautiful Theory of Everything to share with the world it has occurred to me that it would be funny to post it on ViXra instead of somewhere reputable, just to confuse everyone. Don’t write them off completely!) <a href="#fnref:vixra" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:reddit" role="doc-endnote">
<p>For instance the most common stance on the r/math subreddit looks like <a href="https://www.reddit.com/r/math/comments/1ghw4b/why_isnt_geometric_algebra_more_widely_taught/">this one</a>: “From what I have seen, Geometric Algebra is just a rehashing of existing math.” Which, yes, I agree, but the point is to make the existing math more intuitive, not to discover new results. The fact that research mathematics is generally <em>not</em> concerned with making calculation and intuition easier to think about is, I think, a giant failure that it will eventually regret. There’s as much value in making things easy to use as there is in discovering them. At this point probably more. Picture if nobody had started teaching non-mathematicians calculus because it was just for experts—it feels like that. <a href="#fnref:reddit" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:communication" role="doc-endnote">
<p>Well, communicating it at the right level and in the right notations is the trick. And also, arguably the theory <em>isn’t</em> quite there and that’s part of the problem, too. But with all these interested people surely we can work on that? <a href="#fnref:communication" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:quaternion" role="doc-endnote">
<p>The actual implementation of quaternions in geometric algebra regards it as the even subalgebra of the geometric algebra on \(\bb{R}^3\), with elements given by e.g. \(\b{i} = -I\b{x}\), that is, \(\b{i} = \b{zy}\), \(\b{j} = \b{xz}\), \(\b{k} = \b{yx}\). This is of course totally weird but it’s equivalent to how quaternions are implemented in Pauli matrices: \(\b{x} \mapsto -i \sigma_1\), etc. Quaternion multiplication follows from the GP: \(\b{ij} = (\b{zy})(\b{xz}) = \b{yx} = \b{k}\) and \(\b{ii} = (\b{zy})(\b{zy}) = -1\). But this mapping is basically arbitrary, and other mappings would also implement the same underlying algebra. <a href="#fnref:quaternion" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Rodrigues" role="doc-endnote">
<p>In general rotating with \(e^{\theta \b{B}}\) doesn’t work to rotate vectors, because—well look at it, it multiplies every term in the vector by either \(\cos \theta\) or \(\sin \theta\), and rotating a vector <em>should</em> leave an axis unchanged! The problem is that it only implements the rotation part of a rotation matrix, but not the \(1\) on the diagonal. Modeling rotations as rotors, on the other hand, handles things correctly: \(R_{\theta}(\b{v}) = e^{i \b{B}/2} \b{v} e^{-i\b{B}/2}\). That’s also how quaternions do rotations correctly. Note the similarity to a change of basis \(A \ra P A P^{-1}\) in linear algebra. Some people treat these rotors as examples of “spinors”, since they themselves rotate with only one rotor instead of two, which also makes people sometimes call spinors a sort of “square root of vectors”. <a href="#fnref:Rodrigues" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:units" role="doc-endnote">
<p>Actually, I think it is <em>exactly</em> the same as that, and whenever I finally write my book on all this I’m going to introduce the tensor algebra as a way of juggling multiple units at once rather than any sort of universal free multilinear multiplication… <a href="#fnref:units" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:definition" role="doc-endnote">
<p>Strangely, the texts on GA do a bad job of actually explaining that <a href="https://math.stackexchange.com/questions/444988/looking-for-a-clear-definition-of-the-geometric-product">that’s how it works</a>. Quoting MacDonald who wrote a well-known book on GA: “I do not think it possible to give a quick definition of the general geometric product.” Hmm. <a href="#fnref:definition" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:conjugation" role="doc-endnote">
<p>My rough understanding is that when \(\bb{R}\) gets algebraically completed by \(i = \sqrt{-1}\), there are really two possible values \(i\) and \(-i\) that satisfy \(i^2 = -1\). Therefore if we are solving any problem in \(\bb{R}\) with this value \(i\), the solution can’t care about the difference between \(+i\) and \(-i\), and you can interchange the two. That part is fine. But why, then, does multiplying \(z \bar{z}\) give a “magnitude” that works in a reasonable way? It is not very clear and certainly not natural compared to \(z \cdot z\). <a href="#fnref:conjugation" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:partial" role="doc-endnote">
<p>Aside: I wonder sometimes if EA is missing an operation in its toolbox which allows for contracting only <em>some</em> components of multivectors together while multiplying the rest, such that \((\b{x} \^ \b{y}) \cdot_1 (\b{x} \^ \b{y}) = - (\b{x} \o \b{x} + \b{y} \o \b{y}) = - I\). I wrote about this operation some <a href="/2020/10/15/ea-operations.html">here</a>, where I called it the “partial trace” because it is somewhat like <a href="https://en.wikipedia.org/wiki/Partial_trace">that operation</a> on tensors. But it is hard to think about because it clearly has to be able to create <em>non</em>-wedge product results (such as \(I\)), which are hard to incorporate into the overall algebra. (That is generally part of my stance on GA and EA both: there are some missing parts of the theory that are needed to make all the properties of vector algebra make sense, and they’re going to solve real intuitive problems a lot better than the GP does.) <a href="#fnref:partial" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Divergences and Delta Functions2023-10-24T00:00:00+00:00https://alexkritchevsky.com/2023/10/24/divergence<p>There’s an identity in electromagnetism which has been bugging me since college.</p>
<p>Gauss’s law says that the divergence of the electric field is equivalent to the charge distribution: \(\del \cdot \b{E} = \rho\). But in order to use this for a point charge—which is the most basic example in the subject!—we already don’t have the mathematical objects we need to calculate the divergence on the left or to represent the charge distribution on the right.</p>
<p>After all, the field of a point charge has to be \(\b{E} = q \hat{\b{r}}/4 \pi r^2\), and since its charge should be concentrated at a point it has to be a delta function: \(\del \cdot (q \hat{\b{r}}/4 \pi r^2) = q \delta(\b{x})\). In your multivariable-calculus-based E&M class you might mention this briefly, at best. Yet it is… kinda weird? And important? It feels like it should be a basic fact that lives inside a larger intuitive framework of divergences and delta functions and everything else.</p>
<!--more-->
<p>Why, in the first place, are we using this divergence operator that we didn’t know how to actually calculate—are we missing something? Are there <em>other</em> divergences that we don’t know how to calculate? Does it work the same way in other dimensions? What about other powers of \(\frac{1}{r}\)? Are there other derivative <em>operators</em> we don’t know about that do similar tricks? Is there an equivalent version for the curl and by extension the magnetic field? Is there an equivalent version for dipoles, or multipoles? Etc. (The answer to all of these questions is ‘yes’, by the way.)</p>
<p>Not only is it unsatisfying, it’s also hard to learn about. For years I’ve been referring back to this one <a href="https://www.physicsforums.com/threads/divergence-of-the-e-field-at-a-theoretical-point-charge.956012/">rather confusing physicsforum.com post</a>, and I’m pretty tired of reading that. It’s not even good! Griffiths and other E&M textbooks also mention it, but the treatment is obscured by pedagogy and most of the interesting parts are left as exercises… and even then they don’t have much to say. Meanwhile venerable Wikipedia’s treatment is very slim and spread out over many hard-to-navigate articles; the best one is probably <a href="https://en.wikipedia.org/wiki/Green%27s_function_for_the_three-variable_Laplace_equation">here</a> but it’s still not great.</p>
<p>So today’s the day: I’m going to figure this out in all the generalization I want and write myself the reference I have wanted so I never have to visit that forum post, or that one page of Griffiths, ever again.</p>
<hr />
<h2 id="1-the-basic-argument">1. The Basic Argument</h2>
<p>The first thing we learn in electrostatics is that the electric field of a point particle is</p>
\[\b{E} = \frac{q \hat{\b{r}}}{4 \pi r^2}\]
<p>That is, the field points radially out in every direction from the ‘infinitely concentrated’ point charge, and the magnitude falls off in inverse proportion to \(4 \pi r^2\). Non-coincidentally, \(4\pi r^2\) is the formula for the surface area of a sphere of radius \(r\). Evidently electric flux lines get weaker exactly in proportion to how much they “spread out”. It is as though you had a pipe whose input has to be equal to its output, except the input is at the origin and the output is “every direction at once”. Put differently, an electric charge is the source of a flux and then that flux fluxes around in exactly the way a flux has to flux around, which is: conservatively. A source of nonzero electric flux is what a charge <em>is</em>.</p>
<p>Which means that you can detect the presence of charges by measuring the flux around a volume. This is Gauss’s Law: that summing the electric flux through any closed surface measures the total charge contained within it.</p>
\[\oiint_{S} \b{E} \cdot d \b{A} = q_{\text{enclosed}}\]
<p>The divergence theorem turns Gauss’s Law into</p>
\[\iiint_V \del \cdot \b{E} \; dV = q_{\text{enclosed}}\]
<p>We also learn the differential form of Gauss’s Law, which says that the divergence \(\del \cdot \b{E}\) equals the charge distribution \(\rho(\b{x})\). For a point particle the integral’s value is entirely concentrated at the origin, so \(\rho(\b{x})\) has to be a delta function:</p>
\[\rho(\b{x}) = q \delta(\b{x})\]
<p>But we also know the functional form of \(\b{E}\) for a point charge: it’s \(q \hat{\b{r}} /4 \pi r^2\). Hence at least in \(\bb{R}^3\) it must be true that:</p>
\[\del \cdot \frac{\hat{\b{r}}}{r^2} = 4 \pi \delta^3(\b{x})\]
<p>Equivalently:<sup id="fnref:laplacian" role="doc-noteref"><a href="#fn:laplacian" class="footnote" rel="footnote">1</a></sup></p>
\[- \del^2 \frac{1}{r} = 4 \pi \delta^3(\b{x})\]
<p>We can also write this delta function in terms of \(r\):<sup id="fnref:spherical" role="doc-noteref"><a href="#fn:spherical" class="footnote" rel="footnote">2</a></sup></p>
\[4 \pi \delta^3(\b{x}) = - \del^2 \frac{1}{r} = \del \cdot \frac{\hat{\b{r}}}{r^2} = \frac{\delta(r)}{r^2}\]
<p>Which is neat, and also rather suspicious-looking. Seems like the more interesting identity here is that \(\delta^3 (\b{x}) = \delta(r) / 4 \pi r^2\), where the denominator is the surface area of a \(2\)-sphere.</p>
<p>It’s pleasing (since it’s pleasing when any integral is easy) that you can simply plug that into the equation for the electric field of an arbitrary charge distribution and recover Gauss’s law:</p>
\[\begin{aligned}
\b{E}(\b{x}) &= \frac{1}{4 \pi} \int \frac{\b{x} - \b{x}'}{\|\b{x} - \b{x}' \|^3} \rho(\b{x}') \d \b{x}' \\
\del \cdot \b{E}(\b{x}) &= \frac{1}{4 \pi} \int [\del \cdot \frac{\b{x} - \b{x}'}{\|\b{x} - \b{x}' \|^3} ] \rho(\b{x}') \d \b{x}' \\
&= \frac{1}{4 \pi} \int [4 \pi \delta^3(\b{x} - \b{x}')] \rho(\b{x}') \d \b{x}' \\
&= \int \delta^3(\b{x} - \b{x}') \rho(\b{x}') \d \b{x}' \\
&= \rho(\b{x})
\end{aligned}\]
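<p>The identity is also easy to sanity-check numerically: by the divergence theorem, the total divergence inside any region containing the origin equals the outward flux through its surface, which should come out to \(4\pi\) no matter what surface you pick. A quick sketch over the faces of a cube (all the specific numbers and names here are mine), using midpoint quadrature:</p>

```python
import numpy as np

def flux_through_cube(field, half=1.0, n=400):
    """Total outward flux of `field` through the surface of the cube [-half, half]^3."""
    t = (np.arange(n) + 0.5) / n * 2 * half - half   # midpoints of an n x n face grid
    u, w = np.meshgrid(t, t)
    dA = (2 * half / n) ** 2
    total = 0.0
    for axis in range(3):
        for sign in (+1.0, -1.0):
            pts = np.zeros((3,) + u.shape)
            pts[axis] = sign * half                  # points on the face x_axis = +/- half
            pts[(axis + 1) % 3] = u
            pts[(axis + 2) % 3] = w
            F = field(pts)
            total += sign * np.sum(F[axis]) * dA     # outward normal is sign * e_axis
    return total

def coulomb(pts):
    """The field r-hat / r^2 = r / |r|^3."""
    r = np.sqrt(np.sum(pts ** 2, axis=0))
    return pts / r ** 3

print(flux_through_cube(coulomb))   # ~ 4 pi ~ 12.566, despite the surface not being a sphere
```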
<hr />
<h2 id="2-the-other-definition-of-divergence">2. The Other Definition of Divergence</h2>
<p>Producing this result by working backwards from physics is good enough for most purposes, but it’s a bit perplexing. Maybe there’s a cleaner derivation?</p>
<p>I’ve looked around and there are some formal-ish <a href="https://math.stackexchange.com/questions/1335591/divergence-of-vecf-frac-hat-mathrmrr2">ways</a> to do it, by a procedure they call ‘regularizing’ \(\frac{\b{r}}{r^2}\) as a limit of a more complicated function like \(\b{r} /(r^2 + a^2)^{\frac{3}{2}}\), which is a way of producing distributions as a limit of non-distributions. I guess it’s rigorous, but I don’t want to do it. It doesn’t teach me anything new about divergences or delta functions at all. Plus it just feels unnecessary.<sup id="fnref:delta" role="doc-noteref"><a href="#fn:delta" class="footnote" rel="footnote">3</a></sup></p>
<p>Others <a href="https://math.stackexchange.com/questions/2136837/divergence-of-vecf-frac1r2-hatr">claim</a> that the divergence of \(\b{r}/r^2\) “is” undefined according to the usual definition, and that we’re just assigning a value to make the divergence theorem work. They’re obviously wrong: we’re not <em>inventing</em> a value; we’re <em>discovering</em> the actual value and it just requires delta functions to express. For the purposes of physics we don’t care at all about confining the space of objects we consider to just the standard-issue smooth functions. Evidently multivariable calculus <em>wants</em> distributions to get involved; we may as well let it happen.</p>
<p>The most satisfying explanation, in my opinion, is based on a different definition of divergence which isn’t used as much:</p>
<p>Recall that in multivariable calculus class we initially define divergence via the operator \(\del = \p_x \hat{\b{x}}+ \p_y \hat{\b{y}}+ \p_z \hat{\b{z}}\), so that \(\del \cdot \b{F} = \p_x F_x + \p_y F_y + \p_z F_z\) (or whatever this becomes in other coordinate systems). But there’s another definition which is really a more direct extension of the one-variable derivative<sup id="fnref:derivative" role="doc-noteref"><a href="#fn:derivative" class="footnote" rel="footnote">4</a></sup>. It looks like this:</p>
\[\del \cdot \b{F} = \lim_{V \ra 0} \frac{1}{\| V \|} \oint_{\p V} \b{F} \cdot d\b{n}\]
<p>That is, it’s a ratio of the flux through a volume surrounding the point divided by the volume itself, as the volume goes to zero. It’s actually a standard definition and is at the top of the Wikipedia page on divergence, but for whatever reason it doesn’t come up as often. To use it, you compute the volume in the denominator as a sphere or a cube or whatever you want. For instance if \(\b{F} = x \hat{\b{x}} + y \hat{\b{y}} + z \hat{\b{z}}= r \hat{\b{r}}\) and we integrate over a sphere, then</p>
\[\begin{aligned}
\del \cdot \b{F} &= \frac{1}{(4/3) \pi r^3} \oint (r \hat{\b{r}}) \cdot \hat{\b{r}} \, (r^2 \sin \theta \, d \theta \, d \phi) \\
&= \frac{4 \pi r^3}{(4/3) \pi r^3} \\
&= 3 \\
&= (\p_x \hat{\b{x}}+ \p_y \hat{\b{y}}+ \p_z \hat{\b{z}}) \cdot (x \hat{\b{x}} + y \hat{\b{y}} + z \hat{\b{z}})
\end{aligned}\]
<p>In many ways this is more intuitive! On the other hand I have no idea how to prove that it’s equivalent to \(\p_x \hat{\b{x}}+ \p_y \hat{\b{y}}+ \p_z \hat{\b{z}}\) in general, and it’s hard to google for because you just get results about proving the divergence theorem. Sigh. But it makes some sense. \(\del \cdot \b{F} = (\p_x, \p_y, \p_z) \cdot \b{F}\) acts like the same formula but implemented on a cube instead of a sphere.</p>
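<p>The equivalence can at least be spot-checked numerically: for a smooth field, the flux-over-volume ratio on a tiny cube matches the sum of partial derivatives. (The test field and the point are my own arbitrary choices.)</p>

```python
import numpy as np

def div_via_flux(field, p, h=1e-2, m=16):
    """Flux of `field` through a tiny cube centered at p, divided by its volume."""
    t = (np.arange(m) + 0.5) / m * h - h / 2         # face-midpoint offsets
    u, w = np.meshgrid(t, t)
    dA = (h / m) ** 2
    flux = 0.0
    for axis in range(3):
        for sign in (+1.0, -1.0):
            pts = np.tile(np.asarray(p, float).reshape(3, 1, 1), (1, m, m))
            pts[axis] += sign * h / 2                # the two faces normal to this axis
            pts[(axis + 1) % 3] += u
            pts[(axis + 2) % 3] += w
            flux += sign * np.sum(field(pts)[axis]) * dA
    return flux / h ** 3

F = lambda pts: np.array([pts[0]**2 * pts[1], pts[1] * pts[2], pts[0] * pts[2]**2])
div_F = lambda x, y, z: 2*x*y + z + 2*x*z            # the sum of partial derivatives

p = (0.3, -0.7, 0.5)
assert abs(div_via_flux(F, p) - div_F(*p)) < 1e-8    # the two definitions agree
```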
<p>Using this definition we can derive the weird equation from E&M as follows. The flux of \(\hat{\b{r}}/r^2\) through a sphere of radius \(\e\) is \(4 \pi \e^2 / \e^2 = 4 \pi\), and the same is true of any volume containing the origin. Therefore the limit is \(\del \cdot \hat{\b{r}} / r^2 = \lim_{V \ra 0} 4 \pi 1_{\mathcal{O} \in V}/\|V\|\), which, if you integrate it against test functions, acts like \(4 \pi \delta(\b{x})\). Something like that.</p>
<hr />
<h1 id="3-dealing-with-deltar-in-other-dimensions">3. Dealing with \(\delta(r)\) in other dimensions</h1>
<p>One nice thing about the integral definition is that it makes generalizations of the delta function divergence to other dimensions very natural: just integrate over different types of objects. In each case the coefficient is \(S_n\), the surface area of the unit sphere in \(\bb{R}^n\) (that is, of a unit \((n-1)\)-sphere), which you can <a href="https://en.wikipedia.org/wiki/N-sphere">look up</a>.</p>
\[\begin{aligned}
\del \cdot \frac{\hat{\b{r}}}{r^{n-1}} &= S_{n} \delta(\b{x}) \\
&= \frac{\delta(r)}{r^{n-1}}
\end{aligned}\]
<p>e.g. in \(\bb{R}^2\) with polar coordinates (so \(\b{r}_{xy} = x \hat{\b{x}} + y \hat{\b{y}}\)):</p>
\[\del \cdot \frac{\hat{\b{r}}_{xy}}{r_{xy}} = 2 \pi \delta(x,y) = \frac{\delta(r_{xy})}{r_{xy}}\]
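<p>The \(2\)d case is easy to check the same way as before: the flux of \(\hat{\b{r}}/r\) through any closed curve around the origin should be \(2\pi\). A sketch using the boundary of a square, so that nothing is secretly relying on circular symmetry (the details are mine):</p>

```python
import numpy as np

def flux_2d(n=4000):
    """Line integral of (r-hat / r) . n around the boundary of the square [-1,1]^2."""
    t = (np.arange(n) + 0.5) / n * 2 - 1               # midpoints along each side
    ds = 2.0 / n
    total = 0.0
    for axis, sign in [(0, 1.0), (0, -1.0), (1, 1.0), (1, -1.0)]:
        pts = np.zeros((2, n))
        pts[axis] = sign                                # the side x = +/-1 or y = +/-1
        pts[1 - axis] = t
        r2 = np.sum(pts ** 2, axis=0)
        total += sign * np.sum(pts[axis] / r2) * ds     # (r / r^2) . n, with n = sign * e_axis
    return total

print(flux_2d())   # ~ 2 pi, for any curve enclosing the origin
```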
<p>Note that you can totally compute a 2-divergence in a plane in \(\bb{R}^3\), or a 3-divergence in \(\bb{R}^4\), etc. I guess we could write it as \(\del_{xy} \cdot \b{F}\). I do vaguely recall seeing that object in formulas in my life but can’t remember where.</p>
<p>In fact this construction works in \(\bb{R}^1\) also, but it’s kinda weird: the 1d version of \(\hat{\b{r}}/r^2\) in \(\bb{R}^3\) and \(\hat{\b{r}}_{xy}/r_{xy}\) in \(\bb{R}^2\) is \(\hat{\b{r}}_{x}\), the “one dimensional radius function”, also written less strangely as \(\sgn(x) \hat{\b{x}}\). That is, it’s a unit vector pointing in the \(+\b{x}\) direction in the positive numbers and the \(-\b{x}\) direction in the negative numbers. Then:</p>
\[\begin{aligned} \del \cdot (\hat{\b{r}}_x ) &= (\p_x \hat{\b{x}}) \cdot (\sgn(x) \hat{\b{x}}) \\
&= \p_x \sgn(x) \\
&= 2 \delta(x) \\
&= \delta(r_x)
\end{aligned}\]
<p>The factor of \(2\) can be regarded as the “surface area” of a \(0\)-sphere, that is, of the two endpoints of a line segment. Admittedly it’s kinda weird to write \(2 \delta(x) = \delta(r_x)\). One way of thinking about it: a step function \(\theta(x)\) jumps by \(1\) at the origin, covering only half the displacement, so \(\p_x \theta(x) = \delta(x)\); whereas \(\sgn(x)\) jumps by \(2\), from \(-1\) to \(+1\), covering the full displacement, so \(\p_x \sgn(x) = 2\delta(x)\). Hence the factor of \(2\).</p>
<p>Yes, that sounds weird and made up. I’m happy with it mostly because I realized that it gives a satisfying result in \(\bb{R}^3\) as well: recall that in spherical coordinates the radial term of the divergence looks like \(\del \cdot \b{f} = \frac{1}{r^2} \p_r (r^2 f_r)\). Well, suppose \(f = \hat{\b{r}} \theta(r) /r^2\) where once again we imagine that we need the \(\theta(r)\) in there to deal with how \(r\) switches signs at the origin. Then \(\del \cdot \b{f} = \frac{1}{r^2} \p_r [r^2 \frac{\theta(r)}{r^2}] = \frac{1}{r^2} \p_r \theta(r) = \delta(r)/r^2\) is the right value. Not bad, eh?</p>
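<p>The \(\p_x \sgn(x) = 2 \delta(x)\) business can itself be sanity-checked by smoothing \(\sgn(x)\) into \(\tanh(x/\e)\) and integrating its derivative against a test function; the answer tends to \(2 f(0)\) as \(\e \ra 0\). (The test function and the value of \(\e\) are my own choices.)</p>

```python
import numpy as np

eps = 1e-3
x = np.linspace(-0.1, 0.1, 200_001)       # grid much finer than eps
dx = x[1] - x[0]

# d/dx tanh(x/eps): a smoothed version of d/dx sgn(x)
smoothed_delta = (1.0 / eps) / np.cosh(x / eps) ** 2

f = np.exp(-x ** 2)                       # a test function with f(0) = 1
integral = np.sum(smoothed_delta * f) * dx

print(integral)   # ~ 2 = 2 * f(0): the derivative acts like 2 delta(x), not delta(x)
```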
<p>By the way, there is some information about all of this on the Wikipedia article for <a href="https://en.wikipedia.org/wiki/Newtonian_potential">Newtonian potential</a>. They call the function which is the fundamental solution to \(\del^2 f = \delta\) in \(\bb{R}^d\) the “Newtonian Kernel” \(\Gamma\), and write</p>
\[\Gamma(x) = \begin{cases}
\frac{1}{2 \pi} \log r & d = 2 \\
\frac{1}{d(2-d) V_d} r^{2 - d} & d \neq 2
\end{cases}\]
<p>Where \(V_d\) is the <em>volume</em> of the unit ball in \(\bb{R}^d\). That’s a bit confusing. It’s easier to follow with the identity \(V_d = \frac{S_{d}}{d}\), where \(S_d\) is the surface area of the unit sphere in \(\bb{R}^d\). Then this is really</p>
\[\Gamma(x) = \begin{cases}
\frac{1}{2 \pi} \log r & d = 2 \\
\frac{1}{(2-d) S_d} r^{2 - d} & d \neq 2
\end{cases}\]
<p>And its gradient is given by the same formula in all dimensions:</p>
\[\del \Gamma(x) = \frac{1}{S_d} \frac{\hat{\b{r}}}{r^{d-1}}\]
<p>This agrees with what we wrote above, and even works in \(d=1\) if you take \(S_1\), the “surface area” of the unit sphere in \(\bb{R}^1\) (a pair of points), to be \(2\).</p>
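<p>Both claims are easy to check numerically: the standard formula \(S_d = 2\pi^{d/2}/\Gamma(d/2)\) really does give \(S_1 = 2\), and a finite difference of the kernel reproduces \(\p_r \Gamma = 1/(S_d r^{d-1})\) in every dimension. (The function names and test values below are mine.)</p>

```python
from math import gamma, pi, log

def S(d):
    """Surface area of the unit sphere in R^d: 2 pi^(d/2) / Gamma(d/2)."""
    return 2 * pi ** (d / 2) / gamma(d / 2)

def Gamma_kernel(r, d):
    """The Newtonian kernel: fundamental solution of the Laplacian in R^d."""
    if d == 2:
        return log(r) / (2 * pi)
    return r ** (2 - d) / ((2 - d) * S(d))

# dGamma/dr should be 1 / (S_d r^(d-1)) in every dimension, including d = 1
for d in (1, 2, 3, 4):
    r, h = 0.8, 1e-5
    fd = (Gamma_kernel(r + h, d) - Gamma_kernel(r - h, d)) / (2 * h)
    assert abs(fd - 1 / (S(d) * r ** (d - 1))) < 1e-8

print(S(1))   # = 2 (up to rounding): the two boundary points of a segment
```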
<hr />
<h1 id="4-other-shapes-and-multipoles">4. Other Shapes and Multipoles</h1>
<p>In fact every example charge distribution from elementary E&M has an expression as a delta function. It’s just that we’re not very… good… at using delta functions so we don’t normally write them that way.</p>
<p>Here are two classic examples from intro E&M, plus the electric fields that you get from applying Gauss’s law to their symmetries:</p>
<ul>
<li>an infinite line of charge in the \(z\)-direction with linear charge density \(\mu\) has electric field \(\b{E}(\b{x}) = \mu \hat{\b{r}}_{xy} / (2 \pi r_{xy})\).</li>
<li>an infinite plane of charge in the \(xy\) plane with area charge density \(\sigma\) has constant electric field \(\b{E}(\b{x}) = \sigma \sgn(z) \hat{\b{z}}/ 2\). The \(\sgn(z)\) makes this valid on both sides of the plane.</li>
</ul>
<p>In each case it should be that \(\del \cdot \b{E} = \rho(\b{x})\). Evidently:</p>
\[\begin{aligned}
\rho_{\text{line}}(\b{x}) &= \mu \delta(x, y) \\
&= \mu \frac{\delta(r_{xy})}{2 \pi r_{xy}} \\
\rho_{\text{plane}}(\b{x}) &= \sigma \delta(z) \\
&= \sigma \frac{\delta(r_z)}{2} \\
\end{aligned}\]
<p>Here are the forms of \(\rho\), \(\b{E}\), and \(V\) side-by-side:</p>
\[\begin{aligned}
\rho_{\text{line}}(\b{x}) &= \mu \delta(x, y) \\
\b{E}_{\text{line}}(\b{x}) &= \frac{\mu}{2 \pi} \frac{\hat{\b{r}}_{xy}}{ r_{xy}} \\
V_{\text{line}}(\b{x}) &= -\frac{\mu}{2\pi} \ln {r_{xy}} \\
&\\
\rho_{\text{plane}}(\b{x}) &= \sigma \delta(z) \\
\b{E}_{\text{plane}}(\b{x}) &= \frac{\sigma}{2} \sgn(z) \hat{\b{z}} \\
V_{\text{plane}}(\b{x}) &= -\frac{\sigma}{2} | z | \\
\end{aligned}\]
<p>How about some other interesting charge distributions?</p>
<p>A perfect <a href="https://en.wikipedia.org/wiki/Electric_dipole_moment">electric dipole</a> is the limiting case of a positive and negative charge next to each other, so that their net charge is zero but there is a nonzero dipole moment \(\b{p}\) along a certain axis. The potential, electric field, and charge distributions of dipoles are given by the limit as we press two point charges together while keeping the product \(qd\) fixed. But in fact this limit is just a directional derivative:</p>
\[\rho_{\text{dipole}}(\b{x}) = \lim_{\b{d} \ra 0} [(+ q)\delta(\b{x} - \b{d}/2) + (- q)\delta(\b{x} + \b{d}/2)] = -\b{p} \cdot \del [ \delta(\b{x})] = -\p_{\b{p}} \delta(\b{x})\]
<p>So the charge distribution of a dipole is the gradient of a delta-function. That makes sense: the net charge is zero, but there’s two infinite spikes at the origin infinitesimally close to each other, which is what \(\delta'\) looks like.</p>
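<p>This limit is easy to watch happen numerically: integrating a test function against the two-point-charge distribution just samples it at the two charge locations, and as \(d \ra 0\) with \(q = p/d\) the result converges to \(\b{p} \cdot \del f(0)\), exactly what \(-\b{p} \cdot \del \delta\) does under the integral sign. (The test function and numbers are mine.)</p>

```python
import numpy as np

# a test function and its exact z-derivative at the origin
f = lambda x, y, z: np.exp(-(x**2 + y**2 + z**2)) * (1 + z)
df_dz_at_0 = 1.0

p, d = 0.7, 1e-4        # dipole moment along z; separation d, so q = p/d
q = p / d

# integrating f against q delta(x - d/2 zhat) - q delta(x + d/2 zhat)
# just samples f at the two charge locations:
val = q * (f(0, 0, d / 2) - f(0, 0, -d / 2))

print(val)   # ~ p * df/dz(0) = 0.7, matching the -p . grad(delta) distribution
```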
<p>We can immediately write down the electric field and potential also:</p>
\[\begin{aligned}
V_{\text{dipole}}(\b{x}) &= -\p_{\b{p}} [ \frac{1}{4 \pi r}] = \b{p} \cdot [\frac{\hat{\b{r}}}{4 \pi r^2} ]\\
\b{E}_{\text{dipole}}(\b{x}) = -\del V_{\text{dipole}}(\b{x}) &= -\p_{\b{p}}[ \frac{ \hat{\b{r}}}{4 \pi r^2}] = \frac{3 (\b{p} \cdot \hat{\b{r}}) \hat{\b{r}} - \b{p}}{4 \pi r^3} \\
\rho_{\text{dipole}}(\b{x}) = \del \cdot \b{E}_{\text{dipole}}(\b{x}) &= -\p_{\b{p}} [\delta(\b{x})] \\
\end{aligned}\]
<p>(Although see the next aside for a correction to \(\b{E}\): there’s apparently supposed to be a delta function term there also.)</p>
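<p>It’s easy to sanity-check this limit numerically: press two point charges together with \(qd = p\) held fixed and compare against the \(\b{p} \cdot \hat{\b{r}}/4 \pi r^2\) formula. A quick stdlib-Python sketch (the particular field point and spacing here are arbitrary choices):</p>

```python
import math

def v_pair(x, y, z, q, d):
    # potential of +q at (0, 0, +d/2) and -q at (0, 0, -d/2),
    # in units where the Coulomb constant is 1/(4 pi)
    rp = math.sqrt(x*x + y*y + (z - d/2)**2)
    rm = math.sqrt(x*x + y*y + (z + d/2)**2)
    return q/(4*math.pi*rp) - q/(4*math.pi*rm)

def v_dipole(x, y, z, p):
    # V = p . rhat / (4 pi r^2), with the moment p along zhat
    r = math.sqrt(x*x + y*y + z*z)
    return p * (z/r) / (4*math.pi*r*r)

p = 1.0
point = (0.3, 0.4, 0.5)
exact = v_dipole(*point, p)
# press the charges together, holding q*d = p fixed
approx = v_pair(*point, q=p/1e-4, d=1e-4)
err = abs(approx - exact) / abs(exact)
```

The relative error scales like \(d^2/r^2\), as you’d expect from the next (quadrupole) term of the expansion.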
<aside id="dipole" class="toggleable" placeholder="<b>Aside</b>: The Dipole Field Discrepancy <em>(click to expand)</em>">
<p>By the way. While we’re talking about dipoles and delta functions. Remember how the dipole term in \(\b{E}\) was the second derivative of \(\frac{1}{4 \pi r}\)?</p>
\[\b{E}_{\text{dipole}} = \frac{3 (\b{p} \cdot \hat{\b{r}}) \hat{\b{r}} - \b{p}}{4 \pi r^3}\]
<p>It turns out there is some debate in the physics world about whether this should have a delta function term attached to it and what the coefficient should be:</p>
\[\b{E}_{\text{dipole (corrected?)}} \stackrel{?}{=} \frac{1}{4 \pi } [ \frac{3 (\b{p} \cdot \hat{\b{r}}) \hat{\b{r}} - \b{p}}{ r^3}] - \frac{1}{3} \b{p} \delta(\b{x})\]
<p>Griffiths and Jackson, the pre-eminent textbooks, both say it should look like that. The argument is that if you integrate the electric field \(\int \b{E}(\b{x}) d^3 \b{x}\) over a region containing a dipole, it is off: you should get that the total field is \(-\frac{1}{3} \b{p}\), but instead you get that it’s zero as long as you exclude the origin. (The \(\frac{1}{3}\) is really \(\frac{1}{4 \pi} \times \frac{4 \pi}{3}\), the second term being the volume of a unit sphere.)</p>
<p>But when you go looking to read about this correction, people are pretty polarized (no pun intended). <a href="https://iopscience.iop.org/article/10.1088/0143-0807/28/2/012/meta">This</a> delightful paper by Andre Gsponer (not a typo) argues that the problem is that nobody is very good at using the \(r = \| \b{r} \|\) variable, which (as I also noticed earlier) has a derivative of \(\sgn(r)\) at \(r = 0\); hence, its second derivative produces a delta function at the origin. In particular, they argue that the actual potential of a point charge goes as</p>
\[V(\b{x}) = \frac{1}{4 \pi r} \sgn(r)\]
<p>Or equivalently:</p>
\[V(\b{x}) = \frac{1}{4 \pi \| r \|}\]
<p>since \(r \, \sgn (r) = \frac{r}{\sgn (r)} = \| r \|\). The \(\sgn(r)\) hangs out even though it’s always positive in order to give a correct derivative later.</p>
\[\begin{aligned}
\del \frac{1}{\| r \|} &= \p_r (\frac{1}{r} \, \sgn (r)) \\
&= - \frac{\hat{\b{r}}}{r^2} \sgn(r) + 2 \frac{\hat{\b{r}}}{r} \delta(r) \\
\del^2 \frac{1}{\| r \|} &= \frac{1}{r^2} \p_r [r^2 (- \frac{1}{r^2} \sgn(r) + 2 \frac{1}{r} \delta(r))] \\
&= \frac{1}{r^2} \p_r [- \sgn(r) + 2 r \delta(r)] \\
&= \frac{1}{r^2} [- \delta(r) + \cancel{2 \delta(r) + 2 r \delta'(r)}] \\
&= - \frac{1}{r^2} \delta(r)
\end{aligned}\]
<p>(Note that the radial part of the divergence is given by \(\del \cdot f = \frac{1}{r^2} \p_r(r^2 f_r)\), and also that \(x \delta'(x) = - \delta(x)\).)</p>
<p>The dipole version is:</p>
\[\begin{aligned}
\del \p_{\b{p}} \frac{1}{\| r \|} &= \del [ - \frac{\b{p} \cdot \b{\hat{r}}}{r^2} \sgn (r)] \\
&= \frac{3 (\b{p} \cdot \b{r})(\b{r}) - r^2\b{p} }{r^5} \sgn(r) - \frac{(\b{p} \cdot \b{r}) \b{\hat{r}}}{r^3} \delta(r) \\
&= \frac{3 (\b{p} \cdot \b{r})(\b{r}) - r^2\b{p} }{r^5} \sgn(r) - \frac{\b{p}}{r^2} \delta(r) \\
\end{aligned}\]
<p>It’s that last term \(- \frac{\b{p}}{r^2} \delta(r)\) which gives the discrepancy: when integrated over a sphere the \(1/r^2\) cancels out the \(r^2\) integration factor so the result is just the volume of the sphere, \(\frac{4 \pi}{3}\), leading to \(-\frac{\b{p}}{3} \delta(r)\). So there you go. Apparently there should be delta functions on \(\b{E}\) fields also, and it’s the missing \(\sgn(r)\)s that are causing us to lose track of our deltas. Who knew?</p>
<p>Also, fun fact: apparently Jackson, who wrote that one textbook everyone knows, also published a <a href="http://cds.cern.ch/record/118393?ln=en">paper</a> arguing that the fact that <em>intrinsic</em> dipoles have a different delta function term (\(+ \frac{8 \pi}{3}\) instead of \(- \frac{4 \pi}{3}\), he says) compared to dipoles that are the limit of two monopoles shows that distant stars must have magnetic dipoles (that is, circulating electric currents) rather than magnetic monopoles in them, or they’d have a 42cm spectral line instead of a 21cm spectral line. Weird. I didn’t really follow it.</p>
<p>There are some other weird papers around the subject:</p>
<ul>
<li><a href="https://arxiv.org/pdf/1604.01121.pdf">This</a> paper by Edward Parker discusses various ways to get the terms in Jackson’s argument.</li>
<li><a href="https://pubs.aip.org/aapt/ajp/article-abstract/51/9/826/1043129/Some-novel-delta-function-identities?redirectedFrom=fulltext">Some novel delta‐function identities</a> by Charles Frahm derives some of these equations with explicit calculations in indexes.</li>
<li><a href="https://arxiv.org/abs/1001.1530">Comment on “Some novel delta-function identities”</a> by Jerrold Franklin thinks that Frahm did it wrong and does it a different way. They do explicitly claim that \(-\p^2 (\frac{1}{r}) = 4 \pi \hat{\b{x}}^{\o 2}\delta(\b{x})\), though, and that everyone else has been integrating over the angular dependence implicitly.</li>
<li>And then there’s <a href="https://arxiv.org/abs/1308.2262">Comment on “Comment on `Some novel delta-function identities”</a> by Yunyun Yang and Ricardo Estrada… but unfortunately ArXiv doesn’t have the pdf. I think they took it down because it was an older version and they changed the name later: the actual paper is called <a href="https://repository.lsu.edu/cgi/viewcontent.cgi?article=1282&context=mathematics_pubs">Distributions in spaces with thick points</a>, which deals with everything more rigorously than I care for and honestly gets crazy in how complex it is, defining distributions on certain surfaces and a new kind of “thick” delta functions. Why is figuring out what happens at \(r=0\) in \(\bb{R}^3\) so hard?</li>
</ul>
<p>Math is horrifying, but this chain of commentaries is kinda funny. Out of all of these I think the \(\frac{1}{r} \ra \frac{1}{r} \sgn(r)\) trick is the most useable. Probably best to stay away from “thick distributions” for now.</p>
<p>In summary:</p>
\[\b{E}_{\text{dipole}} \stackrel{?}{=} \frac{1}{4 \pi } [ \frac{3 (\b{p} \cdot \hat{\b{r}}) \hat{\b{r}} - \b{p}}{ r^3}] + \begin{cases} - \frac{1}{3} \b{p} \delta(\b{x}) & \\ + \frac{2}{3} \b{p} \delta(\b{x}) \end{cases} \text{ (depending on who you ask)}\]
</aside>
<p>Another way of looking at dipoles is to consider manually placing a bunch of charges at a distance \(h\) apart and then taking \(h \ra 0\). Write \(\Delta_\b{v}\) for a finite difference at a distance \(h\) along \(\b{v}\): for instance \(\Delta_{\b{v}} f(\b{x}) = f(\b{x} + \b{v} h) - f(\b{x})\). Note that \(\p_\b{v} f(\b{x}) = \lim_{h \ra 0} \frac{1}{h} \Delta_\b{v} f(\b{x})\). Also, we can write \(T_\b{z} f \equiv f(\b{x} + \b{z} h)\), such that \(\Delta_\b{z} = T_\b{z} - 1\) and \(\Delta_\b{z} f = (T_\b{z} - 1) f = T_\b{z} f - f\).</p>
<p>Then a “physical” dipole (where the charges are a small but finite distance apart) is proportional to</p>
\[\Delta_\b{z} \delta(\b{x}) = \delta(\b{x} + \b{z} h) - \delta(\b{x})\]
<p>The infinitesimal dipole charge distribution is then given by \(\rho(\b{x}) = - q \Delta_\b{z} \delta(\b{x})\), which in the limit where \(h \ra 0\) with \(hq = p\) fixed gives</p>
\[\rho_{\text{dipole}}(\b{x}) = (- p ) \p_\b{z} \delta(\b{x}) = (-p \b{z}) \cdot \p \delta(\b{x})\]
<p>A physical quadrupole is given by the “second finite difference” (so, second derivative). We can consider the case along a single axis:</p>
\[\begin{aligned}
\Delta_\b{z}^2 \delta &= (T_\b{z} - 1)^2 \delta \\
&= T_\b{z}^2 \delta - 2 T_\b{z} \delta + \delta \\
&\equiv \delta(\b{x} + 2h \b{z}) - 2 \delta(\b{x} + h \b{z}) + \delta(\b{x})
\end{aligned}\]
<p>In the limit we take \(h \ra 0\) while holding \(q h^2 = Q\) and get \(\rho_{\b{z}\b{z}\text{-quadrupole}} = Q \p_\b{z}^2 \delta = \hat{Q} \cdot \p^2 \delta\) (where \(\hat{Q}\) is a quadrupole tensor which only has a \(\b{z}\b{z}\) component). Or, we can do a \(\b{y}\)-\(\b{z}\) quadrupole:</p>
\[\begin{aligned}
\Delta_\b{y} \Delta_\b{z} \delta (\b{x}) &= (T_\b{y} - 1)(T_\b{z} - 1) \delta \\
&= T_\b{y} T_\b{z} \delta - T_\b{y} \delta - T_\b{z} \delta + \delta \\
&\equiv \delta(\b{x} + h(\b{z} + \b{y})) - \delta(\b{x} + h \b{y}) - \delta(\b{x} + h \b{z}) + \delta(\b{x})
\end{aligned}\]
<p>The limit with \(q h^2 = Q\) is \(\rho_{yz\text{-quadrupole}} = Q \p_y \p_z \delta(\b{x}) = \hat{Q} \cdot \p^2 \delta\) (where \(\hat{Q}\) is now a quadrupole tensor which only has a \(yz\) component).</p>
<p>This construction is nicely easy to generalize, for instance to any charge distribution that’s a mix of points and multipoles at any separation from each other.</p>
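<p>The finite-difference-to-derivative limit underlying all of this is easy to check numerically on an ordinary function standing in for the test function that \(\delta\) would be integrated against. A quick sketch (the choice of \(\sin\) and the step size are arbitrary):</p>

```python
import math

def Delta2(f, x, h):
    # (T - 1)^2 f = f(x + 2h) - 2 f(x + h) + f(x)
    return f(x + 2*h) - 2*f(x + h) + f(x)

# Delta^2 f ~ h^2 f''(x) as h -> 0, which is the sense in which
# q Delta^2 delta -> (q h^2) d^2 delta = Q d^2 delta
f, fpp = math.sin, lambda x: -math.sin(x)
x, h = 0.7, 1e-4
approx = Delta2(f, x, h) / h**2
err = abs(approx - fpp(x))
```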
<p>We can also make lines and planes and other shapes out of multipoles. For instance a “line of dipoles” looks like a positively-charged line infinitesimally close to a negatively-charged line. The result is just that we take an additional \(-\p\) of every term, which is equivalent to forcing the two charged surfaces to be \(d \ra 0\) apart with opposite signs while holding the product \(\mu d\) constant. For instance a line of charge on the \(z\) axis has charge distribution \(\rho_{\text{line}} = \mu \delta(x,y)\). A dipole along the \(x\)-axis made out of two of these has charge distribution</p>
\[\rho_{\text{line of dipoles}} = \lim_{h \ra 0, h\mu = p} [\rho_{\text{line}}(\b{x} + h \hat{x}) - \rho_{\text{line}}(\b{x})] = p \p_x \delta(x,y)\]
<p>More generally, the <a href="https://en.wikipedia.org/wiki/Multipole_expansion">multipole distribution</a> of a potential gives a Taylor expansion of \(V\) away from the charges, in terms of the increasingly higher-order moments \((q, \b{p}, \hat{Q}, \ldots)\) of the underlying charge distribution.<sup id="fnref:quad" role="doc-noteref"><a href="#fn:quad" class="footnote" rel="footnote">5</a></sup></p>
\[\begin{aligned}
V(\b{x})_{\text{multipole}} &= \frac{1}{4 \pi} [\frac{q}{r} + \b{p} \cdot \frac{\hat{\b{r}}}{r^2} + \frac{1}{2} \hat{Q} \cdot \frac{\hat{\b{r}}^{\o 2} }{r^3} + \ldots] \\
&= [q + \b{p} \cdot (-\p) + \frac{1}{2} \hat{Q} \cdot (-\p)^2 + \ldots] \frac{1}{4 \pi r}\\
\end{aligned}\]
<p>Taking \(-\del^2\) of each of these terms is going to give a delta-function derivative of some order.</p>
\[\begin{aligned}
-\del V = \b{E}(\b{x})_{\text{multipole}} &= [q + \b{p} \cdot (-\p) + \frac{1}{2} \hat{Q} \cdot (-\p)^{2} + \ldots] \frac{\hat{\b{r}}}{4 \pi r^2} \\
-\del^2 V = \rho(\b{x})_{\text{multipole}} &= [q + \b{p} \cdot (-\p) + \frac{1}{2} \hat{Q} \cdot (-\p)^{2} + \ldots] \delta^3(\b{x})
\end{aligned}\]
<hr />
<h2 id="5-other-powers-of-r">5. Other powers of \(r\)</h2>
<p>The multipole examples imply that in general, there are lots of objects in \(\bb{R}^3\) that have delta function divergences and it’s not just \(\hat{\b{r}}/r^2\), but the results are going to involve <em>derivatives</em> of delta functions instead… which are even harder to detect with the usual implementations of divergence.</p>
<p>For instance we can compute \(\del \cdot \hat{\b{r}}/r^3\) in two ways. Everywhere except the origin, we can use \(\del \cdot f(r) = \frac{1}{r^2} \p_r (r^2 f_r)\) to get</p>
\[\del \cdot \frac{\hat{\b{r}}}{r^3} = - \frac{1}{r^4}\]
<p>And around the origin we use the integral definition of divergence:</p>
\[\begin{aligned}
\del \cdot \frac{\hat{\b{r}}}{r^3} &= \lim_{R \ra 0} \oint_{R} \frac{1}{r} \frac{d \Omega}{ r^2} \\
&= \frac{4 \pi }{r} \delta^3(\b{x}) \\
&= \frac{\delta(r)}{r^3} \\
\end{aligned}\]
<p>It seems like we should be able to make the delta function into a derivative, similar to what showed up in the multipole distribution. But it’s a little weird. Normally we can replace \(\delta/x^n\) with \(\frac{(-1)^n}{n!} \delta^{(n)}\). But it seems like the identity is probably a little bit different in radial coordinates, since after all we expect this to be true:</p>
\[\frac{4 \pi }{r} \delta^3(\b{x}) = \frac{1 }{r} \frac{\delta(r)}{ r^2} = \frac{- \delta'(r)}{r^2}\]
<p>That is, \(\frac{1}{r^3}\) should give a <em>first</em> derivative, not a <em>third</em> derivative:</p>
\[\frac{4 \pi }{r} \delta^3(\b{x}) = \frac{1 }{r} \frac{\delta(r)}{ r^2} \stackrel{!}{\neq} \frac{- \delta^{(3)}(r)}{3!}\]
<p>The problem, I presume, is basically that \(\p_r \delta(r)\) is a weird object, because in an integral against a test function \(\< -\p_r \delta(r), f \>\), the normal integration-by-parts that lets us move the derivative across doesn’t work: \(\< -\p_r \delta(r), f \> \neq \< \delta, \p_r f \>\): since the radial integral has bounds \((0,r)\), we <em>can’t</em> ignore the boundary. This integration by parts is what justifies \(\delta(x)/x = -\p_x \delta(x)\) normally, since \(\< \delta(x)/x, f \> = \< \delta, f/x \> = - f'(0)\) (in a principal-value sense?). Therefore it is <em>probably</em> best to leave \(\delta(r) / r^3\) as-is instead of trying to turn it into a radial derivative.</p>
<p>Nevertheless I’m pretty sure there are ways to do it, but it’s a lot more than I want to figure out right now. Roughly speaking, though, we can expect that a term like \(\delta(r)/r^k\) is going to turn into a delta-function derivative comparable to \(\frac{1}{r^2} \frac{(-1)^{k-2}}{(k-2)!} \delta^{(k-2)}\): that is, it will act like a \((k-2)\)’th delta-derivative with an extra factor of \(\frac{1}{r^2}\) attached. But I hope to figure out the actual details in a future article.</p>
<hr />
<h2 id="6-curl-and-magnetic-fields">6. Curl and Magnetic Fields</h2>
<p>One last question. How does this work for magnetism and curl?</p>
<p>The equivalent Maxwell equation is Ampère’s law, which establishes that the curl of the <em>magnetic</em> field is proportional to the current density (in units with \(\mu_0 = 1\)):</p>
\[\del \times \b{B} = \b{J}\]
<p>The integral form is:</p>
\[\oint_{\p A} \b{B} \cdot d \ell = \iint_{A} \b{J} \cdot d\b{A}\]
<p>Like divergence, there’s an integral form for the curl, which is basically the same idea except that it is computed in a plane instead of over a volume. The component along the normal \(\b{u}\) of a plane is given by:</p>
\[(\del \times \b{F} )\cdot \hat{\b{u}} = \lim_{A \ra 0} \frac{1}{\| A \|} \oint_{\p A} \b{F} \cdot d \ell\]
<p>We can use this to justify delta-function formulas for currents on various surfaces, although I’m going to skip most of the steps. The equivalent identity is going to be for a current which is entirely concentrated in a line, which we’ll assume is at the origin of the \((x,y)\) plane and directed up the \(z\) axis.</p>
\[\b{J} = j \hat{\b{z}} \delta(x,y) = j \hat{\b{z}} \frac{\delta(r_{xy})}{2 \pi r_{xy}}\]
<p>(That’s in \((r_{xy}, \theta, z)\) cylindrical coordinates; the \(\frac{1}{2 \pi r_{xy}}\) factor is the same as the one for the 2d divergence up above.)</p>
<p>Of course the magnetic field due to an infinitely thin wire is a basic textbook example, so we know immediately what function has this as its curl:</p>
\[\begin{aligned}\b{B} &= j\frac{\hat{\theta}}{2 \pi r_{xy}} \\
\b{J} = \del \times \b{B} &= j \hat{\b{z}} \delta(x, y) \end{aligned}\]
<p>The \(\hat{\theta}/r_{xy}\) vector field is the classic ‘twist around the origin’ vector field that points along \(\hat{\theta}\) and at a right angle to \(\hat{\b{r}}_{xy}\) everywhere. It might look more familiar in cartesian coordinates:</p>
\[\hat{\theta} = \frac{x \hat{\b{y}} - y \hat{\b{x}}}{\sqrt{x^2 + y^2}} = \frac{x \hat{\b{y}} - y \hat{\b{x}}}{r_{xy}}\]
<p>Then its curl is:</p>
\[\del \times \frac{\hat{\theta}}{r_{xy}} = 2 \pi \hat{\b{z}} \delta(x, y) = \hat{\b{z}} \frac{ \delta(r_{xy})}{r_{xy}}\]
<p>That’s neat, I guess.</p>
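<p>Both halves of that equation can be checked numerically: the curl of \(\hat{\theta}/r_{xy}\) vanishes away from the origin, while the circulation around any loop enclosing the origin is exactly \(2\pi\), which is what the \(2 \pi \hat{\b{z}} \delta(x, y)\) means in integral form. A quick stdlib-Python sketch (the sample point and loop radius are arbitrary):</p>

```python
import math

def F(x, y):
    # the 'twist' field theta_hat / r_xy = (x yhat - y xhat)/(x^2 + y^2)
    r2 = x*x + y*y
    return (-y/r2, x/r2)

def curl_z(x, y, h=1e-5):
    # z-component of curl by central differences: d(Fy)/dx - d(Fx)/dy
    dFy_dx = (F(x + h, y)[1] - F(x - h, y)[1]) / (2*h)
    dFx_dy = (F(x, y + h)[0] - F(x, y - h)[0]) / (2*h)
    return dFy_dx - dFx_dy

# away from the origin the curl vanishes...
away = curl_z(0.6, -0.8)

# ...but the circulation around a loop enclosing the origin is 2 pi:
# that's the delta function at the origin, seen in integral form
n, R = 10000, 2.0
circ = 0.0
for k in range(n):
    t = 2*math.pi*(k + 0.5)/n
    x, y = R*math.cos(t), R*math.sin(t)
    fx, fy = F(x, y)
    dx, dy = -R*math.sin(t)*(2*math.pi/n), R*math.cos(t)*(2*math.pi/n)
    circ += fx*dx + fy*dy
```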
<p>We can also do the magnetic field due to a single magnetic dipole (an infinitesimal magnet, or loop of current, or classically-interpreted particle with spin) with magnetic dipole moment \(\b{m}\). We’ll use Gsponer’s notation of including a \(\sgn(r)\) term and see if it gives us some good delta function terms. The vector potential is:</p>
\[\b{A} = \frac{1}{4\pi} \frac{\b{m} \times \b{r}}{r^3} \sgn(r)\]
<p>The magnetic field is (yes, definitely had to look up some identities for this):</p>
\[\begin{aligned}
4 \pi \b{B} = 4 \pi \del \times \b{A} &= \del \times [(\b{m} \times \frac{\b{r}}{r^3}) \sgn(r)] \\
&= [(- \b{m} \cdot \del) \frac{\b{r}}{r^3} + \b{m} (\cancel{\del \cdot \frac{\b{r}}{r^3}})] \sgn(r) - \frac{\b{m} \times \b{r}}{r^3} \times \del \sgn(r) \\
&= [\b{m} \cdot (-\frac{1}{r^3} + \frac{3 \b{r}^{\o 2}}{r^5}) ] \sgn(r) - \frac{\b{m} \times \b{r}}{r^3} \times \hat{\b{r}} \delta(r) \\
\b{B} &= \frac{3 (\b{m} \cdot \b{r}) \b{r} - r^2 \b{m}}{4 \pi r^5} \sgn(r) + (\hat{\b{r}} \times \b{m} \times \hat{\b{r}}) \frac{\delta(r)}{4 \pi r^2}
\end{aligned}\]
<p>(Since we included the \(\sgn\) term that should track the delta functions for us, it seemed like \(\del \cdot \frac{\b{r}}{r^3} = \del \cdot \frac{\hat{\b{r}}}{r^2}\) could be ignored now.) The latter term is the delta-function correction to the magnetic dipole field. The internet tells me that its integral over space is \(+\frac{8 \pi}{3} \b{m}\), compared to \(-\frac{4 \pi}{3} \b{p}\) for the scalar dipole field, and, as mentioned earlier, Jackson says this is responsible for the specific wavelength in the hyperfine splitting of hydrogen. Weird.</p>
<hr />
<h2 id="7-summary">7. Summary</h2>
<p>Most of these equations are the same toy examples from an intro electromagnetism course, written in a different way. But it is satisfying to see them written “explicitly”, which is what the delta functions let us do, instead of “working around” the delta function formulation by computing with e.g. Gauss’s law. I think things would have been easier to learn, back then, if the delta-function forms of these objects were made explicit from the start.</p>
<p>For posterity here’s a summary of the things we’ve talked about:</p>
<p><strong>Delta Functions in Radial Coordinates</strong></p>
\[\begin{aligned}
\delta^n(\b{x}) &= \delta(r_n)/ (S_{n-1} r^{n-1}) \\
\delta^3(\b{x}) &= \delta(r_3)/4 \pi r^2 \\
\delta^2(\b{x}) &= \delta(r_2)/2 \pi r \\
\delta(x) &= \delta(r_1)/2 \\
\end{aligned}\]
<p>For the \(1d\) case, recall that the “0-sphere” is a line segment whose “surface” area is usefully understood to be \(S_0 = 2\), giving \(\delta(x) = \frac{1}{2}\delta(r)\). That does seem a bit weird—why are we chopping our delta function in two?—but it does seem to work.</p>
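<p>The \(n = 3\) row is easy to check with a Gaussian regularization of \(\delta^3(\b{x})\) (my arbitrary choice of regularization): its integral over all of space is \(1\), and integrating it against a radial test function just evaluates that function at \(r = 0\), which is exactly what \(\delta(r_3)/4 \pi r^2\) claims:</p>

```python
import math

def delta3_eps(r, eps):
    # 3d Gaussian regularization of delta^3(x), as a function of radius
    return math.exp(-r*r/(2*eps*eps)) / (2*math.pi*eps*eps)**1.5

def against(phi, eps, rmax=0.1, n=100000):
    # integral of phi(r) * delta3_eps(r) * 4 pi r^2 dr over (0, rmax)
    h = rmax/n
    return sum(phi((k + 0.5)*h) * delta3_eps((k + 0.5)*h, eps)
               * 4*math.pi*((k + 0.5)*h)**2 for k in range(n)) * h

eps = 1e-3
# total integral is 1, and a radial test function gets evaluated at r = 0
total = against(lambda r: 1.0, eps)
at_zero = against(math.cos, eps)
```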
<p><strong>Integral Form of Divergence and Curl</strong></p>
<p>Divergence in general is given by</p>
\[\del \cdot \b{F} = \lim_{V \ra 0} \frac{1}{\| V \|} \oint_{\p V} \b{F} \cdot d \b{n}\]
<p>Curl is given by</p>
\[(\del \times \b{F} )\cdot \hat{\b{u}} = \lim_{A \ra 0} \frac{1}{\| A \|} \oint_{\p A} \b{F} \cdot d \ell\]
<p>for any plane with normal \(\b{u}\); choose \(\b{u} = \{ \b{x}, \b{y}, \b{z} \}\) to get the usual vector projections.</p>
<p><strong>Functions whose divergence/curl/exterior derivative are delta functions</strong></p>
<p>In \(\bb{R}^n\) the divergence’s integrand includes a factor of \(r^{n-1}\) from the coordinates, while all the angular coordinates naturally integrate to \(S_{n-1}\). Therefore it’s \(1/r^{n-1}\) that cancels the radial part out and produces a delta function at the origin:</p>
\[\del \cdot \frac{\hat{\b{r}}}{r^{n-1}} = S_{n-1} \delta^n (\b{x}) = \frac{\delta(r)}{r^{n-1}}\]
<p>Which in \(\bb{R}^3\) is</p>
\[\del \cdot \frac{\hat{\b{r}}}{r^{2}} = 4 \pi \delta^3 (\b{x}) = \frac{\delta(r)}{r^2}\]
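<p>In integral form this says that the flux of \(\hat{\b{r}}/r^2\) out of <em>any</em> closed surface around the origin is exactly \(4 \pi\). Here’s a quick midpoint-quadrature check on a cube (an arbitrary choice of surface, whose faces don’t line up with the spherical symmetry at all):</p>

```python
import math

def flux_through_cube(a, n=200):
    # flux of rhat/r^2 out of the cube [-a, a]^3 by midpoint quadrature;
    # by symmetry, integrate over the top face z = a and multiply by 6
    h = 2*a/n
    total = 0.0
    for i in range(n):
        for j in range(n):
            x = -a + (i + 0.5)*h
            y = -a + (j + 0.5)*h
            r = math.sqrt(x*x + y*y + a*a)
            total += (a / r**3) * h * h   # z-component of rhat/r^2 on z = a
    return 6*total

# the answer is 4 pi regardless of the size of the cube: all of the
# divergence is concentrated in the delta function at the origin
flux1 = flux_through_cube(1.0)
flux2 = flux_through_cube(17.0)
```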
<p>Meanwhile if curl is integrated around a loop in e.g. the \(\b{xy}\) plane, then the integrand includes a factor of the radius \(\rho\) in that plane and is therefore canceled out by \(\rho^{-1}\).</p>
\[\del \times \frac{\hat{\theta}}{\rho} = (0, 0, 2 \pi \delta(x, y)) = (0, 0, \frac{\delta(\rho)}{\rho})\]
<p>The \(n\)-sphere analogs of these formulas can be used to generalize them to higher dimensions, or to charge or current distributions that take lower-dimensional forms like lines or planes of charge.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:laplacian" role="doc-endnote">
<p>Sorry but I am stubbornly opposed to the Laplacian symbol \(\Delta = \del^2\) <a href="#fnref:laplacian" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:spherical" role="doc-endnote">
<p>By the way (because I definitely didn’t know this off the top of my head) you can’t just replace \(\delta(\b{x})\) with \(\delta(r)\). Translating \(\delta(\b{x})\) to spherical coordinates requires some extra care because it has to be true that \(\int_V \delta(x,y,z) d^3 \b{x} = \int_V \delta(r) (r^2 \sin \theta) \, dr \, d \theta \, d\phi\). The two angular integrals integrate to \(4 \pi\), so \(\delta(r)\) has to have a \(\frac{1}{4 \pi r^2}\) to cancel everything out. <a href="#fnref:spherical" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:delta" role="doc-endnote">
<p>In general there are two ways of defining delta functions: either you make them out of a limit of functions you know how to do analysis on and prove that the limit is well-defined, which is this regularization procedure… or you define them to have certain properties by fiat, and then show that they exist. The latter, IMO, is the “right” way. I think the approximations are only to satisfy people who are unnecessarily fixated on classical functions that have definite values at points. (There’s a rather nice book called “Theory of Distributions: A Non-Technical Introduction” by Richards & Youn which I like because of how much it commits to the better approach.) <a href="#fnref:delta" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:derivative" role="doc-endnote">
<p>The sense in which this is a 1d derivative: the \(V\) in \(1/V\) can be written as \(\int dV\), where the integral is over a ball of radius \(\e\). In the numerator the integral is over the boundary of that ball. So divergence is \(\underset{\e \ra 0}{\lim} [\int_{ \p B_\e} f \, dA] / (\int_{B_\e} dV) .\) In one dimension a ball of radius \(\e\) is just a line segment, so this is literally the same as the 1d derivative \(\lim_{\e \ra 0} \frac{f(x + \e) - f(x)}{\e}\). <a href="#fnref:derivative" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:quad" role="doc-endnote">
<p>\(\hat{Q}\) here is the rank-2 <a href="https://en.wikipedia.org/wiki/Quadrupole">quadrupole tensor</a>. Equations using it and higher-order multipoles are best unpacked in index notation: \(\p^2_{ \hat{Q}} \delta(\b{x}) = Q^{ij} \p_i \p_j \delta(\b{x})\). By the way, I haven’t learned a ton about \(\hat{Q}\), and I’m a bit confused about when it ought to have a factor of \(1/2\) or not. It might be a convention. Definitely double-check before using this. <a href="#fnref:quad" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<hr />
<h2><a href="https://alexkritchevsky.com/2023/10/14/delta-fns">Delta Function Miscellanea</a> (2023-10-14)</h2>
<p>Here’s some stuff about delta functions I keep needing to remember, including:</p>
<ul>
<li>the best way to define them</li>
<li>how \(\delta(x)/x = - \delta'(x)\)</li>
<li>possible interpretations of \(x \delta(x)\)</li>
<li>some discussion of the \(\delta(g(x))\) rule</li>
<li>how \(\delta(x)\) works in curvilinear coordinates.</li>
</ul>
<!--more-->
<hr />
<h2 id="1-definitional-stuff">1. Definitional Stuff</h2>
<p><strong>Quibbles about Definitions</strong></p>
<p>I don’t like the way most books introduce delta functions. IMO, if all the ways of defining something give rise to the same properties, then that object “exists” and you don’t really need to define it in terms of another object. Sure, you can construct a delta (distribution) as a limit of Gaussians with a fixed integral or whatever, but why would you? \(\int \delta(x) f(x) \, dx = f(0)\) is just fine. (Well, you do have to include some other properties to ensure that \(\< \delta', f \> = - \< \delta, f' \>\), but that’s not important.)</p>
<p>The most common definition in physics is the definition in terms of the Fourier transform:</p>
\[\delta(k) = \frac{1}{2 \pi}\int e^{-ikx} dx\]
<p>And I would emphasize that that is just an identity, not a definition, similar to \(\sin^2 + \cos^2 = 1\).</p>
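<p>One concrete way to see the identity: truncate the integral at \(\pm L\), which gives the kernel \(\sin(Lk)/\pi k\), and check that it acts like \(\delta(k)\) on a smooth test function as \(L\) grows. A quick numerical sketch (the Gaussian test function and the cutoffs are arbitrary choices):</p>

```python
import math

def dirichlet(k, L):
    # (1/2 pi) * integral over [-L, L] of exp(-i k x) dx = sin(L k)/(pi k)
    if abs(k) < 1e-12:
        return L/math.pi
    return math.sin(L*k)/(math.pi*k)

def integrate(g, a, b, n=200000):
    h = (b - a)/n
    return sum(g(a + (m + 0.5)*h) for m in range(n)) * h

# the kernel is not zero away from k = 0, but it oscillates itself to
# death there, so against a smooth f it picks out f(0) as L -> infinity
f = lambda k: math.exp(-k*k)
val = integrate(lambda k: dirichlet(k, 50.0) * f(k), -10.0, 10.0)
```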
<p>I also don’t care at all about the use of the word “function” vs. “generalized function” vs. “distribution”. For my purposes, everything is a distribution and demanding a value at a point is (possibly) a mistake. I imagine that in the far future we will use the word “function” for all of these things starting in high school and nobody will care.</p>
<hr />
<p><strong>Fourier Transform Interpretation</strong></p>
<p>The Fourier transform of \(f(x)\) is given by:</p>
\[\hat{f}(k) = \int f(x) e^{-ikx} dx\]
<p>One interpretation of the Fourier transform is something like:</p>
<blockquote>
<p>\(e^{-ikx}\) is an orthogonal basis for frequency-space functions. We write \(f\) in this basis as \(\hat{f}(k)\) by projecting \(f(x)\) onto each component of the basis by taking an inner product \(\< f(x), e^{-ikx}\> = \int f(x) e^{-ikx} dx\).</p>
</blockquote>
<p>That’s a pretty good definition, and one that I hold dearly because it took me a while to figure out in college, but I think there’s an even better interpretation waiting in the wings. Something like:</p>
<blockquote>
<p>A function \(f\) is a generic object that doesn’t know anything about our choices of bases. The position-space implementation \(f\) is simply \(f\) written out in the position basis. The Fourier Transform of \(f(x)\) is \(f\) evaluated at \(\hat{k}\), where \(\hat{k}\) is a frequency-value rather than a position value, but the two bases live on equal footing and we can treat either as fundamental.</p>
</blockquote>
<p>It just so happens that it’s implemented as an integral transform. In particular, the transform is kinda like computing \(f \ast \delta(\hat{k})\), where the convolution acts like an operation that projects objects into different bases, whatever that means. We could imagine expressing both \(f\) and \(\delta(\hat{k})\) in a <em>third</em> basis, neither position nor frequency, and that operation should still make sense.</p>
<hr />
<h2 id="2-derivatives-of-delta-act-like-division">2. Derivatives of \(\delta\) act like division</h2>
<p>I always end up needing to look this up.</p>
<p>The rules for derivatives of delta functions are most easily found by comparing their Fourier transforms. Since we know that \(\p_x x^n = n x^{n-1}\) we can compare, using \(\F(x f) = i \p_k \hat{f}\) and \(\F(\p_x f ) = i k \hat{f}\):</p>
\[\begin{aligned}
\F(\p_x x^n ) &= \F(n x^{n-1} ) \\
(ik) (i \p_k)^{n} \delta_k &= n (i \p_k)^{n-1} \delta_k \\
- k \delta_k^{(n)} &= n \delta_k^{(n-1)} \\
\end{aligned}\]
<p>This shows us the relationship between \(\delta_k^{(n)}\) and \(\delta_k^{(n-1)}\). Evidently they differ by a factor of \(-\frac{k}{n}\). Repeating the process (and switching the variables back to \(x\) since we don’t need the Fourier transforms anymore) gives</p>
\[\begin{aligned}
- x \delta^{(n)} &= n \delta^{(n-1)} \\
(-x)^2 \delta^{(n)} &= n (n-1) \delta^{(n-2)} \\
(-x)^3 \delta^{(n)} &= n (n-1) (n-2) \delta^{(n-3)} \\
& \vdots \\
(-x)^n \delta^{(n)} &= (n!) \delta^{(0)} \\
\end{aligned}\]
<p>Rearranging things, we get a bunch of useful identities:</p>
\[\begin{aligned}
\delta' &= - \frac{1}{x} \delta \\
\delta^{(2)} &= \frac{2}{x^2} \delta \\
&\vdots \\
\delta^{(n)} &= \frac{ n!}{(-x)^n} \delta
\end{aligned}\]
<p>And also:</p>
\[\begin{aligned}
\frac{\delta}{x} &= - \delta' \\
\frac{\delta}{x^2} &= \frac{\delta^{(2)}}{2} \\
& \vdots \\
\frac{\delta}{x^n} &= \frac{(-1)^n}{n!} \delta^{(n)}
\end{aligned}\]
<p>Etc. I write this all out because it is easy to get confused by the factorials in there (and as a reference for myself…). Note that if \(\delta(x)\) is replaced with something like \(\delta(x - a)\), then all those factors of \((-x)^n\) become \((-(x-a))^n\).</p>
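<p>These identities can be sanity-checked with a Gaussian regularization \(\delta_\e(x) = e^{-x^2/2\e^2}/\e\sqrt{2\pi}\) (my arbitrary choice; any regularization should do). For instance the \(n = 1\) case, \(x \, \delta'(x) = -\delta(x)\):</p>

```python
import math

def delta_eps(x, eps):
    # Gaussian regularization of the delta function
    return math.exp(-x*x/(2*eps*eps)) / (eps*math.sqrt(2*math.pi))

def d_delta_eps(x, eps):
    # its derivative: delta_eps'(x) = -(x/eps^2) * delta_eps(x)
    return -(x/(eps*eps)) * delta_eps(x, eps)

def integrate(g, a, b, n=200000):
    h = (b - a)/n
    return sum(g(a + (k + 0.5)*h) for k in range(n)) * h

eps = 1e-2
f = lambda x: math.cos(x) + x  # test function: f(0) = 1, f'(0) = 1
# delta' acts like -f'(0) ...
d1 = integrate(lambda x: d_delta_eps(x, eps) * f(x), -1, 1)
# ... and x * delta'(x) acts like -delta(x), i.e. picks out -f(0)
d2 = integrate(lambda x: x * d_delta_eps(x, eps) * f(x), -1, 1)
```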
<p>When you actually go to integrate these against a test function, it reveals an interesting relationship between delta functions and derivatives.</p>
\[\begin{aligned}
\int \frac{ n!}{(-x)^n} \delta f \d x &= \int \delta^{(n)} f \d x \\
&= \int (-1)^n \delta f^{(n)} \d x \\
\frac{n!}{(-0)^n} f(0) &\stackrel{?}{=} (-1)^n f^{(n)}(0) \\
\frac{n!}{0^n} f(0) &\stackrel{?}{=} f^{(n)}(0)
\end{aligned}\]
<p>The left side, of course, is really a <a href="https://en.wikipedia.org/wiki/Principal_value">principal value</a> \(\P \int \frac{ n!}{(-x)^n} \delta f \d x\), which we imagine to mean, basically, “evaluate this at zero but very carefully”. To see what this could mean, imagine that \(f(x)\) has a Taylor series \(f(x) = f_0 + f_1 x + \frac{x^2}{2!} f_2 + \ldots\). Then the left side <em>sorta</em> extracts the \(f_n\) term, because all the lower-order terms like \(\frac{0}{0^n}\) go to infinity (which we ignore?) and all the higher-order terms like \(\frac{0^{n + m}}{0^n} = 0^m\) go to zero.</p>
<p>Somehow this hints at the true magic of delta functions but I don’t quite see it yet.</p>
<hr />
<h2 id="3-multiplications-of-delta-act-like-integrals">3. Multiplications of \(\delta\) act like integrals?</h2>
<p>What about \(x^n \delta(x)\) where \(n > 0\)?</p>
<p>According to the actual rigorous theory of distributions, \(x^n \delta(x) = 0\) for any \(n > 1\), because its integral against a test function is zero. But I don’t believe them. I think there’s more going on here.</p>
<p>To illustrate this point, consider extending the argument of the last section to a function with a Laurent series (a finite number of negative-power terms):</p>
\[f(x) = \ldots + f_{-2} \frac{2!}{x^2} + f_{-1} \frac{1}{x} + f_0 + f_1 x + f_2 \frac{x^2}{2!} + \ldots\]
<p>Then it is fairly clear that we could extract the negative-power terms in the same way:</p>
\[f_{-n} = \P \int \frac{x^n}{n!} \delta f \d x\]
<p>Assuming that, once again, all the powers of zero other than \(0^0 = 1\) “cancel out” somehow. So I would argue that \(\frac{x^n}{n!} \delta(x)\) is extracting <em>residues</em> the same way that \(\frac{n!}{x^n} \delta(x)\) extracts <em>derivatives</em>. It’s very nicely symmetric, if you’re willing to allow that \(x \delta \neq 0\).</p>
<p>What does it mean to extract a residue with a delta function? Well, it means that \(\P \int x \delta f d x\) is zero (or some other value we pretend to equal zero) unless \(f(x) \sim \frac{f_{-1}}{x}\) at that point, in which case it extracts that coefficient \(f_{-1}\). Residues aren’t quite the same thing as integrals, but what seems to happen is that, <em>when</em> you close your integration contours, residues are the only thing that’s left — like how in \(\bb{C}\), a closed integral picks up only the residues inside the integration boundary.</p>
<p>I guess this is useful in two ways. One, it’s the same idea of a residue that you get in complex analysis using the Cauchy integral formula:</p>
\[f_{-1} = \frac{1}{2\pi i} \oint_C f(z) \d z\]
<p>but it’s extracted in a much more intuitive way. I have <a href="/2020/08/10/complex-analysis.html">written before</a> about how the Cauchy integral formula works. The short version is that if you apply Stokes’ theorem it turns into \(\iint \delta(\bar{z}, z) \d \bar{z} \^ \d z\), which relies on the fact that, for mysterious reasons, \(\p_{\bar{z}} \frac{1}{z} = 2 \pi i \delta(z, \bar{z})\).</p>
<p>Two, it makes it a lot easier to see how you would generalize the Cauchy integral formula, the concept of residues, and Laurent series to higher dimensions. Integrating against a delta function in the coordinate you care about — easy. Concocting a whole theory of contour integration — super weird and hard. Works for me.</p>
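<p>The coefficient-extraction business is easy to check numerically with the general formula \(f_n = \frac{1}{2\pi i} \oint_C f(z)/z^{n+1} \, \d z\), using the trapezoid rule on a circle (which happens to be exact for a finite Laurent polynomial, up to rounding):</p>

```python
import cmath, math

def laurent_coeff(f, n, samples=4096, radius=1.0):
    # f_n = (1/2 pi i) * contour integral of f(z)/z^(n+1) dz around the
    # origin, evaluated on a circle by the trapezoid rule
    total = 0j
    for k in range(samples):
        z = radius * cmath.exp(2j*math.pi*k/samples)
        dz = 1j * z * (2*math.pi/samples)
        total += f(z) / z**(n + 1) * dz
    return total / (2j*math.pi)

# f has Laurent series 2/z^2 + 1/z + 3 + 5z around z = 0
f = lambda z: 2/z**2 + 1/z + 3 + 5*z
res = laurent_coeff(f, -1)   # the residue f_{-1}
c2 = laurent_coeff(f, -2)    # the f_{-2} coefficient
```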
<p>The one weakness, of course, is that it’s rather unclear what to do with the fact that, evidently, \(\P \int \frac{x}{x^2} \delta \d x = 0\). Shouldn’t \(\frac{0}{0^2} = \infty\), or something like that? Not sure. But this is just one out of very many instances where it seems like math doesn’t handle dividing by zero correctly, so I guess we can file it away in that category and not worry about it for a while.</p>
<p>To summarize, we claim with some handwaving that:</p>
\[\< \frac{x^n}{n!} \delta, f \> = f^{(-n)}(0)\]
<p>Where the meaning of a “negative derivative” of a function is that it is a residue, ie, the \((-n)\)‘th term in the Laurent series of \(f\) around \(x=0\).</p>
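None of this is rigorous, but the residue-extraction claim is easy to poke at numerically. Here’s a quick sketch (my own construction, using a narrow Gaussian as a stand-in for \(\delta\) and a symmetric grid as a crude principal value), applied to \(f(x) = 3/x + \cos x\), whose residue at \(0\) is \(3\):

```python
import numpy as np

eps = 1e-4
# even point count => symmetric grid that skips x = 0 (poor man's principal value)
x = np.linspace(-1, 1, 400000)
dx = x[1] - x[0]
delta_eps = np.exp(-x**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

f = 3 / x + np.cos(x)   # residue f_{-1} = 3 at the origin

# <x delta, f> should extract the residue: x*delta kills the pole, keeps its coefficient
residue = np.sum(x * delta_eps * f) * dx
assert abs(residue - 3) < 1e-3
```

The smooth \(\cos x\) part contributes nothing (it gets multiplied by \(x \delta\), which integrates to zero), which is exactly the point.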
<hr />
<h2 id="4-deltagx-becomes-a-sum-over-poles">4. \(\delta(g(x))\) becomes a sum over poles</h2>
<p>I always end up needing to look this up too.</p>
<p>Since \(\delta(g(x))\) picks out the value of the integrand at <em>every</em> zero of \(g\), we of course have, via \(u = g(x)\) substitution:</p>
\[\begin{aligned}
\int \delta(g(x)) f(g(x)) \, dg(x) &= \int \delta(u) f(u) \, du \\
&= f(u) \big|_{u = 0} \\
&= f(0) \\
\int \delta(g(x)) f(g(x)) \| g'(x) \| dx &= f(0) \\
\delta(g(x)) &= \sum_{x_0 \in g^{-1}(0)} \frac{\delta(x - x_0)}{\| g'(x_0) \|} \\
\end{aligned}\]
<p>For instance:</p>
\[\delta(x^2 - a^2) = \frac{\delta(x - a)}{2\| a \|} + \frac{\delta(x + a)}{2\| a \|}\]
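This identity is easy to sanity-check numerically, e.g. with a Gaussian nascent delta (a sketch of my own; any regularization should do):

```python
import numpy as np

def delta_eps(u, eps=1e-3):
    # narrow Gaussian standing in for delta(u)
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

a = 1.5
x = np.linspace(-5, 5, 400001)
dx = x[1] - x[0]
f = np.cos(x)

# integral of delta(x^2 - a^2) f(x) dx should give [f(a) + f(-a)] / (2|a|)
lhs = np.sum(delta_eps(x**2 - a**2) * f) * dx
rhs = (np.cos(a) + np.cos(-a)) / (2 * abs(a))
assert abs(lhs - rhs) < 1e-3
```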
<p>Somewhat harder to remember is the multivariable version:</p>
\[\begin{aligned}\int f(g(\b{x})) \delta(g( \b{x})) \| \det \del g(\b{x}) \| d^n \b{x}
&= \int_{g(\bb{R}^n)} \delta(\b{u}) f(\b{u}) d \b{u} \\
\int f(\b{x}) \delta(g (\b{x})) d \b{x} &= \int_{\sigma = g^{-1}(0)} \frac{f(\b{x})}{\| \nabla g(\b{x}) \|} d\sigma(\b{x}) \\
\end{aligned}\]
<p>Where the final integral is in some imaginary coordinates on the zeroes of \(g(\b{x})\).</p>
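To make the second line concrete: take \(g(x,y) = x^2 + y^2 - 1\), so the zero set is the unit circle, \(\| \nabla g \| = 2\) there, and with \(f = x^2\) the surface integral is \(\int_0^{2\pi} \cos^2 \theta / 2 \, d\theta = \pi/2\). A numerical sketch (again assuming a Gaussian regularization of \(\delta\), my own choice):

```python
import numpy as np

eps = 0.05
s = np.linspace(-2, 2, 1601)
dx = s[1] - s[0]
X, Y = np.meshgrid(s, s)

g = X**2 + Y**2 - 1                      # zero set: the unit circle
delta_g = np.exp(-g**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

lhs = np.sum(X**2 * delta_g) * dx * dx   # plane integral of f * delta(g)
rhs = np.pi / 2                          # circle integral of f / |grad g|
assert abs(lhs - rhs) < 0.01
```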
<p>In general there is a whole theory of “delta functions for surface integrals” which I’ve never quite wrapped my head around, but intend to tackle in a later article. Basically there’s a sense in which every line and surface integral, etc, can be modeled as an appropriate delta function. Wikipedia doesn’t talk about it much. There’s a couple lines on the delta function page, but there’s quite a bit more, for some reason, on the page for <a href="https://en.wikipedia.org/wiki/Laplacian_of_the_indicator">Laplacian of the Indicator</a>.</p>
<p>I’d also love to understand the version of this for vector- or tensor-valued functions as well. What goes in the denominator? Some kind of non-scalar object? Weird.</p>
<p>By the way, there is a cool trick which I found in a paper called <a href="https://www.reed.edu/physics/faculty/wheeler/documents/Miscellaneous%20Math/Delta%20Functions/Simplified%20Dirac%20Delta.pdf">Simplified Production of Dirac Delta Function Identities</a> by Nicholas Wheeler<sup id="fnref:wheeler" role="doc-noteref"><a href="#fn:wheeler" class="footnote" rel="footnote">1</a></sup> to derive \(\delta(ax) = \frac{\delta(x)}{\| a \|}\). We observe that \(\theta(ax) = \theta(x)\) if \(a > 0\) and \(\theta(ax) = 1 - \theta(x)\) if \(a < 0\). So we can compute \(\p_x \theta(ax)\) in two different ways:</p>
\[\begin{aligned}
\p_x \theta(ax) &= \p_x \theta(ax) \\
a \theta'(ax) &= \sgn(a) \theta'(x) \\
\delta(ax) &= \frac{1}{\| a \|} \delta(x)
\end{aligned}\]
<p>That paper also observes another property I hadn’t thought about, which is that</p>
\[\delta'(ax) = \frac{1}{a \| a \|} \delta'(x)\]
<p>Basically, the funny “absolute value” business only happens in the derivative of \(\theta(ax)\), not the rest of the chain. There are also ways of deriving the more general properties like the form of \(\delta(g(x))\) by starting from derivatives of \(\theta(g(x))\).</p>
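Both scaling identities are easy to check numerically. A minimal sketch (my own regularization choice, a Gaussian nascent delta and its derivative), using \(f(x) = e^x\) so that \(f(0) = f'(0) = 1\):

```python
import numpy as np

eps = 1e-4
x = np.linspace(-1, 1, 400001)
dx = x[1] - x[0]

def delta_eps(u):
    return np.exp(-u**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))

def ddelta_eps(u):  # derivative of the Gaussian nascent delta
    return -u / eps**2 * delta_eps(u)

f = np.exp(x)  # f(0) = 1, f'(0) = 1
for a in (2.0, -3.0, 0.5):
    # delta(ax) = delta(x) / |a|, so it integrates f to f(0)/|a|
    assert abs(np.sum(delta_eps(a * x) * f) * dx - 1 / abs(a)) < 1e-3
    # delta'(ax) = delta'(x) / (a|a|), so it integrates f to -f'(0)/(a|a|)
    assert abs(np.sum(ddelta_eps(a * x) * f) * dx + 1 / (a * abs(a))) < 1e-3
```

Note that the \(a < 0\) case is where the \(\sgn(a)\) business actually bites.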
<hr />
<h2 id="5-delta-is-weird-in-other-coordinate-systems">5. \(\delta\) is weird in other coordinate systems</h2>
<p>I am often reminding myself how \(\delta\) acts in spherical coordinates.</p>
<p>It is useful to think about \(\delta\) as being defined like this:</p>
\[\delta(x) = \frac{1_{x =0 }}{dx}\]
<p>In the sense that it is designed to perfectly cancel out \(dx\) terms in integrals. It’s \(0\) everywhere, except at the origin where it perfectly cancels out the \(dx\). Point is, \(\delta\) always transforms like the <em>inverse</em> of how \(dx\) transforms. If you write \(dg(x) = \| g'(x) \| dx\), then of course \(\delta\) transforms as</p>
\[\delta(g(x)) = \frac{\delta(x - g^{-1}(0))}{\| g'(x) \|}\]
<p>This, at least, makes it easy to figure out what happens in other coordinate systems.</p>
<p>By the way. The notation \(\delta^3(\b{x})\) customarily means that the function is <em>separable</em> into all the individual variables: \(\delta^3(\b{x}) = \delta(x) \delta(y) \delta(z)\). In other coordinate systems this <em>doesn’t</em> work: separating it requires introducing coefficients, as we’re about to see.</p>
<p>Here’s spherical coordinates:</p>
\[\begin{aligned}
\iiint_{\bb{R}^3} \delta^3(\b{x}) f(\b{x}) d^3 \b{x} &= f(x=0,y=0,z=0) \\
f(r=0, \theta=0, \phi=0)
&= \int_0^{2 \pi} \int_0^{\pi} \int_0^\infty \frac{\delta(r, \theta, \phi)}{r^2 \sin \theta} f(r, \theta, \phi) r^2 \sin \theta \, dr \, d \theta \, d \phi \\
&= \int_0^{\pi} \int_0^\infty \frac{\delta(r, \theta)}{2 \pi r^2 \sin \theta} f(r, \theta, 0) ( 2 \pi r^2 \sin \theta) \, dr \, d \theta \\
&= \int_0^\infty \frac{\delta(r)}{4 \pi r^2 } f(r, 0, 0) ( 4 \pi r^2 ) \, dr \\
&= f(0,0,0)
\end{aligned}\]
<p>So:</p>
\[\begin{aligned}
\delta(x, y, z) &= \frac{\delta(r, \theta, \phi)}{r^2 \sin \theta} \\
&= \frac{\delta(r, \theta)}{2 \pi r^2 \sin \theta} \\
&= \frac{\delta(r)}{4 \pi r^2 }
\end{aligned}\]
<p>There is some trickiness to all this, though. Be careful: the \(r\) integral is over \((0, \infty)\) instead of the conventional \((-\infty, \infty)\). Sometimes identities that you’re used to won’t work the same way when you’re dealing with \(\delta(r)\) as a result. Also, it’s very unusual, but not impossible I suppose, to have functions that have a non-trivial \(\theta\) dependence even as \(r \ra 0\). I have no idea what that would mean and I don’t know how to handle it with a delta function.</p>
<p>I’ve occasionally also seen it written in this weird way, where the \(\cos \theta\) factor causes the \(\sin \theta\) in the denominator to disappear.</p>
\[\delta(x,y,z) = \frac{\delta(r, \cos \theta, \phi)}{r^2}\]
<p>Here’s the polar / cylindrical coordinate version:</p>
\[\delta^2(x, y) = \frac{\delta(r, \theta)}{r} = \frac{\delta(r)}{2 \pi r}\]
<p>Evidently in \(\bb{R}^n\), the denominators are related to the surface areas of <a href="https://en.wikipedia.org/wiki/N-sphere">n-spheres</a>.</p>
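The \(\frac{\delta(r)}{4 \pi r^2}\) form can be sanity-checked by taking a tiny 3D Gaussian as \(\delta^3(\b{x})\) and doing the radial integral (a sketch; the Gaussian regularization and test function are my own choices):

```python
import numpy as np

eps = 1e-3
r = np.linspace(0, 0.02, 200001)
dr = r[1] - r[0]

# radial profile of a 3D Gaussian nascent delta (integrates to 1 over R^3)
G = np.exp(-r**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))**3

f = 1.0 + r + r**2          # a radially-symmetric test function, f(0) = 1
integral = np.sum(G * f * 4 * np.pi * r**2) * dr
assert abs(integral - 1.0) < 0.01   # ~ f(0), up to O(eps) smearing
```

So the radial profile really does behave like \(\delta(r) / 4 \pi r^2\), at least against functions regular at the origin.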
<hr />
<h2 id="6-the-indicator-function-i_x--x-deltax">6. The Indicator Function \(I_x = x \delta(x)\)</h2>
<p>Related to \(x^n \delta(x)\) up above…</p>
<p>Since \(\int \delta(x) f(x) \, dx = f(0)\), we could flip this around and say that this is the <em>definition</em> of evaluating \(f\) at \(0\). Or, more generally, we could say that integrating against \(\delta(x-y) \, dx\) is “what it means” to evaluate \(f(y)\).</p>
<p>This is a bit strange though. Why does evaluation require an integral? Maybe we need to define a new thing, the indicator function, which requires no integral:</p>
\[I_x f = f(x)\]
<p>The definition is</p>
\[I_x = \begin{cases}
1 & x = 0 \\
0 & \text{otherwise}
\end{cases}\]
<p>But that probably masks its distributional character. A better definition is that it’s just</p>
\[I_x = x \delta(x)\]
<p>Whereas \(\delta_x\) is infinite at the origin and is defined to <em>integrate</em> to \(1\), the \(I_x\) function is just required to <em>equal</em> one at the origin. Of course, its integral is \(0\). It could also be constructed like this:</p>
\[I_x = \lim_{\e \ra 0^{+}} \theta(x + \e) - \theta(x - \e)\]
<p>(Compare to the \(\delta\) version: \(\delta_x = \lim_{\e \ra 0^{+}} \frac{1}{x} [ \theta(x + \e) - \theta(x - \e)] \stackrel{?}{=} \P(\frac{I_x}{x})\).)</p>
<p>By either definition, \(I_x\) has zero derivative everywhere:</p>
\[\p_x I_x = \delta(x) - \delta(x) = 0 \\
\p_x (x \delta(x)) = \delta(x) + x \delta'(x) = \delta(x) - \delta(x) = 0\]
<p>Compare to \(\p_x \delta(x) = \delta'(x) = -\frac{\delta(x)}{x}\). It sorta seems like there might be some even <em>further</em> generalization of functions which could distinguish this derivative from \(0\), since obviously \(x \delta(x)\) is not, in fact, constant at \(x = 0\). The derivative would be some distribution-like object which has the property that \(I'(x, dx) = \begin{cases} 1 & x + dx = 0 \\ 0 & \text{otherwise} \end{cases}\) … which is weird.</p>
<p>\(I_x\) also has this delta-function-like property:</p>
\[\int I_x \frac{f(x)}{x} dx = f(0)\]
<p>(in a “principal value” sense, of course). It seems natural to consider a whole family of these with any power, \(x^k \delta(x)\). As we know, dividing \(\delta\) by powers of \(x\) produces derivatives, à la \(\delta^{(n)}(x)\): for instance \(\< \frac{\delta(x)}{x}, f \> = f'(0)\). So I would guess that these positive-power \(x^n \delta(x)\) functions produce… integrals? But normally (cf. contour integration) integrals add up contributions from (a) boundaries and (b) poles (and really poles are a kind of boundary, topologically). These \(x^n \delta(x)\) terms only add up the contributions from poles but do nothing at infinity. Maybe that’s because in some sense the \(x^{-n} \delta(x) \propto \delta^{(n)}(x)\) terms are the ones that deal with poles at infinity?</p>
<p>I like this \(I_x\) object. It seems fundamental. Maybe we should just write \(f(x) = I_x f\) all the time.</p>
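Both defining properties — integral zero, but \(\int I_x \frac{f(x)}{x} \, dx = f(0)\) — survive a quick numerical check, at least for the regularization \(I_\e = x \delta_\e(x)\) with a Gaussian \(\delta_\e\) (my own sketch; note this particular regularization has \(I_\e(0) = 0\) rather than \(1\), so it only captures the distributional behavior):

```python
import numpy as np

eps = 1e-4
# even point count => symmetric grid that skips x = 0 (crude principal value)
x = np.linspace(-1, 1, 400000)
dx = x[1] - x[0]

delta_eps = np.exp(-x**2 / (2 * eps**2)) / (eps * np.sqrt(2 * np.pi))
I_eps = x * delta_eps               # regularized I_x = x delta(x)

f = np.cos(x) + x                   # f(0) = 1
assert abs(np.sum(I_eps) * dx) < 1e-8                  # integral of I_x is 0
assert abs(np.sum(I_eps * f / x) * dx - 1.0) < 1e-3    # but it still extracts f(0)
```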
<hr />
<h2 id="7-miscellaneous-breadcrumbs">7. Miscellaneous Breadcrumbs</h2>
<p>Things I want to remember but don’t have much to say about:</p>
<p>There is a thing called a <a href="https://en.wikipedia.org/wiki/Wave_front_set">Wavefront Set</a> that comes from the subfield of “microlocal analysis”. It allows ‘characterizing’ singularities in a way that, for instance, would extract which dimensions a delta function product like \(\delta(x) \delta(y)\) is acting in.</p>
<p>Among other things, the Wavefront Set allows you to say when multiplying distributions together is well-behaved: as I understand it, they have to not have singularities “in the same directions”. \(\delta(x) \delta(x)\), for instance, is not allowed. (I bet \((x \delta)^2\) is, though.) Here’s a <a href="https://arxiv.org/abs/1404.1778">nice paper on the subject</a>.</p>
<p>There are further generalizations of functions called <a href="https://en.wikipedia.org/wiki/Hyperfunction">hyperfunctions</a>, which are instead defined in terms of “the difference of two holomorphic functions across a line” (which can represent e.g. a pole at the origin). Gut reaction: relies on complex analysis, which sounds annoying.</p>
<p>A <a href="https://en.wikipedia.org/wiki/Current_(mathematics)">current</a> is a differential-form distribution on a manifold. Some day I’m going to have to learn about those, but for now, nah, I’m good.</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:wheeler" role="doc-endnote">
<p>This paper also has some interesting, if not entirely comprehensible, things to say about the existence of forward- and backwards- time propagators in QFT wave equation solutions. <a href="#fnref:wheeler" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>