How To Invert Anything By Dividing By Zero

September 25, 2023

(warning: this starts comprehensible and then goes off the rails a bit)

For a generic linear equation like \(ax = b\), solutions seem to always be of the form

\[x = a_{\parallel}^{-1} (b) + a_{\perp}^{-1} (0)\]

regardless of whether \(a\) is “invertible”. Here \(a_{\parallel}^{-1}\) is a sort of “parallel inverse”, in some cases called the “pseudoinverse”, which inverts the invertible part of \(a\). \(a_{\perp}^{-1}\) is the “orthogonal inverse”, called either the nullspace or the kernel of \(a\) depending on what field you’re in, but either way it’s the set of objects for which \(ax = 0\).

This pattern shows up over and over in different fields, but I’ve never seen it really discussed as a general phenomenon. But really, it makes sense: why shouldn’t any operator be invertible, as long as you are willing to have the inverse (a) live in some larger space of objects and (b) possibly become multi-valued?

Here are some examples. Afterwards I’ll describe how this phenomenon gestures towards a general way of dividing by zero.


1. The Pattern

Matrices

Consider a linear system of equations that’s represented by matrix multiplication as

\[A \b{x} = \b{b}\]

If \(A\) is square and invertible, then the unique solution to this equation is given via left-multiplication by \(A^{-1}\):

\[\b{x} = A^{-1} \b{b}\]

If \(A\) is not square, or not invertible, then the system of equations is either underspecified (it has a non-trivial linear subspace of possible solutions) or overspecified (it is unsolvable because there are more linearly-independent constraints than degrees of freedom). If it’s one of those cases and you’re an undergrad, you give up. If you’re a professional you sigh and look up how to solve it or ask Mathematica or NumPy or something, I don’t know, I’m not a professional either; the point is that these can be solvable depending on the nature of \(A\) and \(\b{b}\) but it’s a lot harder.

The general solution uses the Moore-Penrose pseudoinverse \(A^{+}\) which, instead of the condition \(A A^{-1} = I\), obeys the weaker condition \(A A^{+} A = A\) (plus a few symmetry conditions that make it unique). Then the solution, if there is one, is given by

\[\b{x} = A^+ (\b{b}) + (I - A^+ A) \vec{\lambda}\]

where \(\vec{\lambda}\) is a vector of free parameters corresponding to the number of degrees of freedom in the solution space. The solution ends up being of the form \(\b{x} = \b{x}_{\text{one}} + \b{x}_{\text{free}}\), where the first term is one solution and the second term produces all the others. This is quite cool. The first term produces a single actual solution vector that satisfies the \(A \b{x} = \b{b}\) constraint. The second term constructs a matrix, a projector onto the nullspace of \(A\), which maps the \(\vec{\lambda}\) free parameters into that nullspace. On plugging back into \(A \b{x} = \b{b}\), the second term cancels out entirely, since \(A (I - A^+ A) = A - A A^+ A = 0\) by the condition on \(A^+\). Of course, if \(A\) is invertible, then \(A^+ A = I\) and the second term is zero anyway.
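
As a quick numerical sketch of that formula (the matrix, right-hand side, and random seed here are arbitrary illustrative choices): for an underdetermined system, every choice of \(\vec{\lambda}\) still satisfies the constraint.

```python
# Check the formula x = A+ b + (I - A+ A) lambda for an underdetermined system:
# every choice of the free parameters still satisfies A x = b.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))          # 2 equations, 4 unknowns
b = rng.standard_normal(2)

A_pinv = np.linalg.pinv(A)               # Moore-Penrose pseudoinverse
for _ in range(3):
    lam = rng.standard_normal(4)         # any free parameters at all
    x = A_pinv @ b + (np.eye(4) - A_pinv @ A) @ lam
    assert np.allclose(A @ x, b)
```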

It’s easiest to see what \(A^+\) actually is via the SVD representation of \(A\). The SVD writes \(A = U \Sigma V^*\), where \(\Sigma\) is the diagonal matrix of singular values: one positive entry for each subspace \(A\) doesn’t annihilate, and one zero for each subspace it does annihilate. \(A^+\), then, is given by \(V \Sigma^+ U^*\), where \(\Sigma^+\) is \(\Sigma\) (transposed, in the rectangular case) with every non-zero singular value replaced by its reciprocal, \(\Sigma^+ = \text{diag}(1/\sigma_i)\) on those entries and zero elsewhere. So \(A^+\) inverts every direction that \(A\) doesn’t annihilate and sends the rest to zero.
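
And here’s a minimal sketch of that construction: build \(A^+\) by hand from the SVD of a small rank-deficient matrix (chosen arbitrarily) and check it against NumPy’s pinv and the weaker inverse condition.

```python
# Build A+ by hand from the SVD: reciprocate the nonzero singular values,
# leave the zero ones as zeros. The matrix is an arbitrary rank-2 example.
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],              # a multiple of the first row, so rank(A) = 2
              [0., 1., 1.]])

U, s, Vh = np.linalg.svd(A)
s_plus = np.array([1.0 / si if si > 1e-12 else 0.0 for si in s])
A_plus = Vh.T @ np.diag(s_plus) @ U.T    # A+ = V Sigma+ U*

assert np.allclose(A_plus, np.linalg.pinv(A))   # matches NumPy's pseudoinverse
assert np.allclose(A @ A_plus @ A, A)           # the weaker condition A A+ A = A
```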

Anyway, of the instances of this “generalized inverse” thing, this matrix pseudoinverse is the most refined and easiest to understand. But it seems to make sense in a lot of other settings.


Scalars

Once again, if an equation of the form \(ax = b\) can be inverted, it usually looks like this:

\[x = a_{\parallel}^{-1} (b) + a_{\perp}^{-1}(0)\]

The first term is a single solution that satisfies the constraint, and the second term is any displacement ‘orthogonal’ to \(a\), meaning anything satisfying \(a \, (a_{\perp}^{-1}(0)) = 0\), which moves you around within the solution space.

For instance here is how you invert “scalar multiplication”, that is, how you divide by zero.

\[\begin{aligned} ax &= b \\ x &= b/a + \lambda 1_{a=0} \end{aligned}\]

\(\lambda\) is a free parameter and \(1_{a=0}\) an indicator function. If \(a=0\) and \(b=0\) then this works as long as we interpret \(b/a = 0/0\) to equal, say, \(1\) (any finite value will do, since \(\lambda\) absorbs the difference). If \(a=0\) but \(b \neq 0\) then the answer is infinite, which makes sense: there’s really no solution to \(0x = b\) unless \(x\) is an object that acts like \(b/0\). So we can write the solutions as:

\[x = \begin{cases} b/a & a \neq 0 \\ b \times \infty & a = 0, b \neq 0 \\ \text{anything} & a = 0, b = 0 \\ \end{cases}\]

Not that this is especially useful on its own, but it’s one instance of the pattern. In a sense it is the right answer, in that it contains all the information that the true inverse of this operation has to contain — just, represented in a way we don’t really know how to do math with. Also note that this is basically the \(1 \times 1\) version of the matrix pseudoinverse.
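
For what it’s worth, here’s the \(1 \times 1\) “pseudoinverse solve” as a few lines of Python; the function name and the (particular solution, has-a-free-parameter) return convention are just made up for illustration.

```python
# The 1x1 "pseudoinverse solve" for a x = b: one particular solution, plus a
# flag for whether a free parameter exists.
def solve_scalar(a, b):
    if a != 0:
        return b / a, False          # unique solution, no free parameter
    if b == 0:
        return 0.0, True             # 0 x = 0: anything works (one free parameter)
    return float("inf"), False       # 0 x = b != 0: no finite solution ("b times infinity")

print(solve_scalar(2.0, 6.0))   # (3.0, False)
print(solve_scalar(0.0, 0.0))   # (0.0, True): x = 0 + lambda * 1
print(solve_scalar(0.0, 5.0))   # (inf, False)
```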


Vectors

Here’s the \(1 \times N\) version, for a more interesting example.

\[\b{a} \cdot \b{x} = b\]

Then

\[\begin{aligned} \b{x} &= \b{a}_{\parallel}^{-1} \, b + \b{a}_{\perp}^{-1}(0) \\ &= \frac{\b{a}}{\| \b{a} \|^2} \, b + \b{a}_{\perp}^{-1}(0) \end{aligned}\]

The first term is again the “pseudoinverse” of \(\b{a}\), written \(\b{a}^{+} = \b{a}_{\parallel}^{-1} = \frac{\b{a}}{\| \b{a} \|^2}\). It’s like an inverse, but only on the subspace in which \(\b{a}\) is invertible, which for a vector is just one-dimensional. The second term is the “orthogonal inverse” of \(\b{a}\), which is any element of the \((N-1)\)-dimensional nullspace of vectors orthogonal to \(\b{a}\). We could also choose to write that as an \(N \times (N-1)\) matrix whose columns span the nullspace, times a vector of \(N-1\) free parameters, if we wanted to stick with the pseudoinverse form, but I don’t mind leaving it as it is.
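
A quick NumPy sketch of this (the particular \(\b{a}\), \(b\), and random displacement are arbitrary): one particular solution along \(\b{a}\), plus whatever the projector onto \(\b{a}\)’s nullspace produces.

```python
# One particular solution along a, plus anything from the projector onto
# the hyperplane orthogonal to a.
import numpy as np

rng = np.random.default_rng(2)
a = np.array([1.0, 2.0, 2.0])
b = 5.0

x_par = a / np.dot(a, a) * b                          # the "parallel inverse" part
P_perp = np.eye(3) - np.outer(a, a) / np.dot(a, a)    # projector onto the nullspace of a
x = x_par + P_perp @ rng.standard_normal(3)           # plus any orthogonal displacement

assert np.isclose(np.dot(a, x), b)                    # still satisfies the constraint
```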


Functions

Here’s an example with functions.

\[x f(x) = 1\]

The solution is

\[f(x) = \begin{cases} 1/x & x \neq 0 \\ \lambda & x = 0 \end{cases}\]

\(\lambda\) is again a free parameter. We’d kinda like to write these as one equation, and the way we can do it is like this:

\[f(x) = \mathcal{P}(\frac{1}{x}) + \lambda \delta(x)\]

where \(\mathcal{P}\) is the Cauchy principal value and \(\delta(x)\) is a delta distribution. Whereas vectors and matrices forced us to “kick up into” a space with free parameters, functions require us to also “kick up into” the space of distributions. Neither \(\mathcal{P}(\frac{1}{x})\) nor \(\delta(x)\) gives a useful value on its own at \(x=0\), but when used as a distribution (e.g. integrated against a test function) this object really does behave like the solution to \(x f(x) = 1\). This is a pattern: when we invert non-invertible things, we often have to (a) introduce free parameters and (b) “leave the space we’re in” for a larger space that can contain the solution. In this case we leave “function space” for “distribution space” because \(\frac{1}{x}\) has no value at \(x=0\).1
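
To spell out why this really does solve \(x f(x) = 1\), pair it with a test function \(\varphi\) and use nothing but the definitions of \(\mathcal{P}\) and \(\delta\) as distributions:

\[\begin{aligned} \langle x f, \varphi \rangle &= \mathcal{P} \int \frac{x \, \varphi(x)}{x} \, dx + \lambda \int x \, \delta(x) \, \varphi(x) \, dx \\ &= \int \varphi(x) \, dx + 0 \\ &= \langle 1, \varphi \rangle \end{aligned}\]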

What if we have more powers of \(x\)?

\[x^n f(x) = 1\]

We should get something like

\[f(x) = \mathcal{P}(\frac{1}{x^n}) + \lambda_1 \delta(x) + \lambda_2 \delta'(x) + \ldots + \lambda_n \delta^{(n-1)}(x)\]

Since, by essentially the same argument, each of those delta-derivative terms is annihilated by multiplication by \(x^n\). (For \(n \geq 2\) the “principal value” technically becomes a Hadamard finite-part prescription, but the idea is the same.)
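
For instance, here’s the check that \(x^2\) kills \(\delta'\), again by pairing with a test function; the higher delta-derivatives work the same way:

\[\langle x^2 \, \delta'(x), \varphi \rangle = \langle \delta'(x), x^2 \varphi(x) \rangle = - \frac{d}{dx} \Big[ x^2 \varphi(x) \Big]_{x=0} = 0\]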


Differential Operators

Here’s another example:

\[D f = g\]

The natural solution is

\[f = D^{-1}_{\parallel}(g) + D^{-1}_{\perp}(0)\]

What do these terms mean for a differential operator? The second term, again, is an inverse operator that produces the “nullspace” of \(D\), which is to say, produces solutions to the homogeneous equation \(D f = 0\). These are “free wave” solutions to the operator, which come packaged with a (probably) infinite number of free parameters. In general there is also a set of boundary conditions on the solution \(f\), so we’ll need to pick from these free parameters to satisfy the boundary conditions.

Or, we could solve it using a Green’s function for \(D\), by computing \(G = D^{-1}_{\parallel}(\delta)\) and then \(f = G \ast g\).

Or, we could invert the operator by directly inverting in Fourier space, which will end up being the function inversion from the previous section:

\[\hat{f} = \hat{D}_{\parallel}^{-1}(\hat{g}) + \hat{D}_{\perp}^{-1}(0)\]

The \(\hat{D}_{\perp}^{-1}(0)\) term will produce the homogeneous solutions to the differential operator when un-Fourier-transformed.

Example: the Poisson equation \(\del^2 f(\b{x}) = g(\b{x})\) is Fourier transformed into \((i k)^2 \hat{f}(\b{k}) = \hat{g}(\b{k})\) so \(\hat{f}(\b{k}) = \frac{1}{(ik)^2} \hat{g}(\b{k}) + \lambda_1 \delta(k) + O(\frac{1}{k}) \ldots\).2 The third term can be \(\frac{1}{k} \delta(k)\), \(\frac{A\b{k}}{k^2}\), or higher-order tensor combinations as long as the net magnitude is \(O(\frac{1}{k})\).

When untransformed we should get

\[\begin{aligned} f(\b{x}) &= \mathcal{F}^{-1} [\frac{1}{(ik)^2} \hat{g}( \b{k}) + \lambda_1 \delta(k) + O(\frac{1}{k}) \lambda_2 \delta(k)] \\ &= - \int \frac{g(\b{x}_0)}{4 \pi \| \b{x} - \b{x}_0 \|} d \b{x}_0 + \text{<the harmonic functions>} \end{aligned}\]

Although I’m not quite sure how to take those Fourier transforms so I can’t promise it.

Anyway, point is, whatever our generalized inverse operation looks like on derivative operators, it should be related via Fourier transforms to the generalized inverse on functions.
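
Here’s a hedged numerical sketch of the Fourier-space picture, in one dimension with periodic boundary conditions (a much simpler setting than the free-space Poisson problem above, but the same divide-by-\((ik)^2\) move). The \(k=0\) mode is exactly the division by zero: it’s the free parameter, which we just pick by hand.

```python
# A sketch of "invert D by dividing in Fourier space": solve f'' = g on a
# periodic grid. The k = 0 mode is the divide-by-zero; we set it by hand
# (here to 0, i.e. the zero-mean solution). All choices are illustrative.
import numpy as np

N = 256
x = np.linspace(0, 2 * np.pi, N, endpoint=False)
g = np.cos(3 * x)                                     # zero-mean source, so a solution exists

k = np.fft.fftfreq(N, d=2 * np.pi / N) * 2 * np.pi    # integer wavenumbers
g_hat = np.fft.fft(g)

f_hat = np.zeros_like(g_hat)
nonzero = k != 0
f_hat[nonzero] = g_hat[nonzero] / (1j * k[nonzero]) ** 2   # divide where we can
# f_hat at k = 0 stays 0: that's the free parameter lambda_1

f = np.fft.ifft(f_hat).real
assert np.allclose(f, -np.cos(3 * x) / 9)             # the zero-mean solution of f'' = g
```

Picking a different value for the zero mode just shifts \(f\) by a constant, which is the (1D, periodic) version of “plus the harmonic functions”.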


2. Inverting Matrices By Dividing By Zero (in projective coordinates)

There is an interesting way of looking at the procedure for solving a matrix equation \(A\b{x} = \b{b}\). We recast the problem into projective geometry by adding a single “homogeneous” coordinate to the vector space, and then rewriting the equation like this:

\[(A, -\vec{b}) \cdot (\b{x}, 1) = 0\]

\((A, -\b{b})\) is a single \(N \times (N+1)\)-dimensional matrix, and now we’re solving for an \((N+1)\)-component vector instead. That vector will come to us with the last coordinate \(\neq 1\), of course, so we divide \((\lambda \b{x}, \lambda)\) through by \(\lambda\) at the end to find the true value of \(\b{x}\). That division won’t work if \(\lambda = 0\), of course, but in that case we interpret “dividing by zero” to equal “infinity”, so that \((\b{x}, 0) \sim (\infty \b{x}, 0)\). In projective geometry these infinite objects are completely meaningful and useful; for instance the intersection of two parallel lines is a “point at infinity”, and this means that the statement “any two lines intersect in a point” becomes true without any disclaimers. As such the division-by-zero is a feature, not a bug.

The nice thing about this form is that there are no longer two heterogeneous terms, \(a^{-1}_{\parallel}\) and \(a^{-1}_{\perp}\), in the inverse. Writing \(B = (A, -\b{b})\), the new equation is now

\[B (\b{x}, 1) = 0\]

And the only thing we’re concerned with is the nullspace of that matrix. Of course, the matrix is no longer square, so it’s going to have a degree of freedom in its output (which cancels out with the new coordinate that was forced to equal \(1\)). But that’s not a problem: we want to deal with degrees of freedom in a generic way in the output anyway. We’ll just expect to get at least one.
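
Here’s a small numerical sketch of that move, with an invertible \(A\) just to watch the mechanics: append \(-\b{b}\) as an extra column, take the one-dimensional nullspace from an SVD, and divide through by the homogeneous coordinate.

```python
# The projective trick in NumPy: append -b as an extra column, take the
# nullspace of B = (A, -b) from the SVD, and divide by the homogeneous
# coordinate. A and b are arbitrary; A is invertible here to keep it simple.
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([5., 10.])

B = np.hstack([A, -b[:, None]])     # the N x (N+1) matrix (A, -b)
_, _, Vh = np.linalg.svd(B)
v = Vh[-1]                          # spans the one-dimensional nullspace of B
x = v[:-1] / v[-1]                  # dehomogenize

assert np.allclose(A @ x, b)        # recovers the ordinary solution A^{-1} b
```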

The solution to this is best understood in terms of exterior algebra. First let’s walk through the case where \(A\) is invertible, so that \(B^{\^n} \neq \vec{0}\). Then there is a multivector \(B^{\^n}\) which represents every dimension in \(B\), ie, every dimension it does not set to \(0\). Its complement \(B_{\perp} = \star B^{\^ n}\) represents every dimension \(B\) does set to zero. Since \(A\) is invertible (for now), this complement is one-dimensional, and we can compute it:

\[\begin{aligned} B_{\perp} &= \star B^{\^ n} \\ &= \star [(A, -\b{b})^{\^ n}] \\ &= \star (A^{\^(n-1)} \^ (-\b{b}), A^{\^ n}) \end{aligned}\]

Now note that \(\star A^{\^(n-1)} = - \text{adj}(A)\), the adjugate matrix of \(A\), \(\star [A^{\^(n-1)} \^ (-\b{b})] = \text{adj}(A) \b{b}\), and \(\star A^{\^n} = \det(A)\), the determinant. So that gets us to the inverse:

\[\begin{aligned} \lambda (\b{x}, 1) &= \star (\star -\text{adj}(A) \cdot (-\b{b}), \star \det(A)) \\ \b{x} &= \frac{\text{adj}(A)}{\det(A)} \cdot (-\b{b}) \\ &= A^{-1} \b{b} \end{aligned}\]

… Up to some signs that I’m really not sure about. It is very hard to keep track of how signs should work when you try to connect exterior algebra objects to regular linear algebra objects with \(\star\). At some point I’m going to hash it all out for myself so I can stop leaving these disclaimers but I haven’t gotten to it yet. See the aside for some justification, though.
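
Setting the \(\star\) bookkeeping aside, the bottom-line formula \(\text{adj}(A) \, \b{b} / \det(A) = A^{-1} \b{b}\) is easy to check numerically. Here’s a sketch with the adjugate built directly from cofactors, so the signs are genuinely being exercised rather than borrowed from \(A^{-1}\) (and they come out positive):

```python
# Check that adj(A) b / det(A) = A^{-1} b, signs and all, with the adjugate
# built from cofactors rather than from the inverse. A and b are arbitrary.
import numpy as np

def adjugate(A):
    n = A.shape[0]
    C = np.zeros_like(A)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            C[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return C.T                      # adjugate = transpose of the cofactor matrix

A = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 4.]])
b = np.array([3., 5., 9.])

x = adjugate(A) @ b / np.linalg.det(A)
assert np.allclose(A @ x, b)        # agrees with the ordinary inverse
```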

I think of this object as a “negative \(1\)-th wedge power”:

\[A^{-\^ 1} = \frac{ A^{\^(n-1)}}{A^{\^n}}\]

Which is to say, it’s like the \((n-1)\)-th wedge power, but divided through by the \(n\)-th, giving a total “grade” of \(-1\). It’s an object that becomes the inverse matrix if we translate it back into matrix form with \(\star\), but is still meaningful… sorta… for non-invertible matrices.


The step of this calculation that fails if \(A\) is not invertible is the division by \(A^{\^n} = 0\). Hence the problem of inverting matrices is the same thing as the problem of dividing by zero, and a theory of dividing by zero correctly should definitely be a theory of inverting non-invertible matrices also!

What happens when we do that division? We get something like… I guess… that the solution to \(Ax = 1\) is something like \(x = \mathcal{P} ( \frac{1}{A}) + \vec{\lambda} \cdot A_{\perp} \delta(\det A) + \ldots\). But how many \(\lambda\) components should we end up with? Obviously it should come from the actual grade of the solution multivector \(A_{\perp}\). So whatever system of dividing by zero we come up with is going to have to be able to “divide by a zero multivector” also. Comparing against the Moore-Penrose pseudoinverse formula, the number of terms we end up with is related to the multiplicity of the zero at \(\det(A) = 0\). The more “copies of zero”, the more degrees of freedom in the output space that we’ll have to produce. That means that \(\text{diag}(0,1)\) and \(\text{diag}(0, 0)\) should be different objects: the former’s determinant vanishes like a single factor of \(0\) while the latter’s vanishes like \(0^2\), and that difference records how many degrees of freedom are produced in the inverse. The number of copies of zero is going to be the same as the number of \(0\) singular values in the matrix’s SVD.
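
The “number of copies of zero”, at least, is something you can just count; a tiny sketch:

```python
# Counting "copies of zero": the number of zero singular values is the number
# of degrees of freedom the generalized inverse has to produce.
import numpy as np

for A in (np.diag([0.0, 1.0]), np.diag([0.0, 0.0])):
    s = np.linalg.svd(A, compute_uv=False)
    print(A.diagonal(), "->", int(np.sum(s < 1e-12)), "zero singular value(s)")
# diag(0, 1) has one, diag(0, 0) has two: different "orders of zero"
```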

So, why not? Maybe we should be letting ourselves divide by zero in order to be able to invert matrices correctly? I like it, personally. Of course, maybe it immediately runs into problems if you try to make it rigorous — I wouldn’t be surprised to learn that there’s a theorem that says something like “no possible theory of algebra can consistently count orders of zero in this way”; it sounds like just the sort of demoralizing thing mathematicians would come up with.

The value we end up getting should in some sense be the same object as the Moore-Penrose pseudoinverse. So that means that this

\[\b{x} = A^+ (\b{b}) + (I - A^+ A) \vec{\lambda}\]

is definitely another way of writing this

\[\b{x} = A^{- \^ 1} \cdot (-\b{b})\]

which should, when properly formulated, be a formula that solves linear systems of equations “in full generality”.

We can sketch out a way of equating the two representations. We imagine factoring \(A = A_{\parallel} \oplus A_{\perp}\), where the former has \(A_{\parallel}^{\^} \neq 0\) (where \(A^\^\) notationally means the “largest possible wedge power of a non-square matrix”) and the latter is a matrix made of all the columns that are linearly dependent on the former. Then the inverse is going to be \(\b{x} = A_{\parallel}^{-1} (\b{b}) + A_{\perp} \vec{\lambda}\). I’m not sure how to deal with the fact that there might be no solutions rather than \(>1\) solutions, though. Nevertheless it seems like the “inverse of zero” is “a degree of freedom”, which makes a lot of sense. Division by a zero multivector necessarily produces extra \(\lambda\) terms, which connects the exterior algebra formula with the pseudoinverse one.

By the way: the only reason we moved into projective coordinates was to deal with “dividing by zero” and with infinities in a way our math system could handle. But if our math system could handle dividing by zero and producing degrees of freedom correctly on its own, then we wouldn’t have needed any of the projective coordinate stuff. This suggests a general principle: projective coordinates are being used to work around a problem in the underlying calculus.

There’s a lot more to say about this, which I hope to investigate in later articles. For instance dividing by zero seems fun to play around with in other settings. Also, it’s not clear that there’s any good algorithm for these multivector representations anyway. Also, we didn’t consider overdetermined matrices at all, which are probably going to make even more zeroes. Who knows!? We’ll see. But instead, let’s take a look at differential equations again.


3. Integration Constants and Boundary Conditions

The way we solved \(A \b{x} = \b{b}\) in projective coordinates for matrices can give us ideas for solving \(D f = g\) for a differential operator also.

First let’s look at how differential equation inverses produce degrees of freedom in their solutions. This is not something I know at any particular theoretical level, but over the years I’ve picked up a sense of it. For starters, obviously an \(N\)th-order differential equation can end up with \(N\) integration constants:

\[\p_x^N (f(x) + c_1 + c_2 x + \ldots c_N x^{N-1}) = \p_x^N f(x)\]

But other problems end up with infinite series of solutions. For a fixed \(\lambda\), the operator \((\p_x^2 + \lambda^2)\) has a two-dimensional kernel of its own:

\[\begin{aligned} ( \p_x^2 + \lambda^2) f(x) &= 0 \\ f(x) &= a \sin \lambda x + b \cos \lambda x \end{aligned}\]

and once \(\lambda\) is allowed to range over a whole spectrum of eigenvalues \(\lambda_n\), as it is when you expand boundary data in a Fourier series, the homogeneous part of the solution becomes \(\sum_n a_n \sin \lambda_n x + b_n \cos \lambda_n x\), with an infinite list of free parameters.

Just from these examples it’s clear that, once again, the space of solutions is parameterized by the space of things that an operator sets to zero.3 \(\p_x^2\) sets only the two polynomial terms \(1\) and \(x\) to zero, so its solution space has two degrees of freedom. \((\p_x^2 + \lambda^2)\) sets \(\sin \lambda x\) and \(\cos \lambda x\) to zero, and letting \(\lambda\) run over a \(\bb{Z}\)’s worth of eigenvalues is what produces the familiar infinite families of \(a_n\)’s and \(b_n\)’s. Presumably these zero sets are measurable in the exact same mathematics by which determinants can be different “powers of zero” in the previous section.
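
A small SymPy check of those two kernels, just to watch the free constants appear (the symbol names are arbitrary):

```python
# Watching the free parameters appear: SymPy's dsolve returns the kernel of the
# operator as explicit constants C1, C2.
import sympy as sp

x = sp.symbols("x")
f = sp.Function("f")
lam = sp.symbols("lambda", positive=True)

print(sp.dsolve(sp.Derivative(f(x), x, 2)))
# -> f(x) = C1 + C2*x : the kernel {1, x} of d^2/dx^2

print(sp.dsolve(sp.Derivative(f(x), x, 2) + lam**2 * f(x)))
# -> a two-parameter family built out of sin(lambda*x) and cos(lambda*x)
```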

Here’s another thing. Our solution set will once again be parameterized by some perpendicular-inverse object, times another object of free parameters

\[f(x) = D_{\parallel}^{-1} g(x) + D_{\perp}^{-1} (\Lambda)\]

For instance for the inverse of \(\p_x^N\), the free-parameter object is still a vector \((c_1, c_2, \ldots c_N)\). For the infinite sines and cosines it’s a matrix \(\begin{pmatrix} a_1 & a_2 & \ldots \\ b_1 & b_2 & \ldots \end{pmatrix}\). I would not be surprised to find out that there are equations whose solutions are parameterized by tensors or even manifolds (like, “there’s one solution per point on a sphere”).

Now look what happens when we add boundary conditions. In many cases we can include the boundary conditions as an additional term in a “homogenized” equation:

\[\begin{cases} D f &= g \\ B f &= 0 \\ \end{cases}\]

That is:

\[\begin{pmatrix} D & -g \\ B & 0 \end{pmatrix} \begin{pmatrix} f \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}\]

Let’s assign \(Q = \begin{pmatrix} D & -g \\ B & 0 \end{pmatrix}\), so that

\[Q\cdot (f, 1) = 0\]

Then it feels like a good guess that in some sense this equation should be meaningful:

\[(f, 1) = Q^{-1}(0) = \star [\wedge^{n} Q]\]

Just as in the matrix case. Although I don’t have any idea how you compute \(\^^n\) of a block matrix of differential operators. But if you could, it seems like you would get the correct form again:

\[f = Q^{- \^1}(0) = Q_{\parallel}^{-1} (0) + Q_{\perp}^{-1} (\lambda)\]

only now \(Q\) is an object that has the boundary conditions encoded in it. The result had better be: \(Q_{\perp}\) is smaller than it was for \(D\), because of the need to reduce the solution space to compensate for the boundary conditions, and the exact degree to which it is smaller corresponds to how many degrees of freedom it took to express the boundary conditions in terms of the kernel of \(D\).
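
None of that is rigorous, but the finite-dimensional shadow of it is easy to play with. Here’s a hedged sketch where \(D\) is replaced by a finite-difference second derivative, \(B\) by the two rows that read off the boundary values, and the solution really does pop out of the nullspace of the block matrix \(Q\) (the grid size and test problem are arbitrary choices):

```python
# A finite-dimensional stand-in for Q (f, 1) = 0: D becomes a finite-difference
# second derivative, B the boundary rows, and f is read off the nullspace of
# the block matrix. Test problem: f'' = -pi^2 sin(pi x), f(0) = f(1) = 0,
# whose solution is sin(pi x).
import numpy as np

M = 100
xs = np.linspace(0.0, 1.0, M + 1)
h = xs[1] - xs[0]
g = -np.pi**2 * np.sin(np.pi * xs[1:-1])      # the source, sampled at interior points

D = np.zeros((M - 1, M + 1))                  # interior second-difference operator
for i in range(M - 1):
    D[i, i:i + 3] = np.array([1.0, -2.0, 1.0]) / h**2

B = np.zeros((2, M + 1))                      # boundary conditions f(0) = f(1) = 0
B[0, 0] = 1.0
B[1, -1] = 1.0

Q = np.block([[D, -g[:, None]],               # the homogenized block matrix
              [B, np.zeros((2, 1))]])

_, _, Vh = np.linalg.svd(Q)
v = Vh[-1]                                    # the one-dimensional nullspace of Q
f = v[:-1] / v[-1]                            # dehomogenize by the last coordinate

assert np.allclose(f, np.sin(np.pi * xs), atol=1e-3)
```

The nullspace comes out one-dimensional here because the two boundary rows use up the two integration constants, which is exactly the bookkeeping described above.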

If that sounds weird, it’s not. It’s a thing we already do when solving differential equations in QFT. When solving for the behavior of a field of charges and photons, you find a solution that has the form “field contribution due to charges in the region” plus “free parameters corresponding to photons that are passing through the area”. In the absence of boundary conditions, any number of photons can be passing through and un-interacted with. But when we specify a boundary condition, we have to find a way to write the boundary condition as a sum of photon waves, which are the free-field terms, so that we can express it entirely in objects that have \(Df = 0\). In the case of photons of course that means taking the Fourier transform of the boundary.4

Basically, I expect that there is a way to give meaning to the object \(Q^{-\^1}(0)\) such that this expression makes sense:

\[\begin{aligned} Q^{-\^1}(0) &= \begin{pmatrix} D & -g \\ B & 0 \end{pmatrix}^{-\^1} (0) \\ &= (\text{the solution due to the charges}) \\ &+ (\text{the boundary condition-preserving free field terms}) \\ &* (\text{a point in a manifold of free parameters}) \end{aligned}\]

… but in a non-physics-related sense, where this is just how you solve equations in generality. Wouldn’t that be cool?

Of course, I wouldn’t know where to start with making \(Q^{-\^1}\) mean something right now; that’s a matrix of operators you’re talking about! But, heck, give me the right non-commutative exterior algebra and maybe it just works? It’s too clean not to.

(Aside: I wonder if it’s possible for a differential equation to have a disconnected manifold worth of free parameters. Probably yes? I wouldn’t be surprised at all. In that case what kind of object does \(\lambda\) end up being? I think it’s fine for it to be a “disconnected manifold”; it would be the equation solver’s job to find an actual coordinate system when they want to express the solutions.)


4. Conclusions?

As you can see, if you followed any of that: there are signs that there is some kind of general calculus for inverting all kinds of linear objects, one which applies the same way to scalars, matrix equations, differential equations, and maybe more. Somehow it involves dividing by zero in a consistent way and being able to produce free parameters in arbitrary manifolds in the solutions to equations.

The basic procedure which seems to happen over and over is:

  • To solve \(ax = b\),
  • Produce \(x = (a, -b)^{-\^ 1}\).
  • Which also has the form \(x = a_{\parallel}^{-1}(b) + a_{\perp}^{-1} \lambda\)
  • Which by definition solves the equation, and is parameterized by a free point \(\lambda\) in the solution manifold.

The trick is figuring out what any of that means, lol.

I don’t see how it works at all, but the symmetry is so good that it’s agonizing. Of course everything gets more and more abstract the further we go into differential equations or non-linearity — but maybe there’s somebody out there who can piece it all together? And maybe, somehow, we can figure out how it has to work by mastering all the subjects, comparing all the examples, and poking in at the edges… until we find its true form?


My other articles about Exterior Algebra:

  1. Oriented Areas and the Shoelace Formula
  2. Matrices and Determinants
  3. The Inner product
  4. The Hodge Star
  5. The Interior Product
  6. EA as Linearized Set Theory?
  7. Oriented Projective Geometry
  8. All the Exterior Algebra Operations
  9. How To Invert Anything By Dividing By Zero
  1. In the context of any particular problem, \(\lambda\) may have a definite value. For instance when computing the Fourier transform of \(\p_x \theta(x) = \delta(x)\) (where \(\theta\) is the step function), one ends up with \((ik) \hat{\theta}(k) = 1\). Upon dividing through by \((ik)\) the result is \(\frac{1}{ik} + \pi \delta(k)\), where the coefficient \(\pi\) is exactly the right value of \(\lambda\): it’s \((2 \pi) (\frac{1}{2})\), with \(\frac{1}{2}\) being the average value of \(\theta(x)\), hence the right coefficient at \(k=0\).

  2. If you go to solve this, keep in mind that \(\delta(\b{k}) = \frac{\delta(k)}{4 \pi k^2}\).

  3. I’m vaguely aware that this is the point of the field called “algebraic geometry”, but it gets so technical and abstract that I haven’t learned much about it. I would much rather see things intuitively.