How To Invert Everything By Dividing By Zero

September 25, 2023

For a generic linear equation like \(ax = b\) the solutions, if there are any, seem to always be of the form

\[x = a_{\parallel}^{-1} (b) + a_{\perp}^{-1} (0)\]

regardless of whether \(a\) is “invertible”. Here \(a_{\parallel}^{-1}\) is a sort of “parallel inverse”, in some cases called the “pseudoinverse”, which inverts the invertible part of \(a\). \(a_{\perp}^{-1}\) is the “orthogonal inverse”, called either the nullspace or kernel of \(a\) depending on what field you’re in, but either way it’s the set of objects for which \(ax = 0\). Clearly \(ax = a (a_{\parallel}^{-1} (b) + a_{\perp}^{-1} (0)) = a a_{\parallel}^{-1} (b)\), which equals \(b\) whenever a solution exists at all.

This pattern shows up over and over in different fields, but I’ve never seen it really discussed as a general phenomenon. But really, it makes sense: why shouldn’t any operator be invertible, as long as you are willing to have the inverse (a) live in some larger space of objects and (b) possibly become multi-valued?

Here are some examples. Afterwards I’ll describe how this phenomenon gestures towards a general way of dividing by zero.


1. The Pattern

Matrices

Consider a linear system of equations that’s represented by matrix multiplication as

\[A \b{x} = \b{b}\]

If \(A\) is square and invertible, then the unique solution to this equation is given via left-multiplication by \(A^{-1}\):

\[\b{x} = A^{-1} \b{b}\]

If \(A\) is not square, or is square but not invertible, then the system of equations is either underspecified (it has a non-trivial affine subspace of possible solutions) or overspecified (it is unsolvable because there are more linearly-independent constraints than degrees of freedom), or possibly both.

The matrix \(A\) partitions its domain and codomain into some subspaces:

  • the row space \(\text{row}(A)\) is the orthogonal complement of the kernel, so it’s the input vectors that \(A\) produces something interesting from.
  • the kernel \(\text{ker}(A)\) is all \(x\) such that \(Ax = 0\), so it’s the input vectors that \(A\) ignores.
  • the column space \(\text{col}(A)\) is all \(y\) such that \(Ax = y\) for some \(x\), so it’s the output vectors that \(A\) can produce.
  • the cokernel \(\text{coker}(A)\) is the orthogonal complement of the column space, so it’s the output vectors that \(A\) can’t produce.

\(A\b{x} = \b{b}\) is solvable if \(\b{b}\) lies in \(\text{col} (A)\). If so, \(\b{x}\) will consist of a sum of elements from \(\text{row}(A)\), but can also include any element from \(\text{ker}(A)\) freely since it doesn’t change the value of \(A \b{x}\). That is, it takes the form \(\b{x} = \b{x}_{\text{one}} + \b{x}_{\text{free}}\), where the first term is one solution and the second term produces all the others. In fact the explicit form is given in terms of the Moore-Penrose Pseudoinverse \(A^{+}\):

\[\b{x} = A^+ (\b{b}) + (I - A^+ A) \vec{\lambda}\]

where \(\vec{\lambda}\) is a vector of free parameters corresponding to the number of degrees of freedom in the solution space.

\(A^{+}\) acts like an inverse, but only between \(\text{row}(A)\) and \(\text{col}(A)\) (which necessarily have the same dimension). Instead of the condition \(A A^{-1} = I\) the pseudoinverse obeys the weaker condition \(A A^{+} A = A\), which is why the second term above cancels out: \(A(I - A^+ A) = A - A A^+ A = A- A = 0\). Meanwhile \(A^+ \b{b}\) maps \(\b{b}\) from the column space back to the row space, if possible (or gets as close as possible, if not).
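If you want to see this formula in action, here’s a quick numpy check with a made-up rank-deficient matrix (nothing special about the numbers):

```python
import numpy as np

# A quick numerical check of the formula above, with a made-up singular matrix.
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])              # rank 1, so not invertible
b = np.array([3.0, 6.0])                # chosen to lie in col(A), so solvable

A_plus = np.linalg.pinv(A)              # Moore-Penrose pseudoinverse
lam = np.array([10.0, -4.0])            # arbitrary free parameters

x = A_plus @ b + (np.eye(2) - A_plus @ A) @ lam

print(np.allclose(A @ x, b))                          # True: x solves the system
print(np.allclose(A @ (np.eye(2) - A_plus @ A), 0))   # True: the free term is invisible to A
```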

Anyway, of the instances of this “generalized inverse” thing, this matrix pseudoinverse is the most refined and easiest to understand. But it seems to make sense in a lot of other settings.


Scalars

Once again, if an equation of the form \(ax = b\) can be inverted, it usually looks like this:

\[x = a_{\parallel}^{-1} (b) + a_{\perp}^{-1}(0)\]

Where the first term is a single solution that satisfies the constraint, and the second term is any displacement “orthogonal” to \(a\), meaning anything with \(a \, (a_{\perp}^{-1}) = 0\), so that adding it doesn’t change \(ax\).

For instance here is how you invert “scalar multiplication”, that is, how you divide by zero.

\[\begin{aligned} ax &= b \\ x &= b/a + \lambda 1_{a=0} \end{aligned}\]

\(\lambda\) is a free parameter and \(1_{a=0}\) an indicator function. If \(a=0\) and \(b=0\) then this works as long as we interpret \(b/a = 0/0\) to equal, say, \(1\). If \(a=0\) but \(b \neq 0\) then the answer is infinite, which makes sense: there’s really no solution to \(0x = b\) unless \(x\) is an object that acts like \(b/0\). So we can write the solutions as:

\[x = \begin{cases} b/a & a \neq 0 \\ b \times \infty & a = 0, b \neq 0 \\ \text{anything} & a = 0, b = 0 \\ \end{cases}\]

Not that this is especially useful on its own, but it’s one instance of a pattern. In a sense it is the right answer, in that it contains all the information that the true inverse of this operation has to contain — just, represented in a way we don’t really know how to do math with. Also note that this is basically the \(1 \times 1\) version of the matrix pseudoinverse.
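For what it’s worth, here’s the whole case analysis as a tiny Python sketch. The function name and the `inf` stand-in are just for illustration:

```python
# A minimal sketch of the scalar case: one particular solution to a*x = b,
# plus a flag for whether a free parameter appears (the lambda * 1_{a=0} term).
def solve_scalar(a, b):
    if a != 0:
        return b / a, False           # unique solution, no free parameter
    if b == 0:
        return 0.0, True              # 0 * x = 0: anything works, so a free parameter
    return float("inf"), False        # 0 * x = b != 0: only "b / 0" works

print(solve_scalar(2.0, 6.0))         # (3.0, False)
print(solve_scalar(0.0, 0.0))         # (0.0, True)
print(solve_scalar(0.0, 5.0))         # (inf, False)
```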


Vectors

Here’s the \(1 \times N\) version, for a more interesting example.

\[\b{a} \cdot \b{x} = b\]

Then

\[\begin{aligned} \b{x} &= \b{a}_{\parallel}^{-1} (b) + \b{a}_{\perp}^{-1}(0) \\ &= \frac{\b{a}}{\| \b{a} \|^2} (b) + \vec{\lambda} \cdot \b{a}_{\perp} \end{aligned}\]

The first term is again the pseudoinverse of \(\b{a}\) (naturally, since \(\b{a}\) is a \(1 \times N\) matrix). For vectors we can write it easily: it’s \(\b{a}^{+} = \b{a}_{\parallel}^{-1} = \frac{\b{a}}{\| \b{a} \|^2}\). It only inverts the subspace in which \(\b{a}\) is invertible, which for a vector is just one-dimensional. The second term is the “orthogonal inverse” of \(\b{a}\), which is any element of the \((N-1)\)-dimensional nullspace orthogonal to \(\b{a}\). \(\vec{\lambda}\) here is a vector of arbitrary free parameters which selects a particular element out of \(\b{a}_{\perp}\).
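Here’s a quick numpy check of the vector case, with a made-up \(\b{a}\) and \(b\): however you pick \(\vec{\lambda}\), the constraint still holds.

```python
import numpy as np

# The 1 x N case: the pseudoinverse of a row vector is a / |a|^2, and the free
# part is anything orthogonal to a. The vectors here are made up.
a = np.array([1.0, 2.0, 2.0])
b = 6.0

x_par = (a / np.dot(a, a)) * b          # a_parallel^{-1}(b)

# An orthonormal basis for the (N-1)-dimensional nullspace of a, via SVD
_, _, Vt = np.linalg.svd(a[None, :])
a_perp = Vt[1:]                         # rows orthogonal to a

lam = np.array([0.7, -1.3])             # arbitrary free parameters
x = x_par + lam @ a_perp

print(np.dot(a, x))                     # 6.0, regardless of lam
```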


Functions

Here’s an example with functions.

\[x f(x) = 1\]

The solution is

\[f(x) = \begin{cases} 1/x & x \neq 0 \\ \lambda & x = 0 \end{cases}\]

\(\lambda\) is again a free parameter. We’d kinda like to write these as one equation, and the way we can do it is like this:

\[f(x) = \mathcal{P}(\frac{1}{x}) + \lambda \delta(x)\]

Where \(\mathcal{P}\) is the Cauchy Principal Value and \(\delta(x)\) is a delta distribution. Whereas vectors and matrices forced us to “kick up into” a space with free parameters, functions require us to also “kick up into” the space of distributions. Neither \(\mathcal{P}(\frac{1}{x})\) nor \(\delta(x)\) gives a useful value on its own at \(x=0\), but when used as a distribution (e.g. integrated against a test function) this object really does behave like the solution to \(x f(x) = 1\). This is a pattern: when we invert non-invertible things, we often have to (a) introduce free parameters and (b) “leave the space we’re in” for a larger space that can contain the solution. In this case we leave “function space” for “distribution space” because \(\frac{1}{x}\) has no value at \(x=0\).1
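If you want to check that numerically, here’s a sketch: pair \(x \cdot (\mathcal{P}(\frac{1}{x}) + \lambda \delta(x))\) against a test function (an arbitrary Gaussian), computing the principal value as a symmetric integral. The \(\lambda \delta\) term contributes nothing once it’s multiplied by \(x\), and the result matches \(\langle 1, \varphi \rangle\).

```python
import numpy as np
from scipy.integrate import quad

def pv_pairing(psi):
    """<P(1/x), psi>, computed as the symmetric (principal value) integral
    of (psi(x) - psi(-x)) / x over x in (0, inf)."""
    return quad(lambda x: (psi(x) - psi(-x)) / x, 0, np.inf)[0]

phi = lambda x: np.exp(-x**2)            # an arbitrary test function
lam = 7.3                                # arbitrary free parameter

# <x * (P(1/x) + lam*delta), phi> = <P(1/x), x*phi> + lam * (0 * phi(0))
lhs = pv_pairing(lambda x: x * phi(x)) + lam * 0.0
rhs = quad(phi, -np.inf, np.inf)[0]      # <1, phi>

print(lhs, rhs)                          # both ~ 1.7725 (sqrt(pi)), independent of lam
```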

What if we have more powers of \(x\)?

\[x^n f(x) = 1\]

We should get something like

\[f(x) = \mathcal{P}(\frac{1}{x^n}) + \lambda_1 \delta(x) + \lambda_2 \delta'(x) + \ldots + \lambda_n \delta^{(n-1)}(x)\]

Since, by the same argument, each of those correction terms is killed by \(x^n\): as a distribution, \(x^n \delta^{(k)}(x) = 0\) for every \(k < n\).
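Here’s a symbolic spot-check of that claim with sympy, for \(n = 3\) and an arbitrary Gaussian test function: each \(\delta^{(k)}\) with \(k < n\) really does vanish against \(x^n\).

```python
from sympy import symbols, exp, oo, integrate, DiracDelta

# Check that x^n * delta^(k)(x) = 0 as a distribution for k < n (here n = 3),
# by pairing against a test function phi. The choice of phi is arbitrary.
x = symbols('x', real=True)
phi = exp(-x**2)

for k, d in enumerate([DiracDelta(x), DiracDelta(x, 1), DiracDelta(x, 2)]):
    print(k, integrate(x**3 * d * phi, (x, -oo, oo)))    # 0 every time
```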


Differential Operators

Here’s another example:

\[D f = g\]

The natural solution is

\[f = D^{-1}_{\parallel}(g) + D^{-1}_{\perp}(0)\]

What do these terms mean for a differential operator? The second term, again, is an inverse operator that produces the “nullspace” of \(D\), which is to say, produces solutions to the homogeneous equation \(D f = 0\). These are “free wave” solutions to the operator, which are packaged with a (probably) infinite number of free parameters. In general there is also a set of boundary conditions on the solution \(f\), so we’ll need to pick from these free parameters to satisfy the boundary conditions.

Or, we could solve it using a Green’s function for \(D\), by computing \(G = D^{-1}_{\parallel}(\delta)\) and then \(f = G \ast g\).

Or, we could invert the operator by directly inverting in Fourier space, which will end up being the function inversion from the previous section:

\[\hat{f} = \hat{D}_{\parallel}^{-1}(\hat{g}) + \hat{D}_{\perp}^{-1}(0)\]

The \(\hat{D}_{\perp}^{-1}(0)\) term will produce the homogeneous solutions to the differential operator when un-Fourier-transformed.

Example: the Poisson equation \(\del^2 f(\b{x}) = g(\b{x})\) is Fourier transformed into \((i k)^2 \hat{f}(\b{k}) = \hat{g}(\b{k})\) so \(\hat{f}(\b{k}) = \frac{1}{(ik)^2} \hat{g}(\b{k}) + \lambda_1 \delta(k) + O(\frac{1}{k}) \ldots\).2 The third term can be \(\frac{1}{k} \delta(k)\), \(\frac{A\b{k}}{k^2}\), or higher-order tensor combinations as long as the net magnitude is \(O(\frac{1}{k})\).

When untransformed we should get

\[\begin{aligned} f(\b{x}) &= \mathcal{F}^{-1} [\frac{1}{(ik)^2} \hat{g}( \b{k}) + \lambda_1 \delta(k) + \lambda_2 \, O(\frac{1}{k}) \delta(k) + \ldots] \\ &= - \int \frac{1}{4 \pi \| \b{x} - \b{x}_0 \|} g(\b{x}_0) \, d \b{x}_0 + \text{<the harmonic functions>} \end{aligned}\]

Although I’m not quite sure how to take those Fourier transforms so I can’t promise it.
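But here’s a 1D numerical sanity check of the Fourier-space division, at least (a sketch with a made-up periodic source, not the 3D Poisson problem above): the \(k = 0\) mode is exactly the divide-by-zero, and setting it by hand is the free parameter, i.e. the arbitrary additive constant.

```python
import numpy as np

# Solve f''(x) = g(x) on a periodic box by dividing by (ik)^2 in Fourier space.
# The k = 0 mode is the "divide by zero"; we set it by hand, which is the
# lambda * delta(k) free parameter (an arbitrary additive constant).
N, L = 256, 2 * np.pi
x = np.linspace(0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi

g = np.cos(3 * x)                        # a zero-mean source, so a solution exists
g_hat = np.fft.fft(g)

f_hat = np.zeros_like(g_hat)
nonzero = k != 0
f_hat[nonzero] = g_hat[nonzero] / (1j * k[nonzero]) ** 2
lam = 5.0                                # free parameter: the constant (k = 0) mode
f_hat[~nonzero] = lam * N

f = np.fft.ifft(f_hat).real

# Differentiating twice recovers g no matter what lam was
f_xx = np.fft.ifft((1j * k) ** 2 * np.fft.fft(f)).real
print(np.max(np.abs(f_xx - g)))          # ~ 1e-13
```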

Anyway, point is, whatever our generalized inverse operation looks like on derivative operators, it should be related via Fourier transforms to the generalized inverse on functions.


2. Inverting Matrices By Dividing By Zero

There is an interesting way of looking at the procedure for solving a matrix equation \(A\b{x} = \b{b}\). We recast the problem into projective geometry by adding a single “homogeneous” coordinate to the vector space, and then rewriting the equation like this:

\[(A, -\b{b}) \cdot (\b{x}, 1) = 0\]

\((A, -\b{b})\) is a single \(N \times (N+1)\)-dimensional matrix, and now we’re solving for an \((N+1)\)-component vector instead. That vector will come to us with the last coordinate \(\neq 1\), of course, so we divide \((\lambda \b{x}, \lambda)\) through by \(\lambda\) at the end to find the true value of \(\b{x}\). That division won’t work if \(\lambda = 0\), of course, but in that case we interpret “dividing by zero” to equal “infinity”, so that \((\b{x}, 0) \sim (\infty \b{x}, 0)\). In projective geometry these infinite objects are completely meaningful and useful; for instance the intersection of two parallel lines is a “point at infinity” and this means that the statement “any two lines intersect in a point” becomes true without any disclaimers. As such the division-by-zero is a feature, not a bug.

The nice thing about this form is that there are no longer two heterogeneous terms, \(a^{-1}_{\parallel}\) and \(a^{-1}_{\perp}\), in the inverse. Writing \(B = (A, -\b{b})\), the new equation is now

\[B (\b{x}, 1) = 0\]

And the only thing we’re concerned with is the nullspace of that matrix. Of course, the matrix is no longer square, so its nullspace will have at least one degree of freedom (one of which gets used up by the new coordinate that is forced to equal \(1\)). But that’s not a problem: we want to deal with degrees of freedom in the output in a generic way anyway. We’ll just expect to get at least one.
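Here’s that procedure as a numpy sketch, with a made-up invertible \(2 \times 2\) system: build \((A, -\b{b})\), grab its nullspace, and divide through by the homogeneous coordinate.

```python
import numpy as np

# The projective trick: solve A x = b by finding the nullspace of the augmented
# matrix B = (A, -b), then dividing by the homogeneous coordinate.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

B = np.hstack([A, -b[:, None]])       # the N x (N+1) matrix (A, -b)

# Nullspace via SVD. For invertible A there is exactly one nullspace direction;
# a singular A would give more, possibly "points at infinity" with last coordinate 0.
_, s, Vt = np.linalg.svd(B)
v = Vt[-1]                            # the nullspace vector (lam * x, lam)

x = v[:-1] / v[-1]                    # divide through by the homogeneous coordinate
print(x, np.allclose(A @ x, b))       # recovers A^{-1} b
```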

The solution to this is best understood in terms of exterior algebra. First let’s walk through the case where \(A\) is invertible, so that \(B^{\^n} \neq \vec{0}\). Then there is a multivector \(B^{\^n}\) which represents every dimension in \(B\), i.e., every dimension it does not set to \(0\). Its complement \(B_{\perp} = \star B^{\^ n}\) represents every dimension \(B\) does set to zero. Since \(A\) is invertible (for now), this complement is one-dimensional, and we can compute it:

\[\begin{aligned} B_{\perp} &= \star B^{\^ n} \\ &= \star [(A, -\b{b})^{\^ n}] \\ &= \star (A^{\^(n-1)} \^ (-\b{b}), A^{\^ n}) \end{aligned}\]

Now note that \(\star A^{\^(n-1)} = - \text{adj}(A)\), the adjugate matrix of \(A\), \(\star [A^{\^(n-1)} \^ (-\b{b})] = \text{adj}(A) \b{b}\), and \(\star A^{\^n} = \det(A)\), the determinant. So that gets us to the inverse:

\[\begin{aligned} \lambda (\b{x}, 1) &= \star (\star {-\text{adj}(A)} \cdot (-\b{b}), \star \det(A)) \\ \b{x} &= \frac{\text{adj}(A)}{\det(A)} \cdot (-\b{b}) \\ &= A^{-1} \b{b} \end{aligned}\]

… Up to some signs that I’m not really sure about. It is very hard to keep track of how signs should work when you try to connect exterior algebra objects to regular linear algebra objects with \(\star\). At some point I’m going to hash it all out for myself so I can stop leaving these disclaimers, but I haven’t gotten to it yet. See the aside for some justification, though.

I think of this object as a “negative \(1\)-th wedge power”:

\[A^{-\^ 1} = \frac{ A^{\^(n-1)}}{A^{\^n}}\]

Which is to say, it’s like the \((n-1)\)-th wedge power, but divided through by the \(n\)-th, giving a total “grade” of \(-1\). It’s an object that becomes the inverse matrix if we translate it back into matrix form with \(\star\), but is still meaningful… sorta… for non-invertible matrices.


The step that fails if \(A\) is not invertible is the division by \(A^{\^n}\), which is now \(0\). Hence, the problem of inverting matrices is the same thing as the problem of dividing by zero, and a theory of dividing by zero correctly should definitely be a theory of inverting non-invertible matrices also!

What happens when we do that division? We get something like… I guess… that the solution to \(Ax = 1\) is something like \(x = \mathcal{P} ( \frac{1}{A}) + \vec{\lambda} \cdot A_{\perp} \delta(\det A) + \ldots\), a series of delta functions of different orders. But how many \(\lambda\) components should we end up with? Obviously it should come from the actual grade of the solution multivector \(A_{\perp}\). So whatever system of dividing by zero we come up with is going to have to be able to “divide by a zero multivector” also.

The number of terms we end up with (compare the Moore-Penrose pseudoinverse formula) is related to the multiplicity of the zero in \(\det(A)\). The more “copies of zero”, the more degrees of freedom in the output space that we’ll have to produce. That means that \(\text{diag}(0,1)\) and \(\text{diag}(0, 0)\) should be different objects: the former is \(\propto 0\) while the latter is \(\propto 0^2\), so that we know how many degrees of freedom are produced in the inverse. The number of copies of zero is going to be the same as the number of \(0\) singular values in the matrix’s SVD.
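Here’s a quick numerical check of that counting, using those two diagonal matrices: the rank of \((I - A^+ A)\), which projects onto the free directions, matches the number of zero singular values.

```python
import numpy as np

# "Copies of zero" vs. degrees of freedom: the rank of the projector (I - A^+ A)
# equals the number of zero singular values of A.
for A in [np.diag([0.0, 1.0]), np.diag([0.0, 0.0])]:
    A_plus = np.linalg.pinv(A)
    freedom = np.eye(2) - A_plus @ A                              # projector onto ker(A)
    n_free = int(round(np.trace(freedom)))                        # rank of a projector = its trace
    n_zero_sv = int(np.sum(np.linalg.svd(A, compute_uv=False) < 1e-12))
    print(n_free, n_zero_sv)                                      # 1 1, then 2 2
```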

The value we end up getting should in some sense be the same object as the Moore-Penrose pseudoinverse. So that means that this

\[\b{x} = A^+ (\b{b}) + (I - A^+ A) \vec{\lambda}\]

is definitely another way of writing this

\[\b{x} = A^{-\^ 1} \cdot (-\b{b})\]

which should, when properly formulated, be a formula that solves linear systems of equations “in full generality”.
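Numerically, at least, the two forms do agree, in the sense that both produce solutions of the same system. Here’s a sketch with a made-up singular-but-consistent system: the pseudoinverse solution and the projective-nullspace solution both satisfy \(A \b{x} = \b{b}\) (they may differ by an element of \(\ker A\)).

```python
import numpy as np

# Compare the two forms on a singular but consistent system (made-up numbers).
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
b = np.array([3.0, 6.0])

# Form 1: Moore-Penrose pseudoinverse (minimum-norm particular solution)
x1 = np.linalg.pinv(A) @ b

# Form 2: nullspace of the projective matrix (A, -b)
B = np.hstack([A, -b[:, None]])
_, s, Vt = np.linalg.svd(B)
r = int(np.sum(s > 1e-12))                       # numerical rank of B
null = Vt[r:]                                    # orthonormal basis of the nullspace
finite = null[np.argmax(np.abs(null[:, -1]))]    # a direction with nonzero last coordinate
x2 = finite[:-1] / finite[-1]

print(np.allclose(A @ x1, b), np.allclose(A @ x2, b))   # True True
```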


So, the “inverse of zero” is “a degree of freedom.” Division by a multivector necessarily produces extra \(\lambda\) terms, proportional to “how zero it is”.

By the way: the only reason we moved into projective coordinates was to deal with the problems of “dividing by zero” and dealing with infinities in a way our math system could handle. But if our math system could handle dividing by zero and producing degrees of freedom correctly on its own, then we wouldn’t have needed any of the projective coordinate stuff. This suggests a general principle: projective coordinates are being used to work around a problem in the underlying calculus.

There’s a lot more to say about this and I hope to investigate it later (and actually go learn algebraic geometry so I’m not cluelessly ignorant! Hmm.) But instead, let’s go look at differential equations again.


3. Integration Constants and Boundary Conditions

The way we solved \(A \b{x} = \b{b}\) in projective coordinates for matrices can give us intuition for inverting differential operators like \(D f = g\).

Consider how differential equations’ inverses produce degrees of freedom in their solutions.3 For starters, obviously an \(N\)th-order differential equation can end up with \(N\) integration constants:

\[\p_x^N (f(x) + c_1 + c_2 x + \ldots c_N x^{N-1}) = \p_x^N f(x)\]

More generally an operator like \(\p_x - g\), with \(g\) acting by multiplication, sets to zero any function that has \(\p_x (f) = g f\). (There are better examples for this but I can’t remember them at the moment.)

Similarly, in many cases the boundary conditions of a boundary value problem can be viewed as an operator of their own. For instance the condition \(f(x) = f(x + 2 \pi)\) is like a finite difference operator \(T\), with \(T f = f(x) - f(x + 2\pi) = 0\), and the set of functions obeying it is given by the inverse \(T^{-1}(0)\). So the solution to a boundary value problem may be viewed as combining the solutions to two individual problems: the differential operator and the boundary condition.

Now look what happens when we put them together. In many cases we can treat the boundary conditions as a separate linear operator and stack everything into one matrix:

\[\begin{aligned} D f &= g \\ B f &= 0 \\ \begin{pmatrix} D & -g \\ B & 0 \end{pmatrix} \begin{pmatrix} f \\ 1 \end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \end{aligned}\]

Let’s assign \(Q = \begin{pmatrix} D & -g \\ B & 0 \end{pmatrix}\), so that

\[Q\cdot (f, 1) = 0\]

And then maybe we can try to manipulate it like the matrices in the previous section?

\[(f, 1) = Q^{-1}(0) = \star [Q^{\^ n}]\]

Just as in the matrix case. Although I don’t have any idea how you compute the wedge power \(Q^{\^ n}\) of a “block matrix of operators”. But if you could, it seems like you would get the correct form again, by using something akin to the matrix pseudoinverse:

\[f = Q^{- \^1}(0) = Q_{\parallel}^{-1} (0) + Q_{\perp}^{-1} (\lambda)\]

This is basically what you already do when solving a boundary value problem, but I like the idea that applying all the conditions is a result of simple algebra instead of a process that exists outside of the algebra itself. \(Q_{\perp}\) is smaller than it was for \(D\), because of the need to reduce the solution space to compensate for the boundary conditions, and the exact degree to which it is smaller corresponds to how many degrees of freedom it took to express the boundary conditions in terms of the kernel of \(D\). Basically I expect that there is a way to generically give meaning to the object \(Q^{-\^1}(0)\) such that this expression makes sense:

\[\begin{aligned} Q^{-\^1}(0) &= \begin{pmatrix} D & -g \\ B & 0 \end{pmatrix}^{-\^1} (0) \\ &= (\text{a particular solution in } D^{-1}(g) \vee B^{-1}(0)) \\ &+ (\text{any point in a manifold of free parameters}) \end{aligned}\]
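To make that slightly more concrete, here’s a finite-difference sketch. Everything here is made up for illustration (the grid, the source \(g = 1\), and Dirichlet conditions \(f(0) = f(1) = 0\)): discretize \(D\) and \(B\) as matrices, build the block matrix \(Q\), and read the solution off its nullspace, exactly like the projective matrix trick from section 2.

```python
import numpy as np

# Discretized Q (f, 1) = 0: D is a finite-difference d^2/dx^2 (interior rows
# only), B imposes f(0) = f(1) = 0, and g = 1 is the source.
n = 50
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]

D_full = (np.diag(np.ones(n - 1), -1) - 2 * np.eye(n) + np.diag(np.ones(n - 1), 1)) / h**2
D = D_full[1:-1]                        # (n-2) x n: second derivative at interior points
g = np.ones(n - 2)                      # f'' = 1 in the interior

B = np.zeros((2, n))
B[0, 0] = 1.0                           # f(0) = 0
B[1, -1] = 1.0                          # f(1) = 0

Q = np.block([[D, -g[:, None]],
              [B, np.zeros((2, 1))]])   # the block matrix (D, -g; B, 0)

_, s, Vt = np.linalg.svd(Q)
v = Vt[-1]                              # the single nullspace direction (lam * f, lam)
f = v[:-1] / v[-1]                      # divide by the homogeneous coordinate

# Exact answer is x(x-1)/2; centered differences are exact on quadratics,
# so the error is just roundoff.
print(np.max(np.abs(f - 0.5 * x * (x - 1))))
```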

4. Conclusions?

As you can see, if you followed any of that: there are signs that some kind of general calculus exists for inverting all kinds of linear objects, which applies the same way to scalars, matrix equations, differential equations, and maybe more. Somehow it involves dividing by zero in a consistent way and being able to produce free parameters in arbitrary manifolds in the solutions to equations.

The basic procedure which seems to happen over and over is:

  • To solve \(ax = b\),
  • Produce \(x = (a, -b)^{-\^ 1}\).
  • Which also has the form \(x = a_{\parallel}^{-1}(b) + a_{\perp}^{-1} \lambda\)
  • Which by definition solves the equation, and is parameterized by a free point \(\lambda\) in the solution manifold.

The trick is figuring out what any of that means, lol.

I don’t see how it works at all, but the symmetry is so good that it’s agonizing. Of course everything gets more and more abstract the further we go into differential equations or non-linearity — but maybe there’s somebody out there who can piece it all together? And maybe, somehow, we can figure how it has to work by mastering all the subjects, comparing all the examples, and poking in at the edges… until we find its true form?


My other articles about Exterior Algebra:

  1. Oriented Areas and the Shoelace Formula
  2. Matrices and Determinants
  3. The Inner product
  4. The Hodge Star
  5. The Interior Product
  6. EA as Linearized Set Theory?
  7. Oriented Projective Geometry
  8. Simplex Volumes and Boundaries
  9. All the Exterior Algebra Operations
  1. In the context of any particular problem, \(\lambda\) may have a definite value. For instance when computing the Fourier transform of \(\p_x \theta(x) = \delta(x)\) (where \(\theta\) is the step function), one ends up with \((ik) \hat{\theta}(k) = 1\). Upon dividing through by \((ik)\) the result is \(\frac{1}{ik} + \pi \delta(k)\), where the coefficient \(\pi\) turns out to be the right value of \(\lambda\): it’s actually \((2 \pi) (\frac{1}{2})\), where \(\frac{1}{2}\) is the average value of \(\theta(x)\), hence it’s the right coefficient at \(k=0\). 

  2. If you go to solve this, keep in mind that \(\delta(k) = \frac{\delta(\b{k})}{4 \pi k^2}\). 

  3. It has been ten years since I took a differential equations class so apologies if this is a bit simplistic.