Sunday, 22 June 2025

The Product Topology, or, How Not to Stumble into the Box Topology

WARNING: This post won’t make much sense unless you’ve read at least my first Topology As Touching post.

Product Topologies and the Mystery of the Finiteness Constraint

Okay, it looks like I’m back to posting about topology. As per usual, I have been slow-burn revisiting topological notions through the lens of touching, a different axiomatization of topology that seems to give me a different intuitive sense of the relevant concepts than the way I originally learned about them from Munkres’s textbook. In general I’m a fan of being able to triangulate mathematical concepts from different perspectives, and topology’s abstractness and counter-intuitive facets push me to keep searching for new sources of clarity. One of these counter-intuitive facets is the construction of product topologies, especially when the product is infinite. As I’ll describe next, the usual presentation magically introduces a particular “finiteness” constraint that feels to me like it arises with little intuition, aside from “getting the right answers.” I hope to show that by switching from the open-set conception of topologies to the touching conception of topologies, we can extract a bit more intuition about why finiteness arises in infinite product topologies.

Product Topologies the Open Set Way

Munkres, like most topology texts, explains the concept of topology in terms of open sets (Bourbaki can be credited/blamed for this). Thus, when it comes time to explain many topological concepts, they are often described in terms of their effect on open sets, or their interaction with open sets. Naturally this approach is applied to product topologies.

Given two sets \(A\) and \(B\), and their elements, \(a \in A\), \(b \in B\), we often describe the product set \(A \times B\) of ordered pairs \(\braket{a,b}\) of elements of the two sets. This idea extends to topologies: given two topological spaces \(\mathcal{A}\) and \(\mathcal{B}\), we can describe the product space \(\mathcal{A} \times \mathcal{B}\), i.e. the space given the product topology in accordance with topologies \(\mathcal{A}\) and \(\mathcal{B}\). Since some set \(A\) underlies the space \(\mathcal{A}\) and some set \(B\) underlies \(\mathcal{B}\), we expect, rightfully, the product space \(\mathcal{A} \times \mathcal{B}\) to be underlain by the product set \(A \times B\) of pairs. The question to be answered is: what is the appropriate topology to apply to subsets of \(A \times B\), given the topology that \(\mathcal{A}\) imposes on \(A\) and \(\mathcal{B}\) imposes on \(B\)?

Let \(O_A\) be the set of open sets associated with \(\mathcal{A}\), and \(O_B\) be the open sets for \(\mathcal{B}\). Then the open sets for the product topology \(\mathcal{A} \times \mathcal{B}\) are constructed from the subsets \(U_A \times U_B\) for every open set \(U_A \in O_A\) and \(U_B \in O_B\). The details of the construction are not important for the purposes of this post, but at least to grad-student me, given the idea of open sets in the first place, it “feels right” that the open sets of \(\mathcal{A} \times \mathcal{B}\) would be constructed from the open sets of the component topologies.

All is great...right up until the point where we consider extending the idea to infinite products. We can generalize the notation of set-theoretic products \(A \times B\) to \(\Pi_{j \in \set{1,2}} A_j\), so that \(A\) is now \(A_1\) and \(B\) is now \(A_2\). Then each pair \(\braket{a,b}\) is not unlike a Python dictionary \(\set{1:a, 2:b}\) where the numbers \(1\) and \(2\) are the keys to records. In this explicitly indexed notation, the order of the entries no longer matters. This notation comes in handy if we want an infinite set of keys like \(j \in \mathbb{N}\) (where \(\mathbb{N}\) is the set of natural numbers \(0, 1, 2, \dots\)), which gives us a record with an infinite number of elements. So the general notation for products becomes \(\Pi_{j \in J} A_j\), but for simplicity we’ll focus on \(J = \mathbb{N}\), the natural numbers.
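(If it helps to see the dictionary analogy spelled out, here is a tiny Python sketch of my own, purely illustrative: a finite indexed product is literally a dictionary keyed by the index set, and an element of an infinite product over \(\mathbb{N}\) can be modeled as a function from indices to coordinates.)

```python
# A finite product as a dictionary keyed by the index set {1, 2}.
record = {1: "a", 2: "b"}        # plays the role of the pair <a, b>

# An element of an infinite product over the naturals, modeled as a function
# from indices to coordinates; here, the sequence 0, 1, 4, 9, ...
def squares(j: int) -> int:
    return j * j

# "Projection" at coordinate j is just lookup (or application) at j.
assert record[1] == "a" and squares(3) == 9
```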

At this point, the seemingly obvious (to me, and it seems, to Munkres) approach to extending product topologies to infinite products ends up being wrong. Given a family of topological spaces \(\left( \mathcal{A}_j \right)_{j \in \mathbb{N}}\), you might expect the open sets to be constructed from the sets \(\Pi_{j \in \mathbb{N}} U_j\), i.e. arbitrary products of open sets \(U_j\) drawn from the corresponding topologies. Unfortunately the result is indeed a topology, but it is the wrong one, in the sense that this particular topological space exhibits weird and unpleasant properties. That kind of topology, called the box topology, ends up being a source of mind-bending counterexamples. To quote Munkres:

We shall find that a number of important theorems about finite products will also hold for arbitrary products if we use the product topology, but not if we use the box topology. As a result, the product topology is extremely important in mathematics. The box topology is not so important; we shall use it primarily for constructing counterexamples.

So if following our nose does not get us the “right” product topology, how do we produce the right one? Well, to summarize quickly, it is constructed from subsets \(\Pi_{j \in \mathbb{N}} G_j\) where a finite number of the \(G_j\) are open sets \(U_j\) of their corresponding topologies, and the rest of the \(G_j\) are the entire coordinate set \(A_j\). What the heck?!? For some reason only a finite number of open sets from the collection of topologies is used. Why? Well, as Munkres explains it, because the resulting infinite product topologies behave like the finite ones. The textbook does provide some theorems about the two topologies that will give some satisfaction to folks with background in a certain branch of abstract mathematics (see the end of this post if you see yourself in this sentence!), but I do not find that explanation satisfying from the internal point of view of topology itself.

Going forward, once the distinction between box topology and product topology is made, the textbook proposes various exercises to demonstrate that the box topology is just wrong. One learns by example, not by concept, about this wrongness. Don’t touch the third rail.

You can probably guess by now where this is going. Thanks to my time staring at, and thinking about, touching, I think I have an explanation for why this peculiar finiteness constraint arises in the product topology. In addition to the appeal to touching, this is also a consequence of thinking differently about products: not as particular sets, but as particular interfaces. Hopefully this is an example of how important it is to distinguish between extensionally equivalent mathematical definitions, in that they may convey significantly distinct intensional knowledge that suggests substantially different conceptions and generalizations. In this case, at least, I would not have gotten here without switching from open sets to touching.

Products of Sets

First, let’s set aside topology and ask the simpler question: what is a product? I’m going to answer this question in the context of set theory, but the idea carries over to other mathematical foundations. A product is not a thing, but rather it’s an interface: an API if you will. The most common presentations of set theory (like ZFC) only have sets as entities: they don’t even have atomic non-set elements, let alone pairs of them! The main reason is that sets end up being enough: you can encode everything else using them, just like you can encode all the stuff on your computer using 0’s and 1’s. You don’t want to think at that level of abstraction 99% of the time, but it keeps things quite simple and uniform if your goal is to reason about set theory, and not to get work done in a set theory (hence my quip that set theory is the X86 instruction set of mathematics: ugly but effective).

To explain products, we’ll start simple, with pairs of natural numbers, and work up to products in general. So suppose I have the number \(5\), and the number \(7\). Informally speaking, I may want to put them together as a pair \(\braket{5,7}\), with \(5\) as the first coordinate, and \(7\) as the second. On the other hand, I might want the numbers to be reversed, \(\braket{7,5}\), with \(7\) as the first coordinate and \(5\) as the second. These are not the same pair, because the first coordinates are not equal, and also the second coordinates are not equal either. This distinction, identifying a pair with its first element and second element, is the essential concept: a pair \(p_1\) is identical to the pair \(p_2\) if and only if the first coordinate of \(p_1\) is identical to the first coordinate of \(p_2\) and the second coordinate of \(p_1\) is identical to the second coordinate of \(p_2\). For the moment, we will use the notation \(\pi_1(p)\) to denote the first coordinate of \(p\), and \(\pi_2(p)\) for the second, assuming that \(p\) is indeed a pair. Furthermore, for each ordered couple of numbers \(n_1\) and \(n_2\) there is a unique pair that has them as coordinates. For this we use the notation \(\braket{n_1,n_2}\).

Ok, so this describes an interface to pairs of natural numbers \(\mathbb{N}\times \mathbb{N}\). How do we actually “implement” that interface? Wellllllll, you can implement it however you want, so long as your “implementation” satisfies the properties that you want. Here’s an “implementation” that works: associate to each ordered couple of numbers \(n_1\) and \(n_2\) the single natural number \(2^{n_1} \mathop{*} 3^{n_2}\). We’ll denote this pair-of-two-numbers-represented-by-one-number \(\braket{n_1,n_2}\). This is our representation of pairing. For example, \(\braket{5,7} \equiv 69,984\) and \(\braket{7,5} \equiv 31,104\). Clear as mud! Now define the set \(P\) to be the set of all natural numbers that can be represented in the form \(2^{n_1} * 3^{n_2}\) for some \(n_1\) and \(n_2\). Now define a function \(\pi_1\) to be the function from \(P\) to \(\mathbb{N}\) that produces the unique value \(n_1\) such that \(p = 2^{n_1} * 3^{n_2}\) for some natural number \(n_2\), i.e. \(\pi_1(\braket{n_1,n_2}) = n_1\). Finally take the analogous step for \(\pi_2\). Voila! So long as you don’t violate the abstraction that you’ve created, then you have a representation of pairs of natural numbers (here is where having a type checker might help). Notice that these pairs \(p\) do not literally have copies of \(n_1\) and \(n_2\) hiding in them. They’re just numbers that satisfy some convenient arithmetical relationship to \(n_1\) and \(n_2\). The functions \(\pi_1\) and \(\pi_2\) are where all the real work happens. You might have noticed that I introduced our representation of pairs using notation (i.e. syntactic sugar), whereas I introduced projection as a function. This saves me from the circular pain of describing a priori what the domain of a function \(\braket{\bullet,\bullet}\) is. For reference, this trick, often associated with Kurt Gödel, depends on \(2\) and \(3\) being prime numbers, and the uniqueness of prime factorization.
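Here is a small Python sketch of this particular “implementation”; the function names pair, pi1, and pi2 are mine, chosen to mirror the notation above, and the sketch is illustrative rather than canonical.

```python
def pair(n1: int, n2: int) -> int:
    """Represent the ordered pair <n1, n2> as the single number 2**n1 * 3**n2."""
    return 2**n1 * 3**n2

def _exponent_of(p: int, base: int) -> int:
    """How many times `base` divides `p` (assumes p >= 1)."""
    count = 0
    while p % base == 0:
        p //= base
        count += 1
    return count

def pi1(p: int) -> int:
    """First coordinate: the exponent of 2 in p's prime factorization."""
    return _exponent_of(p, 2)

def pi2(p: int) -> int:
    """Second coordinate: the exponent of 3 in p's prime factorization."""
    return _exponent_of(p, 3)

assert pair(5, 7) == 69_984 and pair(7, 5) == 31_104
assert pi1(pair(5, 7)) == 5 and pi2(pair(5, 7)) == 7
```

So long as client code only ever goes through pair, pi1, and pi2, it cannot tell this representation apart from any other one.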

There are many, many alternative sets \(P\), conceptions of \(\braket{\bullet,\bullet}\), and functions \(\pi_1\) and \(\pi_2\) that would suffice to implement the pair-of-natural-numbers interface. And we can always translate between implementations as needed if we have multiple different ones at play. All of these implementations capture the same thing, namely the concept of pairs of natural numbers, and impose a behavioural notion of identity among them (cf. “Duck Typing”).

Generalizing beyond numbers, some implementations of pairs in set theory are completely agnostic to the kinds of things being paired. The most famous is due to Kuratowski: \(\braket{a,b} \equiv \set{\set{a},\set{a,b}}\). I’m not going to explain this one, but it just uses a simple set-theoretic encoding to distinguish between the first element and the second element. Note, however, that the pair of two copies of the same element collapses to \(\braket{a,a} \equiv \set{\set{a}}\), which is a little wild, and yet causes no problems (Kuratowski was a bright dude!). There are many, many set-theoretic pair encodings, each with different “implementation-side” properties: but they all satisfy the same interface. Furthermore, these ideas can be extended to arbitrary products \(\Pi_{j \in J} A_j\).

Some Geometric Intuition for Product Topologies

Product topologies build upon this abstract notion of products of sets. But before I get into the formal bits of the product topology, I will try to bring some geometric intuition to bear on the problem. In particular, we’ll start with two-dimensional Euclidean space, the space of two-dimensional geometric objects. In this world, a point touches a set of points if and only if the point is either a member of the set or is “arbitrarily close” to the set, as demonstrated in some of my earlier posts about topology. For now geometric intuition ought to suffice.


Fig. 1: A disc and a point

Imagine that you have a disc and a point next to one another (see Fig 1). Does the point touch the disc? I’m hoping that you will say “no”. But how can you tell? Intuitively there is a gap between the point and any point of the disc, which is easy to tell when looking “down” on this figure from “up above” in the third dimension, which as it happens does not exist in two-dimensional geometry.

Fig. 2: Disc and point, plus coordinate projections \(\pi_1(D)\) and \(\pi_2(D)\)

So let’s consider the problem from within two-dimensional geometry. Suppose that, in order to answer the question, you are only allowed to look at the “shadow” that the figure casts onto the (one-dimensional) x axis, and the shadow that the figure casts onto the (also one-dimensional) y axis (see Fig 2). Uh oh! If we view the projection of the disc and point onto the coordinate axes, then from both vantages, the point (drawn huge for emphasis) appears to be touching the line segment that results from projecting the disc. So it seems, at first, like we simply cannot judge touching correctly based on coordinate projections.

But we can! This is thanks to one of the axioms of touching: if \(a\) is a point, and \(A\) and \(B\) are (possibly overlapping) sets, then \(a \mathrel{\delta}A \cup B\) implies that \(a \mathrel{\delta}A\) or \(a \mathrel{\delta}B\), where \(a \mathrel{\delta}A\) is the relation that says a point \(a\) touches a set \(A\), and by which a topological space can be described. In words, if \(a\) touches a set that can be described in two parts, then it must touch at least one of those two parts. This statement generalizes to any finite number of parts, because we can repeat this reasoning any (finite) number of times, slicing and dicing the original set piecemeal.

This axiom immediately implies the following contrapositive statement: If \(a \mathrel{\not\delta}A\) and \(a \mathrel{\not\delta}B\) then \(a \mathrel{\not\delta}A \cup B\), where \(\mathrel{\not\delta}\) means “does not touch”. So to prove that \(a\) does not touch some aggregate object, it suffices to break it into two pieces and show that \(a\) does not touch either of the pieces.

We also need one more piece of reasoning to solve our puzzle, and it’s based on the fact that the two projections onto the coordinate axes, which we’ll call \(\pi_1\) and \(\pi_2\), are continuous functions. A function \(f\) between two topological spaces is continuous if and only if \(a \mathrel{\delta}A\) implies \(f(a) \mathrel{\delta}f(A)\), where \(f(A)\) is the pointwise mapping of all elements of the set \(A\). I hope that via geometric intuition you can believe that if the point did in fact touch the disc, then the projection of the point would necessarily touch the projection of the disc. By contrapositive, we get: if \(f(a) \mathrel{\not\delta}f(A)\) then \(a \mathrel{\not\delta}A\).

Notice what the above does not say! If we discover that \(f(a) \mathrel{\delta}f(A)\), that tells us nothing about whether \(a \mathrel{\delta}A\) or not. This explains how our point can fail to touch the disc even though each projection of the point touches the corresponding projection of the disc. So sad!

Fig. 3: Bifurcated disc and point, plus projections \(\pi_1(A)\), \(\pi_2(A)\), \(\pi_1(B)\), and \(\pi_2(B)\)

But what if we cut the disc into two semi-discs \(A\) and \(B\) (see Fig. 3)? Well first consider \(A\): \(\pi_2(p) \mathrel{\delta}\pi_2(A)\), which is no better than our previous attempt, but \(\pi_1(p) \mathrel{\not\delta}\pi_1(A)\), which implies \(p \mathrel{\not\delta}A\)! Similarly, \(\pi_1(p) \mathrel{\delta}\pi_1(B)\), but \(\pi_2(p) \mathrel{\not\delta}\pi_2(B)\), which implies \(p \mathrel{\not\delta}B\)! Since we know \(p \mathrel{\not\delta}A\) and \(p \mathrel{\not\delta}B\), we get \(p \mathrel{\not\delta}A \cup B\). Voila!


Fig. 4: Trifurcated disc and point

We need to reflect on this a bit more to understand the pattern. Had our point been a bit further up or to the right of the disc, then we would not have needed to break the disc into two pieces: one big piece would have sufficed because our original projections could have observed non-touching. So we don’t always need to break our set into exactly two pieces (two being the number of coordinates of our product space). Can we break the disc into more than two pieces? The answer is yes! Suppose we broke \(A\) into pieces \(A_1\) and \(A_2\), and left \(B\) the same, giving us a total of three pieces (see Fig. 4). Then we would have found that \(\pi_1(p) \mathrel{\not\delta}\pi_1(A_1)\) and \(\pi_1(p) \mathrel{\not\delta}\pi_1(A_2)\), so all three pieces would still not touch \(p\). Moreover, in the illustrated case, we have split \(A\) in a way that \(\pi_2(p) \mathrel{\not\delta}\pi_2(A_2)\) also holds. So the three-way split was overkill: we could have left things as \(A\) and \(B\), but we could also have originally split into \(A_1\) and \(B \cup A_2\), or even the overlapping \(A = A_1 \cup A_2\) and \(A_2 \cup B\)! Given an arbitrary finite split, we could combine the subsets that \(\pi_1\) can prove do not touch \(p\) and combine the subsets that \(\pi_2\) can prove do not touch \(p\) and still prove failure to touch. In this two-dimensional case, two subsets are sufficient but not necessary.

To summarize, the key to proving that a point does not touch a two-dimensional set through coordinate projections is to find some way to split the set into a finite number of pieces such that, via some projection, each piece can be shown to fail to touch the point. If this is not possible, then the point must touch the set. It is this strategy that we seek to generalize to the definition of arbitrary products of arbitrary topological spaces.
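To make this strategy concrete, here is a small numeric sketch in Python. The specific disc, point, and the choice to split along the line \(y = x\) are all mine, invented to mirror the argument above rather than to reproduce the figures exactly.

```python
import math

# The closed unit disc D centered at the origin, and a point p = (0.8, 0.8),
# which lies outside D (its distance from the origin is about 1.13 > 1).
p = (0.8, 0.8)
r = 1.0  # radius of D

def touches_interval(x: float, lo: float, hi: float) -> bool:
    # In one dimension, a point touches a closed interval iff it lies in it.
    return lo <= x <= hi

# The shadow of the whole disc on either axis is [-1, 1], and both coordinates
# of p land inside it, so the raw projections cannot refute touching.
assert touches_interval(p[0], -r, r) and touches_interval(p[1], -r, r)

# Split D along the line y = x into half-discs A (where y >= x) and B (y <= x).
# A's shadow on the x axis is [-1, 1/sqrt(2)], which p's x-coordinate misses...
assert not touches_interval(p[0], -r, r / math.sqrt(2))
# ...and B's shadow on the y axis is [-1, 1/sqrt(2)], which p's y-coordinate misses.
assert not touches_interval(p[1], -r, r / math.sqrt(2))

# By the union axiom, p touches neither A nor B, hence it does not touch their
# union, which is the whole disc D.
```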

From Products of Sets to Products of Topological Spaces

In the last geometric example, we came up with a way to use projections and one-dimensional geometry to reason about two-dimensional touching. In particular, we noticed that our two-dimensional notion of non-touching corresponded with non-touching of projections of chunks of the two-dimensional set. This is the key to understanding what a topological product is. Once again, it’s not a set, it’s an interface, but that interface has more structure than the one for mere set-theoretic products.

Suppose you have a set \(A\) and a set \(B\), and furthermore you have a notion of touching \(\mathrel{\delta}_A\) and \(\mathrel{\delta}_B\) for each, so they form topological spaces. We already know that we can produce some set \(P\) that implements the product \(A \times B\) set-theoretically, some representation \(\braket{a,b}\) that denotes the pairing of an element of \(A\) with an element of \(B\), and projections \(\pi_1\) and \(\pi_2\) that uniquely project the coordinates back out of the pair representation.

Now, our problem is to answer the question “what topological structure (i.e. touching relationship) applied to \(P\) makes it a product of \((A,\mathrel{\delta}_A)\) and \((B,\mathrel{\delta}_B)\)?” Well, if we want to incorporate the topological structure of \(A\) and \(B\) like our geometric example, then we need the topology on \(P\) to ensure that \(\pi_1\) and \(\pi_2\) are continuous functions. But that’s too easy: just give \(P\) the discrete topology, the one where touching coincides with elementhood (\(x \mathrel{\delta}_P X\) if and only if \(x \in X\), so the least amount of touching possible), or equivalently where all subsets are open (so the most open sets). That would make the two functions continuous, trivially. Looking from the other side, consider the trivial topology: the one where every point touches every non-empty set (so the most touching), or equivalently where the only open sets are the empty set and the entire space (so the fewest open sets). Our projections \(\pi_1\) and \(\pi_2\) would not be continuous unless \((A,\mathrel{\delta}_A)\) and \((B,\mathrel{\delta}_B)\) were also trivial spaces. So we want something in between but canonical.

Here is what we want for \(P\): the topology with the most touching possible for which both \(\pi_1\) and \(\pi_2\) are continuous. This can also be called the coarsest such topology, because coarseness in topology is defined in terms of open sets: it is the topology with the fewest open sets. As it happens, this topology \(\mathrel{\delta}_P\) will have the property that \(p \mathrel{\not\delta}_P X\) if and only if \(X\) can be broken up into a finite number of pieces \(X_1,\dots,X_n\) such that for each \(X_i\) we can use \(\pi_1\) or \(\pi_2\) to prove that \(p \mathrel{\not\delta}_P X_i\).

Conveniently enough, for pairs we can always succeed with just two pieces: union together all the pieces \(X_i\) that can be deemed non-touching by \(\pi_1\), and union together all the pieces that can be deemed non-touching by \(\pi_2\).

To Infinity And Beyond!

Ok, so we came up with an API-based characterization of the product topology for pairs: first come up with a set-based notion of products, and then, given a notion of touching for each coordinate, induce a notion of touching for the entire product space \(P\). What’s wild is that this definition works for arbitrary products, even infinite ones. Furthermore, it helps explain why the definition of infinitary product topologies in terms of open sets has the finiteness constraint!

Instead of subscripts \(1\) and \(2\), consider an infinite number of subscripts \(j \in \mathbb{N}\). So we have an infinite product space, with one projection \(\pi_j\) for each coordinate. We no longer try to use bracket notation \(\braket{\dots}\) to denote these sequences because we don’t have enough ink. So given an infinite collection of topological spaces \(\mathcal{A}_j = (A_j,\mathrel{\delta}_j)\), and a set-theoretic product \(P = \Pi_{j\in \mathbb{N}} A_j\), what is the corresponding product topology \(\mathcal{P} = \Pi_{j\in \mathbb{N}} \mathcal{A}_j\)? Well, it’s the direct analogue of our geometric example: given a product of sets \(\Pi_{j\in \mathbb{N}} A_j\), it’s the coarsest touching relation \(\mathrel{\delta}_P\) over that set—coarsest in the sense of satisfying the most touching relationships, and thereby the fewest non-touching relationships—for which all (infinitely many) \(\pi_j\) are continuous. Technically one must prove that such a touching relation always exists, which it does. And just like before, you get the property that \(x \mathrel{\not\delta}_P X\) if and only if \(X\) can be broken up into a finite number of pieces, each of which can be proven non-touching by some projection \(\pi_j\). The wild thing is that you now have an infinite number of projections \(\pi_j\) to work with, but because of the axioms of touching, you are still constrained to breaking \(X\) into a finite number of pieces for purposes of refutation.
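To gather the pieces in one place, that defining property can be stated symbolically (in this post’s notation) as: \[x \mathrel{\not\delta}_P X \iff \exists\, X_1, \dots, X_n.\; X = X_1 \cup \dots \cup X_n \text{ and for each } i \text{ there is some } j \in \mathbb{N} \text{ with } \pi_j(x) \mathrel{\not\delta}_j \pi_j(X_i).\]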

If you violate that restriction and allow as many pieces as there are projections (i.e. infinitely many), then you end up with fewer touching relationships on \(P\) than your projections \(\pi_j\) would let you get away with, and moreover I think this gives you the dreaded box topology, but why the heck would you do that?!?

As our geometric example suggested earlier, if you have a finite number of projections, then “as many pieces as projections” is both finite, and sufficient for determining non-touching according to the product topology (whose definition allows any finite number of pieces, including more pieces than there are projections). As such, for finite products, the box topology and the product topology coincide!

A Kind Note for Archers

If you are a category theory aficionado and have gotten this far without breaking your reading device, I commend you! Many of the ideas that I describe above, about APIs, interconvertibility, and such, are most compactly and most generally described using the concept of categorical products, or more generally limits, and friends. And indeed, if you already knew about category theory, I could transmit sufficient conditions for you to know what a topological product is in one sentence: it’s any pair of a set and continuous projections that satisfies the universal property of a categorical product. The possibility of this compactness might make the very non-compact mass of words above feel a bit frustrating.

However I don’t think the above sentence could convey, viscerally, all of the different intuitions that I wanted to convey to a general audience: thinking with APIs, being constrained to looking through projections, and the focus on breaking sets into finite pieces. I guess you could say that I like to appeal to categories to make sure I got “the right answer”, but I compile them away to explain concepts in a way that tickles pre-mathematical intuitions and demands fewer prerequisites.

Sunday, 12 January 2025

What's So "Natural" About Natural Transformations?!?

I’m probably the wrong person to listen to when it comes to anything about that extension of abstract algebra we call category theory. I’ve been trying to learn about it for literally decades, and much of that time has involved an uncomfortable feeling that something fundamental about the material is just bouncing off of me. Contrast this with many of my colleagues who find great comfort in conceiving of mathematical concepts in the terminology of this area, and feel unsure whether they understand some concept before it has been rephrased in category-theoretic terms. Unfortunately these explanations (which scroll past me on social media and group conversations and Q&A at seminars, and...and...) don’t seem to be at the right level or angle to scaffold me up further into the topic. I experience this regularly and in some ways I am intensely jealous. Something about this doesn’t click for me: many things don’t click.

On the other hand, this feeling of being lost has been my friend and companion for many years about many topics, some of which I feel far more comfortable with now, so I continue to seek the right metaphor, the right story, that resonates with me and helps me continue to make forward progress, and teach concepts in a way that seems to work for others besides myself. Hence the story you are about to embark on.

I understand enough to know that category theory is useful as an organizing principle, and I must admit that those aspects of it that have breached my blood-brain barrier have been useful for conceptualizing in the abstract. When explaining mathematical concepts, you will not see me drawing commuting diagrams to explain concepts (except when they are literally rudimentary category-theoretic concepts), but you will see me verbally or formulaically appeal to categorical concepts like “universal properties”, albeit in more concrete terms than that. So I persist, years later, in making progress, and perhaps some of my recent sense of progress will help someone else.

Given the above, take the below with the appropriate grains of the appropriate salts!

My Problem

During one of my bouts of “let’s try to learn some category theory again”, I was poring over Eilenberg and Mac Lane’s General Theory Of Natural Equivalences, the classic paper that makes the case for natural equivalence as an important and hitherto invisible concept. Along the way, the paper introduces natural transformations, which are the basis for defining natural equivalences.

The paper starts with an observation about vector spaces and constructions on vector spaces: schematic methods for producing one kind of vector space from another. They observe some peculiar and compelling relationships, not between the spaces themselves, but rather between the constructions themselves.

A discussion of the “simultaneous” or “natural” character of the isomorphism \(L \simeq T(T(L))\) clearly involves a simultaneous consideration of all spaces L and all transformations \(\lambda\) connecting them; this entails a simultaneous consideration of the conjugate spaces \(T(L)\) and the induced transformations \(T(\lambda)\) connecting them [emphasis mine]. 1

The relationship is captured in the form of schematic relationships that appeal to algebraic structure-preserving functions among the constructed spaces. I was unfamiliar with the concepts, but I could at least see how the authors recognized an interesting pattern of relationships among kinds of vector spaces, how in certain circumstances those relationships were highly constrained, forced to satisfy a kind of uniformity across pairs of vector spaces related to one another by this notion of “constructions”. They argue that such constrained relationships should be recognized, and show that fundamental to this recognition is how they can be represented using the language of vector space algebra, albeit clumsily and somewhat implicitly. This relationship goes beyond vector spaces: it arises among many varieties of mathematical concepts. But it seems to be this specific application that motivated the idea of “naturality”, whose formalization called for the definition of functors, which in turn required the definition of categories. 2 As I understand it, capturing naturality was the focus, and the rest were necessary components. I have seen it rumoured that Mac Lane said as much, but have yet to track down the source.

So imagine my frustration that naturality, the most important of these concepts, is the one I understand the least. What is being transformed? How exactly is that transformation “natural”? I can see, by example, that there is a pattern to the relationships that are represented by natural transformations and natural equivalences, and by the continued effectiveness of its use that there is something worth articulating here. But I have never felt a visceral grasp for naturality.

Often this naturality is explained as...well natural (pun intended), and that you need to see many examples to get a feel for it. In programming languages theory, my research area, many folks will say to me that “natural transformations are like parametric polymorphic functions.” Indeed, parametric polymorphic functions (don’t worry if you don’t know what these are) are an important concept in PL, and they can be modeled as natural transformations. So this is an important exemplar. But this is still like telling me “A tree is like a Douglas Fir”: the example is the cue to understand the concept, but in an essential way this feels backwards and unfortunate. Indeed many concepts from nature yield messy taxonomies that can only be understood in terms of history and the web of instances, but we’re talking about pure mathematics here!

Another common effort to help explain natural transformations relates them to the concept of homotopy in algebraic topology. When I sat in on Steve Awodey’s (awesome) category theory class in my post-doc days, he too followed the introduction of the standard definition with the homotopy-looking characterization. Having previously learned about homotopies, I found this definition somewhat more comforting than the original, and working through its implications was quite satisfying (and heavily influenced what follows). I could get a sense of “continuous deformation-ishness” from this characterization, but I still could not see/feel the full force of “simultaneity” hidden in the definition.

I’ve wanted a story that I could find more compelling, and couched in terms of the abstraction itself, rather than PL theory or topology. Alongside good examples, I hope for this to help me better understand and explain naturality, rather than simply get used to it.

I think I finally found such a story, one that tickles my happy place. But it forced me to do some violence to the common presentation of categories, functors, and natural transformations. The result is an equivalent formalization of the concepts that better suits my present mental model. This kind of move is rather audacious, and quite frankly may lead to dead ends or problematic views that ultimately cause more conceptual harm than clarity. On the other hand, another perspective may draw out new or different connections, so I offer it for your chewing enjoyment.

So what follows is my own current personal answer to the title: “what is so natural about natural transformations?”. The following will be quite abstract, because it focuses simply on the abstract definitions and abstract properties of category-theoretic concepts. Perhaps in revision I’ll add examples, because that may help explain how I connect the concrete problems addressed by this abstraction to the abstraction itself. So this won’t make much sense if you don’t already have some background in the elementary basics of categories.

What’s a Category?

Despite the importance of natural transformations, the order in which the concepts are taught tends to follow the dependency order: categories, then functors, then natural transformations. The early concepts are needed to define the later ones. So let’s start with a definition of “category” and go from there. As you’ll see, my definition of a category is probably not yours, but it’s equivalent. Furthermore, at its core, it’s not actually mine: Saunders Mac Lane gives it as a brief aside in his famous book Categories for the Working Mathematician. But you’ll see that I’ve changed the terminology and notation for my own personal story-telling purposes: I find that the nuances of informal words used formally heavily influence my thinking about formalities (weak Sapir-Whorf anyone?), so rethinking the words we use for concepts helps me internalize them. 3 I apologize for the confusion this will inevitably cause to some readers. Just remember: my reason for going into a different definition of categories from the norm is that it leads quite directly to my answer to the question posed up-front. In fact, following this thread is how I came to discover this answer. Ok, buckle up!

Informally speaking, a category is:

  1. a collection of “stuff”;

  2. one binary relation among stuff; and

  3. three operators on stuff.

I will call an item of stuff an element: a “real” category theorist would call them “arrows” or “morphisms”. I’m not using those names because those words don’t resonate with me or my inner story about what categories are, though I can see how they served Eilenberg and Mac Lane’s original examples well. History has demonstrated how much more widely these concepts can be applied, so I’m going for that level of generality.

Our one (binary) relation among elements is identity. Given two elements, or more technically expressions that describe elements, they are either identical, meaning they both describe the same element or not, i.e. \(a = b\) or \(a \neq b\). The significance of identity is that given any proposition \(\Phi\) in the language of category theory, if \(a = b\) then you can replace every (free) instance of \(a\) with \(b\) because the two expressions are simply different names for the same element. This is kind of like aliasing when programming with pointers in an imperative language.

Now for the operations. All three operations take elements as inputs and (maybe) produce elements as output. The first two operations are unary, and total. Associated with every element \(a\) is a unique (in the sense of identity) left unit element \(L(a)\) and a unique right unit element \(R(a)\) (or “left element” and “right element” for short). Since these elements are unique, we represent leftness and rightness with operators rather than relations. For reference, actual category theorists would call \(L(a)\) the domain of \(a\), and \(R(a)\) the codomain of \(a\), and the domain and codomain would not be elements...err...arrows, but instead some other distinct kind of stuff called “objects”. As you’ll see, I’ve banned other stuff from my island. You can insert the dog-with-frisbee meme here: Elements?? No Objects!! ONLY ELEMENTS.

Elements, identity, and left and right elements suffice to introduce some of the properties required of \(R(a)\) and \(L(a)\). First, if element \(a\) is the left element \(L(b)\) or right element \(R(b)\) of some element \(b\), then \(L(a) = a\) and \(R(a) = a\). It’s worth noting that \(b\) could actually be \(a\), so if \(L(a) = a\) then we can deduce \(R(a) = a\) as well. That means, for example, that \[{L(R(L(L(R(L(R(R(a)))))))) = R(a)},\] for any \(a\). These “self-adjacent” elements serve the purpose that objects do in most texts. They are usually called “identity arrows”, but for reasons that unfold below, I call them unit elements.

The third, and last, operation is binary and partial. To every two elements \(a\) and \(b\) such that \(R(a) = L(b)\), that is, for which the right element of \(a\) is identical to the left element of \(b\), we associate a unique composite element \(C(a,b)\). As is typical, we sometimes use an infix notation \(a \mathbin{C}b\), especially when chaining compositions together. But I’ll use both prefix and infix notation interchangeably.

This raises an important (to me) conceptual-terminological point. I have strongly mixed feelings about the use of the word “composite” here, but I don’t see the possibility of a better word. In particular, the composite element \(C(a,b)\) is not composed of \(a\) and \(b\), in the sense that there are not little \(a\)’s and little \(b\)’s hiding inside of \(C(a,b)\). Rather, \(C(a,b)\) is “just” another name for some element of the category, and it’s that name that is composed of names for \(a\) and \(b\), not the element itself. As we’ll see, every element can have many names, and the names are the things that have “compositional” structure.

My view of category theory is that it foregrounds a common and critical notion of compositional surrogative reasoning, in the sense that we reason about mathematical objects (or phenomena) by manipulating (textual) stand-ins for those actual objects. The first sense in which this is true is the difference between expressions in the language of category theory, and the elements of the category (the elements and the relationships represented by operators and relations) themselves. We are playing shadow puppets, using a highly expressive “API”, but don’t confuse the API with its implementation!

Thanks to the composition operator, one can reason about properties of some notion (like the element \(C(a,b)\)) in terms of relationships between other notions (like \(a\) and \(b\)), even if there is no literal component-wise structure in the formalism. This is why I emphasize that \(C(a,b)\) is associated with \(a\) and \(b\), not comprised of them. There may be other couples, triples, or quadruples of elements associated with that same element. Nonetheless, our elements are “atomic” in the sense that they have no discernible internal structure. The only direct analysis of elements is whether they are identical or not. So how does category theory connect to concrete mathematics? Well, that’s a story for another day, but I view categories much like I view relational database indexes: they’re a map, not the territory.

Despite everything I said about my not-real-composition perspective on composition, I still find it useful to think of these properties in real-composition terms. For instance, if we visually think of elements as Lego blocks, laid on one side, then the left (right) element can be viewed as describing the connection interface of the left (right) side of an element. Then we can imagine hooking the right of \(a\) to the left of \(b\) as possible when their descriptions are compatible, leaving the left of \(a\) still visible and available, and the right of \(b\) still visible and available. However, one must keep in mind that our notion of composition is not real hooking together of blocks: you can imagine \(C(a,b)\) as being a distinct larger monolithic Lego block that has no seams, but acts as though you had hooked an \(a\) and \(b\) together. So the composite is an element that looks “as if” it had been constructed by linking an \(a\) and \(b\), though one did not. What a strange Lego set, where you never compose, since you already have all the aggregates. This perspective guides how I interpret the properties of \(C\).

The first two properties completely determine the left and right elements of composites: \(L(C(a,b)) = L(a)\) and \(R(C(a,b)) = R(b)\). Given a bag of elements, each of which has known left and right elements, these properties constrain which associated composites are plausible.

We have more relationships that connect left and right elements to composites. The names left and right come from the relationship between these elements and composition, partly because of the above property, but also because of the following ones: \(C(L(a),a) = a\) and \(C(a,R(a)) = a\). So to use terms from algebra \(L(a)\) is a left unit of \(a\) with respect to \(C\), and \(R(a)\) is a right unit of \(a\) with respect to composition. These properties are what motivates the name unit elements because of their status as left or right units of some other element(s) with respect to composition. This further emphasizes the sense in which the element \(L(a)\) describes the left interface of \(a\) (and does nothing more), and \(R(a)\) describes the right interface of \(a\) and nothing more. My image is of an infinitesimally thin Lego block, so thin that its left interface and its right interface can’t help but be described by the same element...namely itself! And hooking it onto the left of something for which it is a left unit makes no perceptible difference, and same for the right. Stepping back for a moment, what fascinates me here is that some elements, namely the unit elements, can be used as both subjects of composition as well as descriptors for all and only those pairs of elements that yield well-formed compositions. I find this parsimony and economy of concepts viscerally satisfying, at least in the abstract.

Finally, the associations among elements that are described by composition carry a strong coherence property, which can be described succinctly, but I will spell it out in a way that better matches my thinking. Suppose we have three elements, \(a\), \(b\), and \(c\), and suppose that \(R(a) = L(b)\) and also \(R(b) = L(c)\). We assume nothing about \(L(a)\) or \(R(c)\) except that such elements must exist. Now, the above two equalities imply the existence of some composite element \(d = C(a,b)\) associated with \(a\) and \(b\), and some other composite element \(e = C(b,c)\) associated with \(b\) and \(c\). Now, because of the earlier composition properties, \(L(e) = L(C(b,c)) = L(b)\), so there must be an element \(f = C(a,e)\), and since \(R(d) = R(C(a,b)) = R(b) = L(c)\), there must be an element \(g = C(d,c)\). The property we demand is that \(f = g\): these two distinct orders of composition yield identical associated elements. This property implies that given any sequence of elements \((a, b, c, \dots, z)\) such that the right element of one is identical to the left element of the subsequent element, then sequentially composing elements according to any strategy of compositions (e.g., first-to-last, last-to-first, middle-out, random) yields a unique associated element \(a \mathbin{C}b \mathbin{C}\cdots \mathbin{C}z\). We could concisely say that composition is “associative”: \(C(C(a,b),c) = C(a,C(b,c))\), but I like to draw this out and dwell on it. It’s a strong coherence property with respect to identity.

All of the above makes a category! Those who are used to category theory have surely noticed a few peculiarities of this definition. First, there’s the different naming to spark different intuitions. Terms like “arrows” or “morphisms” have a connotation of going from one place to another, or of performing a transformation, whereas “elements” have a connotation of simply “being” and having no internal structure. Elements don’t “do” anything, or contain anything: they merely observe external relationships with other elements. Second, we’ve gotten rid of “objects” as classifiers of compositions. Unit elements can do that just fine, and lead to a very different feel.
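To ground this elements-only “API” in something concrete, here is a small Python sketch of one possible implementation, where the elements are functions between finite sets. The encoding (a triple of domain, graph, and codomain) is entirely my own choice, made for illustration; it is one implementation of the interface, not the interface itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Elem:
    dom: frozenset   # determines the left unit L
    graph: tuple     # sorted tuple of (input, output) pairs
    cod: frozenset   # determines the right unit R

def unit(s: frozenset) -> Elem:
    # The unit element for a finite set: the identity function on it.
    return Elem(s, tuple((x, x) for x in sorted(s)), s)

def L(a: Elem) -> Elem:
    return unit(a.dom)

def R(a: Elem) -> Elem:
    return unit(a.cod)

def C(a: Elem, b: Elem) -> Elem:
    # Composition is defined only when R(a) = L(b), i.e. a's codomain is b's domain.
    assert R(a) == L(b)
    fa, fb = dict(a.graph), dict(b.graph)
    return Elem(a.dom, tuple((x, fb[fa[x]]) for x in sorted(a.dom)), b.cod)

# The unit laws from above hold for this encoding:
A, B = frozenset({1, 2, 3}), frozenset({10, 20})
f = Elem(A, ((1, 10), (2, 10), (3, 20)), B)
assert C(L(f), f) == f and C(f, R(f)) == f
```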

As mentioned, Mac Lane notes this definition, but goes with the traditional one: 4

Since the objects of a metacategory correspond exactly to its identity arrows, it is technically possible to dispense altogether with the objects and deal only with the arrows (CWM, p. 9).

Indeed objects give a nice connection to sets and arrows to functions in categories associated with set-theoretic structures. But for understanding category theory in a way that is removed from its set-facing origins, and the mental biases that it brings (especially for an unabashed fan of set theory) I like to suppress those associations, and I feel that objects are a cognitive distractor: what fundamentally matters is the relationships among elements, in light of the structure mandated by the properties of left and right unit elements and composition.

In my mind, category theory is fundamentally about abstract interfaces among abstract elements, and one ought not worry about the idea, or even existence, of truly composite structures (the implementations). Focus on external relationships as the essence of categorical structure, rather than the internal composition of entities via construction of “objects”: that’s what set theory is for!

What’s a Functor?

We have now re-imagined categories as bags of inter-related elements with fundamental composition properties, described entirely in terms of elements. Let’s move on to our first kind of relationship between categories. Suppose we have two categories, \(\mathcal{C}\) and \(\mathcal{D}\). Each is a bag of elements, each has a notion of identity among elements within the category: there is simply no notion of identity among elements in distinct categories! As far as we’re concerned, the question doesn’t even make sense! Instead, we develop relationships among elements of distinct categories by adding more structure to our repertoire. For precision, we index the operations and relations of each category to emphasize that these relations and operations are well-defined only within the relevant category. For example, \(=_{\mathcal{C}}\) is the identity relation of category \(\mathcal{C}\). Relationships among categories are described by additional structure that is constrained by the details of the category-specific relationships.

A functor \(F\) describes a binary association between the elements of two categories. So to every element \(c\) of category \(\mathcal{C}\), functor \(F\) associates an element \(d\) of category \(\mathcal{D}\). Thus you can say that \(F\) has a total functional relation structure...it’s a total function, but only in the sense that a unique \(d\) is associated with each and every \(c\). I do not think of this as a procedure that operates on things, just a relationship.

We demand more from our functors than totality or functionality: a functor must also engage with the structure that the two categories possess: \(=_{\mathcal{C}}, L_{\mathcal{C}}, R_{\mathcal{C}},\) and \(C_{\mathcal{C}}\) for \(\mathcal{C}\), and \(=_{\mathcal{D}}, L_{\mathcal{D}}, R_{\mathcal{D}}\), and \(C_{\mathcal{D}}\) for \(\mathcal{D}\). First, \(F(R_{\mathcal{C}}(c)) =_{\mathcal{D}} R_{\mathcal{D}}(F(c))\) and \(F(L_{\mathcal{C}}(c)) =_{\mathcal{D}} L_{\mathcal{D}}(F(c)):\) left-elements and right-elements are preserved. Thankfully, because of the inherent properties of left and right elements, we also get that unit elements of \(\mathcal{C}\) must map to unit elements of \(\mathcal{D}\), since \(R_{\mathcal{C}}(c) =_{\mathcal{C}} c\) implies \(R_{\mathcal{D}}(F(c)) =_{\mathcal{D}} F(c)\). The second (and last!) property is that if \(R_{\mathcal{C}}(c_1) =_{\mathcal{C}} L_{\mathcal{C}}(c_2)\), then \(F(C_{\mathcal{C}}(c_1,c_2)) =_{\mathcal{D}} C_{\mathcal{D}}(F(c_1),F(c_2))\): associated composite elements are preserved.
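Continuing the illustrative Python sketch from the categories section, a functor is nothing more than a mapping on elements that preserves these relationships. Here is a hypothetical checker (the names and signature are mine) for the two laws over a finite sample of elements, assuming the L, R, and C operations of the source and target categories are passed in as functions.

```python
def preserves_structure(F, elems, src, tgt) -> bool:
    # src and tgt are (L, R, C) triples of the source and target categories.
    L1, R1, C1 = src
    L2, R2, C2 = tgt
    # Law 1: left and right elements are preserved.
    for a in elems:
        if F(L1(a)) != L2(F(a)) or F(R1(a)) != R2(F(a)):
            return False
    # Law 2: composites of composable pairs are preserved.
    for a in elems:
        for b in elems:
            if R1(a) == L1(b) and F(C1(a, b)) != C2(F(a), F(b)):
                return False
    return True
```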

From a purely technical standpoint, I much prefer this definition of functors to the standard definition, which requires thinking about mappings of two different kinds of things, and some unpleasant interactions and notational oddities that ensue. If you know, you know!

What’s a Natural Transformation, and What’s So “Natural” About It?

Ok, we’ve got categories, and we’ve got functors: it’s time for the main event! So what is a natural transformation? As with the last sections, I am simply going to give an abstract definition, without concern about its utility, only that it be equivalent to the standard one. Here’s where things get different.

Suppose you have two categories, \(\mathcal{C}\) and \(\mathcal{D}\), and you have two functors, \(F\) and \(G\), both of which describe relationships between \(\mathcal{C}\) and \(\mathcal{D}\), in that they describe structure-preserving mappings from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\). These are the core constituents of a natural transformation.

By my definition, a natural transformation is a fundamentally different kind of mapping from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\), constrained by the two functors. This sounds rather different from the usual presentation as being a relationship between functors, but only in the sense that the importance of the functor relationship is a bit more implicit. Nonetheless the existence of a natural transformation implies something about the two relevant functors, and the two relevant categories.

A natural transformation \(\eta\) subject to \(F\) on the left and \(G\) on the right is a total mapping from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\), much like each of \(F\) and \(G\) is, but with different properties. First, if \(c\) is an element of \(\mathcal{C}\) then \(L_{\mathcal{D}}(\eta(c)) = F(L_{\mathcal{C}}(c))\) and \(R_{\mathcal{D}}(\eta(c)) = G(R_{\mathcal{C}}(c))\): the two functors determine the left and right “interfaces” of the mapping for any given element.

Now here’s the punchline. Suppose \(c\) is an element of \(\mathcal{C}\), and that it can be described by the sequential composition of \(n\) elements: \[c = c_1 \mathbin{\mathbin{C}_{\mathcal{C}}} \cdots \mathbin{\mathbin{C}_{\mathcal{C}}} c_{i-1} \mathbin{\mathbin{C}_{\mathcal{C}}} c_i \mathbin{\mathbin{C}_{\mathcal{C}}} c_{i+1} \mathbin{\mathbin{C}_{\mathcal{C}}} \cdots \mathbin{\mathbin{C}_{\mathcal{C}}} c_n.\] Then: \[\eta(c) = F(c_1) \mathbin{\mathbin{C}_{\mathcal{D}}} \cdots \mathbin{\mathbin{C}_{\mathcal{D}}} F(c_{i-1}) \mathbin{\mathbin{C}_{\mathcal{D}}} \eta(c_i) \mathbin{\mathbin{C}_{\mathcal{D}}} G(c_{i+1}) \mathbin{\mathbin{C}_{\mathcal{D}}} \cdots \mathbin{\mathbin{C}_{\mathcal{D}}} G(c_n),\] for any element \(c_i\) in the original sequence. This notation means that we can describe \(\eta(c)\) by mapping some element \(c_i\) of the sequence using \(\eta\) (“recursively”), mapping all elements to the left of it in the sequence (if any) using \(F\), mapping all elements to the right of it in the sequence (if any) using \(G\), and composing the results. The treatment of left and right elements by \(F\), \(\eta\), and \(G\) respectively guarantees that such a composition exists, so it’s the identicality that is the key. This property must hold for all possible composition-based descriptions of all elements of category \(\mathcal{C}\).

This property deserves some pondering to consider how wild and constraining it is! Consider our element \(c_i\). If \(c_i\) is described by another sequence of compositions in \(\mathcal{C}\), then \(\eta(c_i)\) in turn can be described by \(F\)-ing some elements of the sequence, \(\eta\)-ing one element of the sequence, and \(G\)-ing the rest of the elements. And we can replay this again. And again. And again! That is to say, the interaction that \(\eta\)’s mapping has with composition and the two functors is of a “fractal” or “self-similar” nature: we can keep “zooming in” to composition sequences and re-describing the aggregate association of \(\eta\) using a single application of \(\eta\) to “cross” from \(F\)’s actions to \(G\)’s. So this property touches every \(L_{\mathcal{C}}\), \(R_{\mathcal{C}}\), and \(C_{\mathcal{C}}\) relationship of \(\mathcal{C}\), and it both constrains and depends on the structure of \(\eta\)’s image in \(\mathcal{D}\). These properties are global and pervasive and simultaneous.

The naturality of natural transformations is often described via a form of “simultaneous local coherence” properties stated only in terms of the analogue of unit elements (so-called “objects”). With some reasoning, we can show that a natural transformation of the kind I describe above can be fully determined by its treatment of unit elements, using a few “local” relationships about rotating which unit element gets mapped by \(\eta\). That characterization is quite compact, which gives a sense of parsimony. But to me it obscures the global implications of a natural transformation and how global the constraints fundamentally are, how “simultaneous” and “fractal” they are. That description is equivalent to this one, but I think it understates the significance of the simultaneity, by focusing the mind on unit elements, and providing a necessary and sufficient condition that doesn’t quite convey how all elements and their properties are brought to bear, in lock-step, on the concept.

Naturality touches every element of the source category, and every description of every element of the source category. I think the path equation in particular underscores the extent to which a natural transformation is like a self-supporting arch: all of the pieces must fit together like clockwork to support the property. There can exist multiple natural transformations subject to two functors, but each one manifests a category-wide equilibrium that, at least intuitively speaking, would be difficult to imagine perturbing. Arches are beautiful and fascinating naturally-occurring phenomena. Their balance defies our intuitions and inspires humans to recreate that beauty in their own inventions. The same seems to be true for natural transformations: their structure was at first discovered, rather than explicitly developed, but now they are devised with full awareness by informed minds.

A reasonable question to ask: is this global formulation of natural transformations useful? To be honest I don’t know. The original motivations for natural transformations (natural equivalence) were fundamentally based on abstracting constructions on sets, and this same facet manifests in discussions of parametric polymorphic functions. Perhaps parametric polymorphic functions can be reconceived as a different form of “liftings-plus-embeddings” applied to arbitrary functions, subject to two (functorial) type constructors, rather than as a “generic function” applied to individual types (i.e., sets of values, independent of all possible mappings from them). This conception might make parametric reasoning properties more explicit. But I haven’t given this much thought (yet, I hope). I can only speculate at this point.

Thanks to Yanze Li for typo fixes!


  1. In this quote, “transformation” refers to structure-preserving mappings (linear transformations) between vector spaces, not natural transformations.↩︎

  2. Well actually, Eilenberg and Mac Lane have a 1942 article that identifies the concept of naturality, where they focus on group theory. My bad!↩︎

  3. Lucky for me I’m not alone. Here are the statisticians Bruno de Finetti and L. J. Savage on this matter: “Every progress in scientific thought involves struggle against the distortion of outlook generated by the habits of ordinary language, the deficiencies and captiousness of which must be denounced, combatted, and overcome, case by case.”↩︎

  4. Freyd and Scedrov also begin their book Categories, Allegories with an analogous definition of categories, though on a cursory look, they seem to revert to a standard characterization shortly thereafter (thanks to Andrej Bauer for that information!).↩︎

Monday, 1 July 2024

A Proof of Proof by Infinite Descent

WARNING: the following contains a whole lot of pedantry about proving theorems at a level of detail such that you could likely convince a computer proof assistant of their correctness. I teach a course where students learn to write such proofs without computer assistance, because doing so can profoundly affect how one perceives the possibilities for formal definition and proof. 1 Sometimes the default reasoning principles provided by a proof assistant can restrict one’s perspective of what reasoning principles are acceptable, how and why they might be judged acceptable, and how to devise and validate new ones.

More relevant to the present post is that even the literature and pedagogy of writing paper proofs is rife with defaults. I spend a lot of time teaching about proof by induction, and I find that my students, most of whom have learned something about proving stuff in the past, often arrive constrained by a very regimented approach to induction, involving mandatory “base cases” and “induction cases” and these nebulous “induction hypotheses” which seem to descend from the sky when invoked by arcane ceremony. I have some strong feelings about this, mostly due to my own struggles with gaining comfort with induction as a student, so I try to help students evade those struggles.

Taken at face value, this post is about some other proof principle, proof by infinite descent, but at its heart it’s really about induction. Ideally by the end of this post, I will have challenged, but also shed light on, the nature of inductive proof cases and induction hypotheses. As a bonus, you might see how induction can be used to explain another proof principle that intuitively seems valid, but for which it may not be obvious how to justify it formally.

Two Proofs That \(\sqrt{2}\) is Not Rational

In an excellent blog post about the difference between proof of negation and proof by contradiction, Andrej Bauer gives as an example a proof that \(\sqrt{2}\) is not a rational number. The article is worth a read, but I would note in particular that in a technical sense, his proof is not a proof that it is an irrational number, just that it is not a rational number. I imagine that one could also prove that \(\sqrt{-1}\) is not rational, but that had better not count as a proof that it is irrational! However, if we also prove that \(\sqrt{2}\) is a real number, which could be done using Dedekind cuts for instance, then we can deduce that \(\sqrt{2}\) is indeed irrational. And to no one’s surprise, we can prove that \(\sqrt{-1}\) is not a real number.

Theorem: \(\sqrt{2}\) is not rational.
Proof. Suppose \(\sqrt{2}\) were equal to a fraction \(\frac{a}{b}\) with \(a\) and \(b\) relatively prime. Then we would get \(a^2 = 2b^2\), hence \(a^2\) is even and so is \(a\). Write \(a = 2c\) and plug it back in to get \(2c^2 = b^2\), from which we conclude that \(b\) is even as well. This is a contradiction since \(a\) and \(b\) were assumed to be relatively prime. QED.

This is a proof of the negation of some proposition, i.e., that assuming the truth of the proposition implies a contradiction. Let’s consider some of its details.

First, the proof assumes that \(\sqrt{2}\) is equal to a fraction \(\frac{a}{b}\), which implicitly assumes that \(a\) and \(b\) are integers and that \(b\) is not 0. We could limit the two numbers to be naturals, by which I mean non-negative integers, applying the convention that \(\sqrt{2}\) denotes specifically the positive root. But since there exists a negative number that, when squared, equals 2, and since generalizing the claim plays into my overall story, let’s make no commitment to whether \(a\) and \(b\) are positive or negative.

Next, the proof uses the fact that every rational number can be represented by a fraction \(\frac{a}{b}\) such that \(a\) and \(b\) are relatively prime, in other words that no positive integer other than 1 divides both of them evenly, which is a precise way of saying that \(\frac{a}{b}\) is in reduced form. This is a fact that needs to be proved separately and prior to this proof. We could instead integrate this reasoning directly into the proof: omit the relative primality assumption, and then let \(a' = a/\gcd(a,b)\) and \(b' = b/\gcd(a,b)\), where \(\gcd(a,b)\) is the greatest common divisor of the two. It then follows from some proof effort that \(a'\) and \(b'\) are relatively prime and \(\frac{a'}{b'} = \frac{a}{b}\), thus completing the general proof. However, we’re still using an auxiliary fact that can be proven by induction: that \(a\) and \(b\) have a unique greatest common divisor. More on that below.
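The reduction step itself is mechanical, and a small Python sketch may make it concrete (the function name reduce_fraction is mine; this only illustrates the arithmetic, not the proof obligations):

    from math import gcd

    def reduce_fraction(a, b):
        # Given integers a and b with b != 0, return (a', b') with a' and b'
        # relatively prime and a'/b' equal to a/b.
        g = gcd(a, b)   # math.gcd ignores signs and is positive whenever b != 0
        return a // g, b // g

    assert reduce_fraction(6, 4) == (3, 2)
    assert reduce_fraction(-6, 4) == (-3, 2)

The “proof effort” alluded to above corresponds to showing that the pair this function returns really is relatively prime and denotes the same rational.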

From these assumptions and facts, the proof deduces that both \(a\) and \(b\) are even, which means that \(2\) divides both of them, which contradicts the assumption that among positive integers only 1 can do this. So assuming that \(\sqrt{2}\) is rational leads to contradiction.

Now, to drive home the topic of this post, here’s a more annoying proof of the same fact:

Proof. Suppose \(\sqrt{2}\) were equal to a fraction \(\frac{a}{b}\). Then \(a\) can’t be \(0\) because \(\left(\frac{0}{b}\right)^2 = 0 \neq 2\). Then we would get \(a^2 = 2b^2\), hence \(a^2\) is even and so is \(a\). Write \(a = 2c\) and plug it back in to get \(2c^2 = b^2\), from which we conclude that \(b\) is even as well. Write \(b = 2d\) and plug it back in to get \(c^2 = 2d^2\). This yields a contradiction because we could divide our integers by \(2\) forever, which is impossible. QED.

On the upside, this proof gets to omit the assumption that \(a\) and \(b\) are relatively prime, which in turn means we don’t need to consult Wikipedia to remind ourselves what the heck “relatively prime” means anyway. But on the other hand...what just happened?!? Intuitively this makes sense, but is this legit?

While Andrej’s proof is a direct proof of negation, this annoying one is an example of proof by infinite descent, which according to Wikipedia is a particular kind of proof by contradiction, and relies on something called the well-ordering principle. By the end of this post, I hope to convince you that: a) it is not a kind of proof by contradiction: rather it’s a technique for proving negation, so this claim deserves some finger-wagging from Andrej; b) it does not rely on the well-ordering principle, and a good thing too, because if it did, then the annoying proof would be broken, but it’s not.

We should be able to ascertain from a formal statement of the principle that it is simply a proof of negation. Andrej’s post elaborates on the source of confusion. And what is the problem with the claim that proof by infinite descent relies on well-ordering, whatever that is? To explain, let’s first pinpoint a more fundamental and general property, called well-foundedness, and a reasoning principle, well-founded induction, upon which the principle of infinite descent can be based and even generalized, and use those to explain the confusion. Along the way, we’ll produce a different explanation of proof by infinite descent in terms of induction: an explanation that suits my induction-happy brain better. Also, we’ll get the opportunity to investigate induction itself: in particular, I want to demonstrate that the common “base case/induction case” recipe for induction is a useful but limiting view of how induction can be applied.

Well-foundedness and Well-ordering

As you might guess from the above two proofs, they depend on some notion of “getting smaller”, and in the case of natural numbers the default notion is the less-than relation \(<\), but less-than-ness is just an example of a more general phenomenon. So first, let’s isolate the property of \(<\) on natural numbers that really powers induction. A binary relation \(\sqsubset\) is called well-founded if all leftward chains of the form \(\dots \sqsubset x_3 \sqsubset x_2 \sqsubset x_1\) must be finite in length, so there cannot exist any infinite descending chains. The \(<\) relation on natural numbers has this property: if you start from any \(n\) and keep producing smaller numbers, you will always have to stop, at \(0\) at the latest or earlier by choice. On the other hand, \(<\) extended to integers does not have this property because you can go forever down the negative numbers, e.g., \(\dots -16 < -8 < -4 < -2\). Luckily, \(<\) isn’t the only binary relation on integers! Here’s a different one that is well-founded: if \(z_1,z_2 \in \mathbb{Z}\), say \(z_1 \mathbin{\mathbin{\lvert<\rvert}}z_2\) if and only if \(\lvert z_1\rvert < \lvert z_2\rvert\), where \(\lvert z\rvert\) denotes the absolute value of \(z\) and \(<\) is restricted to natural numbers. Then \(2 \mathbin{\mathbin{\lvert<\rvert}}3\) but also \(-2 \mathbin{\mathbin{\lvert<\rvert}}3\), \(2 \mathbin{\mathbin{\lvert<\rvert}}{-3}\), and \(-2 \mathbin{\mathbin{\lvert<\rvert}}{-3}\): the sign of the integer is ignored! Notice, though, that \(2\) and \(-2\) are not related by \(\mathbin{\mathbin{\lvert<\rvert}}\) in either direction. Since every leftward chain of the \(\mathbin{\mathbin{\lvert<\rvert}}\) relation works its way toward \(0\), the center of the number line, we can see that all such chains must be finite.
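To make \(\mathbin{\lvert<\rvert}\) concrete, here is a small Python sketch (the names abs_lt and descend_forever are mine, purely for illustration); the loop at the end shows that any attempt to descend along \(\mathbin{\lvert<\rvert}\) halts:

    import random

    def abs_lt(z1, z2):
        # z1 |<| z2 iff |z1| < |z2|; the signs are ignored.
        return abs(z1) < abs(z2)

    assert abs_lt(2, 3) and abs_lt(-2, 3) and abs_lt(2, -3) and abs_lt(-2, -3)
    # 2 and -2 are distinct yet unrelated in either direction.
    assert not abs_lt(2, -2) and not abs_lt(-2, 2)

    def descend_forever(z):
        # Repeatedly pick an arbitrary |<|-smaller integer. Every run halts,
        # because |z| strictly decreases and nothing is |<|-below 0.
        chain = [z]
        while abs(z) > 0:
            z = random.choice(range(-abs(z) + 1, abs(z)))
            chain.append(z)
        return chain

    print(descend_forever(-8))   # e.g. [-8, 5, -2, 1, 0]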

Since “well-ordering” and “well-founded” both start with “well-”, you might suspect that there’s a relationship between them, and you’d be right. Well-orderedness is described in a variety of ways in the literature, but for our purposes the following is the most relevant formulation: a well-founded relation \(\sqsubset\) is a well-ordering if and only if it relates every distinct pair of elements, i.e. \(x_1 \neq x_2\) implies either \(x_1 \sqsubset x_2\) or \(x_2 \sqsubset x_1\). So based on what we’ve observed, \(<\) on the natural numbers is a well-ordering, but \(\mathbin{\mathbin{\lvert<\rvert}}\) is not (remember \(2\) and \(-2\)?).

To tie up a loose terminological end, consider the “-ordering” and “-founded” suffixes. Well-orderings must be transitive: if \(x_1 \sqsubset x_2\) and \(x_2 \sqsubset x_3\) then \(x_1 \sqsubset x_3\). The consequent must hold because if it didn’t, then either \(x_1 = x_3\), in which case \(x_1 \sqsubset x_2 \sqsubset x_1\) is a two-element cycle, or totality would force \(x_3 \sqsubset x_1\), which introduces a three-element cycle \(\dots \sqsubset x_3 \sqsubset x_1 \sqsubset x_2 \sqsubset x_3\); either cycle yields an infinite descending chain and breaks well-foundedness. Then, given some well-ordering \(\sqsubset\), the relation \(\sqsubseteq\), obtained by extending \(\sqsubset\) to be reflexive, is a total order, i.e., a partial order (reflexive, transitive, antisymmetric relation) that is also total. On the other hand, a well-founded relation does not have to be transitive, let alone total, so extending one in this manner is not even guaranteed to get you a partial order, let alone a total one. Hence the suffix “-founded”.

Proof by Infinite Descent, Generalized

Now that we have some grasp of well-foundedness, and why well-orderedness is a red herring, let’s use it to state the Principle of Proof by Infinite Descent precisely and generally.

Theorem: Principle of Infinite Descent
Let \(X\) be a set and \(\sqsubset\) a well-founded relation on \(X\). Furthermore let \(\Phi(x)\) be some property of elements \(x\) of \(X\). Then if for every \(x\) in \(X\), \(\Phi(x)\) implies \(\Phi(y)\) for some \(y \sqsubset x\), it follows that \(\Phi(x)\) does not hold for any \(x\) in \(X\).

Stated in formal logical notation: \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\]

The premise of the principle formalizes the idea that “no matter where you are (whichever \(x\)), you can still make a ‘descending’ step,” and the conclusion is that “well, if so, then your property can’t apply to any \(x\): it leads to contradiction.” Arguably a better name for the principle would be Proof by No Infinite Descent (thanks to Andrej for this observation!).

Our proof that \(\sqrt{2}\) is not rational implicitly applies this principle: Let \(X\) be the integers \(\mathbb{Z}\); for integers \(a,b\), let \(a \sqsubset b\) be \(a \mathbin{\mathbin{\lvert<\rvert}}b\); and let the property \(\Phi(a)\) be “\(a^2 = 2b^2\) for some integer \(b\)” (formally, \(\exists b\in \mathbb{Z}.\,a^2 = 2b^2\)). Assuming \(a \neq 0\) and taking \(b\), the proof uses division by \(2\) to produce a \(c \mathbin{\mathbin{\lvert<\rvert}}a\) for which some \(d\) satisfies the equation. Having satisfied the premises of the principle, we use it to deduce that no \(a\) satisfies the property (though admittedly the prose proof shrouds this use of the principle in squishy “forever is impossible” terminology).
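The descent step of that argument can be written down as a little Python sketch (descend_step is my name for it; it is an illustration, not part of the proof): given a pair witnessing the property with \(a \neq 0\), it produces a \(\mathbin{\lvert<\rvert}\)-smaller witness.

    def descend_step(a, b):
        # Assumes a != 0 and a*a == 2*b*b. Under that assumption a and b are
        # both even, so halving is exact.
        assert a != 0 and a * a == 2 * b * b
        c, d = a // 2, b // 2
        assert c * c == 2 * d * d     # the property persists...
        assert abs(c) < abs(a)        # ...at a strictly |<|-smaller first component
        return c, d

If any integer pair satisfied \(a^2 = 2b^2\) with \(a \neq 0\), iterating descend_step would shrink \(\lvert a\rvert\) forever, which is exactly the infinite descent the principle rules out; so no such pair exists, and the function can never actually be called with its precondition satisfied.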

The intuition for the principle is pretty straightforward: if the premise holds, then starting from any element that satisfies the property we could build an infinite descending \(\sqsubset\)-chain of elements that satisfy it. But there are no such infinite chains, because well-foundedness says that all descending chains must be finite, so a contradiction must follow.

But let’s buttress this intuition with rigour. We can prove this principle as a consequence of:

The Principle of Well-founded Induction: Let \(X\) be a set and \(\sqsubset\) a well-founded relation on \(X\). Furthermore let \(\Psi(x)\) be some property of elements \(x\) of \(X\). Then if for every \(x\) in \(X\), \(\Psi(y)\) holding for every \(y \sqsubset x\) suffices to prove \(\Psi(x)\), it follows that \(\Psi(x)\) holds for every \(x\) in \(X\).

Stated in formal logical notation: \[(\forall x\in X.\, (\forall y \in X.\, y \sqsubset x \implies \Psi(y)) \implies \Psi(x)) \implies \forall x\in X.\, \Psi(x).\]

This principle is quite abstract, since it’s stated in terms of an arbitrary well-founded relation. To clarify, let’s instantiate it with particular well-founded relations that lead to principles with which we are familiar. Since we already observed that \(<\) is a well-founded relation on natural numbers (because all well-orderings are also well-founded), we can specialize well-founded induction to get what is often called strong mathematical induction or course-of-values induction on natural numbers: \[(\forall n\in \mathbb{N}.\, (\forall m \in \mathbb{N}.\, m < n \implies \Psi(m)) \implies \Psi(n)) \implies \forall n\in \mathbb{N}.\, \Psi(n).\]

The general structure of a proof by course-of-values induction is to assume some natural number \(n\), then prove that assuming that \(\Psi(m)\) holds for every number less than \(n\) suffices to prove \(\Psi(n)\). Achieving that suffices to prove that \(\Psi\) holds for every natural number \(n\). Observe that the assumption “\(\Psi(m)\) holds for every number less than \(n\)” is what gets called an “induction hypothesis”. In the common style of prose proofs, the “induction hypothesis” often gets invoked in the middle of an argument, and for a specific value, rather than explicitly assumed at the beginning of the argument in full generality. To me this always seemed like some magical thing that got pulled out of the air in the middle of the proof, leaving me unsure about what kind of thing induction even was. But in fact the induction hypothesis is just an assumption that you make as you attempt to prove \(\Psi(n)\), and its justification is hidden in the proof of the principle of well-founded induction. Surprise: you can prove induction principles correct, and devise and prove correct new ones: they need not be taken as axioms!

Though students have often seen strong mathematical induction, they are more familiar with plain ole’ mathematical induction, where the “induction hypothesis” is to assume that a property holds for the natural number one less than the current one. We can get this by using the well-founded relation \(n_1 <_1 n_2\), which holds if and only if \(n_2 = n_1 + 1\). This relation is neither total nor an order, since it only ever relates abutting natural numbers. But, curiously enough, we can still plug it into the Principle of Proof by Infinite Descent if we want, and get something meaningful, albeit more restrictive than what we get using \(<\).
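To make the specialization concrete, plugging \(<_1\) in for \(\sqsubset\) in the Principle of Well-founded Induction gives: \[(\forall n\in \mathbb{N}.\, (\forall m \in \mathbb{N}.\, n = m + 1 \implies \Psi(m)) \implies \Psi(n)) \implies \forall n\in \mathbb{N}.\, \Psi(n).\] When \(n = 0\) there is no \(m\) with \(n = m + 1\), so the inner premise holds vacuously and \(\Psi(0)\) must be proven outright; when \(n = m + 1\), the inner premise amounts to assuming \(\Psi(m)\). That is the familiar base-case/inductive-case shape, derived rather than decreed.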

Advertisement: I first learned about this perspective on induction from Glynn Winskel’s excellent textbook, The Formal Semantics of Programming Languages. I highly recommend it.

What even is a “base case”?

A common structure used in proofs that proceed by either mathematical induction or course-of-values induction, or at least in teaching such proofs, is to assume \(n\), assume the induction hypothesis, and then proceed by cases on whether \(n = 0\) or whether \(n > 0\). The first case gets called the “base case” and is proven outright, and the second case gets called the “inductive case” which exploits “the induction hypothesis” to complete that case. When using induction more generally, as is common in PL theory, this notion of base cases and induction cases gets emphasized as well. However, it’s worth looking back at the principle of well-founded induction and observing that nothing about that theorem forces a particular case analysis, or restricts the availability of the premise that gets called “the induction hypothesis”. The funny thing is that the “base case” also has a perfectly fine induction hypothesis, but typically it is useless because most of the time the proof in question is not equipped to conjure an \(m < n\) with which to exploit it.

This template for applying the proof by course-of-values induction is so common that often the split into base case vs. inductive case, where the base case does not appeal to induction, gets baked right into the explanation of the proof principle, as if it were part of the approach. I find this categorical specialization to be unfortunate for two reasons.

The first reason is that sometimes a proof needs a case split, but this particular case split is the wrong one: in that it does not help push the proof through. My favourite demonstration of this phenomenon is a proof that any two positive natural numbers have a greatest common divisor. This proof could be extended to integers, and thus apply to Andrej’s proof as discussed above, but doing so would deviate more from my main point than I would like. The proof involves a helper lemma followed by a main theorem.

Lemma. For all positive natural numbers \(a\) and for all positive natural numbers \(b\), if \(a \geq b\) then there exists a positive natural number \(c\) such that \(xc = a\) for some positive natural \(x\) and \(yc = b\) for some positive natural \(y\), and furthermore if \(d\) is a positive natural number that satisfies these properties, then \(c \geq d\).

This lemma can be proven by course-of-values induction. Then the statement of the main theorem has the same structure, but does not assume \(a \geq b\). It can be proven by splitting cases on whether \(a \geq b\) or \(b \geq a\) and then applying the lemma with the two numbers appropriately ordered (either order will work if the numbers are equal).

I’m only going to give a sketch of the lemma’s proof, since it suffices to make the relevant point. It proceeds by course-of-values induction on \(a\), where the relevant property of \(a\) is the substatement “for all positive natural numbers \(b\) ...” (now you know why I explicitly separated the two variable quantifications in the prose statement of the lemma). After assuming the induction hypothesis, assuming \(b\), and assuming \(a \geq b\), we proceed by cases on whether \(a = b\) or \(a > b\). If \(a = b\) then we can show that the greatest common divisor is the number itself, so \(a\). If \(a > b\), then we can show that the greatest common divisor of \(a-b\) and \(b\) is also the greatest common divisor of \(a\) and \(b\). However, to appropriately apply the induction hypothesis, we must first determine which of \(a-b\) and \(b\) is larger: since \(b\) is positive, we know that \(a-b < a\), and we previously assumed \(b < a\), so either can play the role of “smaller \(a\)”, but to make sufficient progress with the induction hypothesis, we need “next \(b\)” to be at most “smaller \(a\)”, so we split cases on whether \(a-b < b\) or not, arbitrarily letting the alternate case also handle \(a-b = b\).
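For those who prefer to see the recursion, here is a minimal Python sketch of the computation lurking behind that case analysis (the name gcd_sub is mine; it assumes \(a \geq b \geq 1\), mirroring the lemma’s hypotheses):

    def gcd_sub(a, b):
        # Assumes a >= b >= 1.
        if a == b:
            return a                  # no appeal to an "induction hypothesis"
        # Here a > b, and gcd(a, b) = gcd(a - b, b).
        if a - b < b:
            return gcd_sub(b, a - b)  # reorder so the first argument is >= the second
        return gcd_sub(a - b, b)

    assert gcd_sub(12, 8) == 4

Each recursive call’s first argument is strictly smaller than \(a\), which is exactly what licenses the appeal to the induction hypothesis in the proof sketch.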

The interesting part of the above proof is that two cases are considered, and only one of them appeals to the induction hypothesis. But the case that does not is not an \(a=0\) case as per the default template. In fact \(a = 0\) could never apply since we assumed that \(a\) is positive.

For exactness’ sake, it’s worth pointing out that one can perfectly well devise a principle of course-of-values induction that applies to just the positive integers, such that \(a = 1\) would be viewed as the “base case”. If we were to separate out that case, we could deduce that \(b = 1\), because it must be a positive integer and it must be less than or equal to \(a\). So in practice, the \(a = 1\) case can get handled using the exact same reasoning as the other \(a = b\) cases, which happen to circumscribe those situations where no appeal to an “induction hypothesis” is required.

So the lesson to be drawn from my first gripe is that when performing induction, if we reconceive of “base cases” to mean those cases that do not appeal to an induction hypothesis, then such cases need not coincide with the cases that involve the minimal elements of the well-founded relation (or the unique minimum element of a well-ordered relation). For many proofs the default heuristic helps, but that default is not a universal law of induction.

No base case? No problem!

To reiterate, my first problem with how course-of-values induction is taught is that the default base-case/inductive-case split may not be the most fruitful one. The second problem I have is that you may not even need such a split. Usually it is assumed that there must be some “base case” that does not get to exploit the induction hypothesis, for how could the hypothesis ever apply there? But in some proofs you can always exploit the induction hypothesis. This is exactly the case in the following proof of the Principle of Infinite Descent, which uses the Principle of Well-founded Induction.

Theorem: Principle of Infinite Descent \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\]

Proof. Suppose \((\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y))\). Now we prove \(\forall x\in X.\, \neg\Phi(x)\) by well-founded induction on \(x\in X\). We will use as our predicate \(\Psi(x) := \neg\Phi(x)\), which is synonymous with \(\Phi(x) \implies \bot\), i.e. \(\Phi(x)\) implies contradiction. Suppose \(x \in X\), and suppose \(\forall y \in X.\, y \sqsubset x \implies \neg\Phi(y)\). At this point it suffices to show \(\neg\Phi(x)\), for which it suffices to assume \(\Phi(x)\) and deduce a contradiction. So suppose \(\Phi(x)\). Now, applying our first assumption to this particular \(x\), we learn \(\Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)\), and combining this with our most recent assumption, we learn that \(\exists y \in X.\, y \sqsubset x \land \Phi(y)\), so consider such a \(y \in X\). Since \(y \sqsubset x\), we can use the induction hypothesis to prove \(\neg\Phi(y)\). But we also have \(\Phi(y)\), to which we can apply \(\neg\Phi(y)\) to deduce a contradiction. QED.

This proof of proof by infinite descent used well-founded induction, but did not need to consider distinct cases related to \(x \in X\). There is no “base case” here, in the sense that no part of the proof must be separately completed without the help of an induction hypothesis. Now, when specialized to natural numbers, one could introduce such a split, breaking out the \(0\) case and arguing directly that there is no smaller number, without reference at all to \(\Phi(x)\). Or more nebulously, we could split on “minimal elements” versus “non-minimal elements”, but to what end?

Part way back around, without proof by contradiction

The following is a bonus: thanks to Andrej who made the observation that inspired it.

We’ve proven that if \(\sqsubset\) is well-founded, then the principle of infinite descent applies. What about the reverse? The answer is “it depends.” In particular, if you prefer to work without proof by contradiction (i.e. constructively), then you can only get part way there. But with some use of proof by contradiction, you can get all the way there. In particular, we can prove that for an arbitrary binary relation \(R\) on a set \(X\), if it satisfies the statement of the principle of infinite descent, then it has no infinite descending chains. Wait, isn’t that enough? Well, no, not in a world where you want to work constructively, without proof by contradiction. I sneakily defined well-foundedness as the assertion that all chains are finite, which is a stronger statement than the assertion that there can be no infinite descending chains, so long as you avoid proof by contradiction. The positive conception is nice because it works in both contexts, whereas “no infinite descending chains”, which is often taken as the definition of well-foundedness in classical logic, does not.

Ok, on to the proof. First, let’s get precise about descending chains. Define a descending chain of a binary relation \(R\) to be a partial function \(f : \mathbb{N} \rightharpoonup X\) such that for all natural numbers \(n\), if \(f(n+1)\) is defined, then so is \(f(n)\), and furthermore \(f(n+1) \mathrel{R} f(n)\). Then an infinite descending chain is a descending chain that is defined for every natural number \(n\).
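As a throwaway illustration (representing partiality by returning None, with names of my own choosing), the defining condition of a descending chain can be checked up to any finite bound in Python:

    from typing import Callable, Optional

    def is_descending_chain(f: Callable[[int], Optional[int]],
                            R: Callable[[int, int], bool],
                            bound: int) -> bool:
        # For n = 0 .. bound-1: if f(n+1) is defined, then f(n) must be
        # defined and f(n+1) R f(n).
        for n in range(bound):
            if f(n + 1) is not None:
                if f(n) is None or not R(f(n + 1), f(n)):
                    return False
        return True

    # An infinite descending chain for < on the integers: ..., -8, -4, -2.
    assert is_descending_chain(lambda n: -(2 ** (n + 1)), lambda x, y: x < y, 100)

An infinite descending chain is one defined at every natural number; a finite check like this can only validate the chain conditions up to a bound, which is all we need for illustration.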

Now, we can prove that if \(R\) is an arbitrary binary relation (rather than a well-founded one) that nonetheless satisfies the letter of the principle of infinite descent, then \(R\) has no infinite descending chains.

Proof. Suppose \(R \subseteq X\times X\), and suppose that for any property \(\Phi(x)\): \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \mathrel{R} x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\] Then it suffices to assume \(f : \mathbb{N} \to X\) is an infinite descending chain and then deduce a contradiction. To do so, it further suffices to let \(\Phi(x) := \exists n\in\mathbb{N}.\,f(n)=x\) and prove the premise of the proof-by-infinite-descent-shaped assumption, because from that we can deduce that \(\forall x\in X.\,\neg\exists n\in \mathbb{N}.\,f(n)=x\) (i.e. \(f\) is totally undefined, the empty partial function), which combined with \(x = f(0)\) and \(n = 0\) suffices to deduce a contradiction.

Let’s prove the sufficient condition: suppose \(x \in X\) and \(\Phi(x)\), i.e. that \(f(n) = x\) for some \(n \in \mathbb{N}\). Then \(\Phi(f(n+1))\) holds, and by assumption \(f(n+1) \mathrel{R} f(n)\), which can be combined to prove the existential conclusion. QED.

Coda

According to John Stillwell’s Elements of Algebra (Section 2.10), Blaise Pascal was “the first to use induction in the ’base case, induction step’ format” in a paper from 1654. But our modern understanding of induction has its origins in Richard Dedekind’s seminal 1888 book “Was sind und was sollen die Zahlen?” (sometimes translated “What are the numbers and what are they for?”). Many defaults—be they notation, techniques, or explanations—in mathematics have historical origins that precede a more sophisticated understanding of phenomena, and sometimes an impedance mismatch between the historical and the modern can limit our understanding (the treatment of variables and functions by mathematical texts is one of my standard whipping horses, but that’s a topic for another day). Those perspectives persist because they were useful, and are still useful today, but it also helps to consider how they might best be reconciled with the refined conceptions we’ve gained since their inception.

As for the use of proof by infinite descent versus a proof like Andrej’s that argues directly from minimality, one could reasonably argue that explicitly calling out the existence of a minimal case is more intuitively graspable for some arguments. It’s not clear to me whether arguing from the perspective of “there must be one or more minimal cases” versus “you can’t go down this way forever” is more illuminating in general, especially once you have come to understand proof by infinite descent. Furthermore, in the case of general well-foundedness, there may not be a single smallest case, but rather a (possibly infinite) number of minimal cases, in which case calling all of them out abstractly may be less compelling than calling out a singular minimum as was the case (pun intended) for the natural numbers. Which argument is most enlightening probably depends on personal taste.

Acknowledgments

Many thanks to Andrej Bauer for feedback on this post! Thanks to folks on Hacker News who noted that the annoying proof did not address \(a=0\).


  1. For those interested in precise paper proofs, I highly recommend Leslie Lamport’s talk and paper on the topic.↩︎