Sunday, 22 June 2025

The Product Topology, or, How Not to Stumble into the Box Topology

WARNING: This post won’t make much sense unless you’ve read at least my first Topology As Touching post.

Product Topologies and the Mystery of the Finiteness Constraint

Okay, it looks like I’m back to posting about topology. As per usual, I have been slow-burn revisiting topological notions through the lens of touching, a different axiomatization of topology that seems to give me a different intuitive sense of the relevant concepts than the way I originally learned about them from Munkres’s textbook. In general I’m a fan of being able to triangulate mathematical concepts from different perspectives, and topology’s abstractness and counter-intuitive facets push me to keep searching for new sources of clarity. One of these counter-intuitive facets is the construction of product topologies, especially when the product is infinite. As I’ll describe next, the usual presentation magically introduces a particular “finiteness” constraint that feels to me like it arises with little intuition, aside from “getting the right answers.” I hope to show that by switching from the open-set conception of topologies to the touching conception of topologies, we can extract a bit more intuition about why finiteness arises in infinite product topologies.

Product Topologies the Open Set Way

Munkres, like most topology texts, explains the concept of topology in terms of open sets (Bourbaki can be credited/blamed for this). Thus, when it comes time to explain many topological concepts, they are often described in terms of their effect on open sets, or their interaction with open sets. Naturally this approach is applied to product topologies.

Given two sets \(A\) and \(B\), and their elements, \(a \in A\), \(b \in B\), we often describe the product set \(A \times B\) of ordered pairs \(\braket{a,b}\) of elements of the two sets. This idea extends to topologies: given two topological spaces \(\mathcal{A}\) and \(\mathcal{B}\), we can describe the product space \(\mathcal{A} \times \mathcal{B}\), i.e. the space given the product topology in accordance with topologies \(\mathcal{A}\) and \(\mathcal{B}\). Since some set \(A\) underlies the space \(\mathcal{A}\) and some set \(B\) underlies \(\mathcal{B}\), we expect, rightfully, the product space \(\mathcal{A} \times \mathcal{B}\) to be underlain by the product set \(A \times B\) of pairs. The question to be answered is: what is the appropriate topology to apply to subsets of \(A \times B\), given the topology that \(\mathcal{A}\) imposes on \(A\) and \(\mathcal{B}\) imposes on \(B\)?

Let \(O_A\) be the set of open sets associated with \(\mathcal{A}\), and \(O_B\) be the open sets for \(\mathcal{B}\). Then the open sets for the product topology \(\mathcal{A} \times \mathcal{B}\) are constructed from the subsets \(U_A \times U_B\) for every open set \(U_A \in O_A\) and \(U_B \in O_B\). The details of the construction are not important for the purposes of this post, but at least to grad-student me, given the idea of open sets in the first place, it “feels right” that the open sets of \(\mathcal{A} \times \mathcal{B}\) would be constructed from the open sets of the component topologies.

All is great...right up until the point where we consider extending the idea to infinite products. We can generalize the notation of set-theoretic products \(A \times B\) to \(\Pi_{j \in \set{1,2}} A_j\), so that \(A\) is now \(A_1\) and \(B\) is now \(A_2\). Then each pair \(\braket{a,b}\) is not unlike a Python dictionary \(\set{1:a, 2:b}\) where the numbers \(1\) and \(2\) are the keys to records. In this explicitly indexed notation, the order of the entries no longer matters. This notation comes in handy if we want an infinite set of keys like \(j \in \mathbb{N}\) (where \(\mathbb{N}\) is the set of natural numbers \(0, 1, 2, \dots\)), which gives us a record with an infinite number of elements. So the general notation for products becomes \(\Pi_{j \in J} A_j\), but for simplicity we’ll focus on \(J = \mathbb{N}\), the natural numbers.
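(If it helps to see the dictionary analogy spelled out, here is a tiny Python sketch of my own, purely illustrative: a finite indexed product is literally a dictionary keyed by the index set, and an element of an infinite product over \(\mathbb{N}\) can be modeled as a function from indices to coordinates.)

```python
# A finite product as a dictionary keyed by the index set {1, 2}.
record = {1: "a", 2: "b"}        # plays the role of the pair <a, b>

# An element of an infinite product over the naturals, modeled as a function
# from indices to coordinates; here, the sequence 0, 1, 4, 9, ...
def squares(j: int) -> int:
    return j * j

# "Projection" at coordinate j is just lookup (or application) at j.
assert record[1] == "a" and squares(3) == 9
```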

At this point, the seemingly obvious (to me, and it seems, to Munkres) approach to extending product topologies to infinite products ends up being wrong. Given a family of topological spaces \(\left( \mathcal{A}_j \right)_{j \in \mathbb{N}}\), you might expect the open sets to be constructed from the sets \(\Pi_{j \in \mathbb{N}} U_j\), i.e. arbitrary products of open sets \(U_j\) drawn from the corresponding topologies. Unfortunately the result is indeed a topology, but it is the wrong one, in the sense that this particular topological space exhibits weird and unpleasant properties. That kind of topology, called the box topology, ends up being a source of mind-bending counterexamples. To quote Munkres:

We shall find that a number of important theorems about finite products will also hold for arbitrary products if we use the product topology, but not if we use the box topology. As a result, the product topology is extremely important in mathematics. The box topology is not so important; we shall use it primarily for constructing counterexamples.

So if following our nose does not get us the “right” product topology, how do we produce the right one? Well, to summarize quickly, it is constructed from subsets \(\Pi_{j \in \mathbb{N}} G_j\) where a finite number of the \(G_j\) are open sets \(U_j\) of their corresponding topologies, and the rest of the \(G_j\) are the entire coordinate set \(A_j\). What the heck?!? For some reason only a finite number of open sets from the collection of topologies is used. Why? Well, as Munkres explains it, because the resulting infinite product topologies behave like the finite ones. The textbook does provide some theorems about the two topologies that will give some satisfaction to folks with background in a certain branch of abstract mathematics (see the end of this post if you see yourself in this sentence!), but I do not find that explanation satisfying from the internal point of view of topology itself.

Going forward, once the distinction between box topology and product topology is made, the textbook proposes various exercises to demonstrate that the box topology is just wrong. One learns by example, not by concept, about this wrongness. Don’t touch the third rail.

You can probably guess by now where this is going. Thanks to my time staring at, and thinking about, touching, I think I have an explanation for why this peculiar finiteness constraint arises in the product topology. In addition to the appeal to touching, this is also a consequence of thinking differently about products: not as particular sets, but as particular interfaces. Hopefully this is an example of how important it is to distinguish between extensionally equivalent mathematical definitions, in that they may convey significantly distinct intensional knowledge that suggests substantially different conceptions and generalizations. In this case, at least, I would not have gotten here without switching from open sets to touching.

Products of Sets

First, let’s set aside topology and ask the simpler question: what is a product? I’m going to answer this question in the context of set theory, but the idea carries over to other mathematical foundations. A product is not a thing, but rather it’s an interface: an API if you will. The most common presentations of set theory (like ZFC) only have sets as entities: they don’t even have atomic non-set elements, let alone pairs of them! The main reason is that sets end up being enough: you can encode everything else using them, just like you can encode all the stuff on your computer using 0’s and 1’s. You don’t want to think at that level of abstraction 99% of the time, but it keeps things quite simple and uniform if your goal is to reason about set theory, and not to get work done in a set theory (hence my quip that set theory is the X86 instruction set of mathematics: ugly but effective).

To explain products, we’ll start simple, with pairs of natural numbers, and work up to products in general. So suppose I have the number \(5\), and the number \(7\). Informally speaking, I may want to put them together as a pair \(\braket{5,7}\), with \(5\) as the first coordinate, and \(7\) as the second. On the other hand, I might want the numbers to be reversed, \(\braket{7,5}\), with \(7\) as the first coordinate and \(5\) as the second. These are not the same pair, because the first coordinates are not equal, and also the second coordinates are not equal either. This distinction, identifying a pair with its first element and second element, is the essential concept: a pair \(p_1\) is identical to the pair \(p_2\) if and only if the first coordinate of \(p_1\) is identical to the first coordinate of \(p_2\) and the second coordinate of \(p_1\) is identical to the second coordinate of \(p_2\). For the moment, we will use the notation \(\pi_1(p)\) to denote the first coordinate of \(p\), and \(\pi_2(p)\) for the second, assuming that \(p\) is indeed a pair. Furthermore, for each ordered couple of numbers \(n_1\) and \(n_2\) there is a unique pair that has them as coordinates. For this we use the notation \(\braket{n_1,n_2}\).

Ok, so this describes an interface to pairs of natural numbers \(\mathbb{N}\times \mathbb{N}\). How do we actually “implement” that interface? Wellllllll, you can implement it however you want, so long as your “implementation” satisfies the properties that you want. Here’s an “implementation” that works: associate to each ordered couple of numbers \(n_1\) and \(n_2\) the single natural number \(2^{n_1} \mathop{*} 3^{n_2}\). We’ll denote this pair-of-two-numbers-represented-by-one-number \(\braket{n_1,n_2}\). This is our representation of pairing. For example, \(\braket{5,7} \equiv 69,984\) and \(\braket{7,5} \equiv 31,104\). Clear as mud! Now define the set \(P\) to be the set of all natural numbers that can be represented in the form \(2^{n_1} * 3^{n_2}\) for some \(n_1\) and \(n_2\). Now define a function \(\pi_1\) to be the function from \(P\) to \(\mathbb{N}\) that produces the unique value \(n_1\) such that \(p = 2^{n_1} * 3^{n_2}\) for some natural number \(n_2\), i.e. \(\pi_1(\braket{n_1,n_2}) = n_1\). Finally take the analogous step for \(\pi_2\). Voila! So long as you don’t violate the abstraction that you’ve created, then you have a representation of pairs of natural numbers (here is where having a type checker might help). Notice that these pairs \(p\) do not literally have copies of \(n_1\) and \(n_2\) hiding in them. They’re just numbers that satisfy some convenient arithmetical relationship to \(n_1\) and \(n_2\). The functions \(\pi_1\) and \(\pi_2\) are where all the real work happens. You might have noticed that I introduced our representation of pairs using notation (i.e. syntactic sugar), whereas I introduced projection as a function. This saves me from the circular pain of describing a priori what the domain of a function \(\braket{\bullet,\bullet}\) is. For reference, this trick, often associated with Kurt Gödel, depends on \(2\) and \(3\) being prime numbers, and the uniqueness of prime factorization.
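Here is a small Python sketch of this particular “implementation”; the function names pair, pi1, and pi2 are mine, chosen to mirror the notation above, and the sketch is illustrative rather than canonical.

```python
def pair(n1: int, n2: int) -> int:
    """Represent the ordered pair <n1, n2> as the single number 2**n1 * 3**n2."""
    return 2**n1 * 3**n2

def _exponent_of(p: int, base: int) -> int:
    """How many times `base` divides `p` (assumes p >= 1)."""
    count = 0
    while p % base == 0:
        p //= base
        count += 1
    return count

def pi1(p: int) -> int:
    """First coordinate: the exponent of 2 in p's prime factorization."""
    return _exponent_of(p, 2)

def pi2(p: int) -> int:
    """Second coordinate: the exponent of 3 in p's prime factorization."""
    return _exponent_of(p, 3)

assert pair(5, 7) == 69_984 and pair(7, 5) == 31_104
assert pi1(pair(5, 7)) == 5 and pi2(pair(5, 7)) == 7
```

So long as client code only ever goes through pair, pi1, and pi2, it cannot tell this representation apart from any other one.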

There are many, many alternative sets \(P\), conceptions of \(\braket{\bullet,\bullet}\), and functions \(\pi_1\) and \(\pi_2\) that would suffice to implement the pair-of-natural-numbers interface. And we can always translate between implementations as needed if we have multiple different ones at play. All of these implementations capture the same thing, namely the concept of pairs of natural numbers, and impose a behavioural notion of identity among them (cf. “Duck Typing”).

Generalizing beyond numbers, some implementations of pairs in set theory are completely agnostic to the kinds of things being paired. The most famous is due to Kuratowski: \(\braket{a,b} \equiv \set{\set{a},\set{a,b}}\). I’m not going to explain this one, but it just uses a simple set-theoretic encoding to distinguish between the first element and the second element. Note, however, that the pair of two copies of the same element collapses to \(\braket{a,a} \equiv \set{\set{a}}\), which is a little wild, and yet causes no problems (Kuratowski was a bright dude!). There are many, many set-theoretic pair encodings, each with different “implementation-side” properties: but they all satisfy the same interface. Furthermore, these ideas can be extended to arbitrary products \(\Pi_{j \in J} A_j\).

Some Geometric Intuition for Product Topologies

Product topologies build upon this abstract notion of products of sets. But before I get into the formal bits of the product topology, I will try to bring some geometric intuition to bear on the problem. In particular, we’ll start with two-dimensional Euclidean space, the space of two-dimensional geometric objects. In this world, a point touches a set of points if and only if the point is either a member of the set or is “arbitrarily close” to the set, as demonstrated in some of my earlier posts about topology. For now geometric intuition ought to suffice.


Fig. 1: A disc and a point

Imagine that you have a disc and a point next to one another (see Fig 1). Does the point touch the disc? I’m hoping that you will say “no”. But how can you tell? Intuitively there is a gap between the point and any point of the disc, which is easy to tell when looking “down” on this figure from “up above” in the third dimension, which as it happens does not exist in two-dimensional geometry.

Fig. 2: Disc and point, plus coordinate projections \(\pi_1(D)\) and \(\pi_2(D)\)

So let’s consider the problem from within two-dimensional geometry. Suppose that, in order to answer the question, you are only allowed to look at the “shadow” that the figure casts onto the (one-dimensional) x axis, and the shadow that the figure casts onto the (also one-dimensional) y axis (see Fig 2). Uh oh! If we view the projection of the disc and point onto the coordinate axes, then from both vantages, the point (drawn huge for emphasis) appears to be touching the line segment that results from projecting the disc. So it seems, at first, like we simply cannot judge touching correctly based on coordinate projections.

But we can! This is thanks to one of the axioms of touching: if \(a\) is a point, and \(A\) and \(B\) are (possibly overlapping) sets, then \(a \mathrel{\delta}A \cup B\) implies that \(a \mathrel{\delta}A\) or \(a \mathrel{\delta}B\), where \(a \mathrel{\delta}A\) is the relation that says a point \(a\) touches a set \(A\), and by which a topological space can be described. In words, if \(a\) touches a set that can be described in two parts, then it must touch at least one of those two parts. This statement generalizes to any finite number of parts, because we can repeat this reasoning any (finite) number of times, slicing and dicing the original set piecemeal.

This axiom immediately implies the following contrapositive statement: If \(a \mathrel{\not\delta}A\) and \(a \mathrel{\not\delta}B\) then \(a \mathrel{\not\delta}A \cup B\), where \(\mathrel{\not\delta}\) means “does not touch”. So to prove that \(a\) does not touch some aggregate object, it suffices to break it into two pieces and show that \(a\) does not touch either of the pieces.

We also need one more piece of reasoning to solve our puzzle, and it’s based on the fact that the two projections onto the coordinate axes, which we’ll call \(\pi_1\) and \(\pi_2\), are continuous functions. A function \(f\) between two topological spaces is continuous if and only if \(a \mathrel{\delta}A\) implies \(f(a) \mathrel{\delta}f(A)\), where \(f(A)\) is the pointwise mapping of all elements of the set \(A\). I hope that via geometric intuition you can believe that if the point did in fact touch the disc, then the projection of the point would necessarily touch the projection of the disc. By contrapositive, we get: if \(f(a) \mathrel{\not\delta}f(A)\) then \(a \mathrel{\not\delta}A\).

Notice what the above does not say! If we discover that \(f(a) \mathrel{\delta}f(A)\), that tells us nothing about whether \(a \mathrel{\delta}A\) or not. This explains how our point can fail to touch the disc even though each projection of the point touches the corresponding projection of the disc. So sad!

Fig. 3: Bifurcated disc and point, plus projections \(\pi_1(A)\), \(\pi_2(A)\), \(\pi_1(B)\), and \(\pi_2(B)\)

But what if we cut the disc into two semi-discs \(A\) and \(B\) (see Fig. 3)? Well first consider \(A\): \(\pi_2(p) \mathrel{\delta}\pi_2(A)\), which is no better than our previous attempt, but \(\pi_1(p) \mathrel{\not\delta}\pi_1(A)\), which implies \(p \mathrel{\not\delta}A\)! Similarly, \(\pi_1(p) \mathrel{\delta}\pi_1(B)\), but \(\pi_2(p) \mathrel{\not\delta}\pi_2(B)\), which implies \(p \mathrel{\not\delta}B\)! Since we know \(p \mathrel{\not\delta}A\) and \(p \mathrel{\not\delta}B\), we get \(p \mathrel{\not\delta}A \cup B\). Voila!


Fig. 4: Trifurcated disc and point

We need to reflect on this a bit more to understand the pattern. Had our point been a bit further up or to the right of the disc, then we would not have needed to break the disc into two pieces: one big piece would have sufficed because our original projections could have observed non-touching. So we don’t always need to break our set into exactly two pieces (two being the number of coordinates of our product space). Can we break the disc into more than two pieces? The answer is yes! Suppose we broke \(A\) into pieces \(A_1\) and \(A_2\), and left \(B\) the same, giving us a total of three pieces (see Fig. 4). Then we would have found that \(\pi_1(p) \mathrel{\not\delta}\pi_1(A_1)\) and \(\pi_1(p) \mathrel{\not\delta}\pi_1(A_2)\), so all three pieces would still not touch \(p\). Moreover, in the illustrated case, we have split \(A\) in a way that \(\pi_2(p) \mathrel{\not\delta}\pi_2(A_2)\) also holds. So the three-way split was overkill: we could have left things as \(A\) and \(B\), but we could also have originally split into \(A_1\) and \(B \cup A_2\), or even the overlapping \(A = A_1 \cup A_2\) and \(A_2 \cup B\)! Given an arbitrary finite split, we could combine the subsets that \(\pi_1\) can prove do not touch \(p\) and combine the subsets that \(\pi_2\) can prove do not touch \(p\) and still prove failure to touch. In this two-dimensional case, two subsets are sufficient but not necessary.

To summarize, the key to proving that a point does not touch a two-dimensional set through coordinate projections is to find some way to split the set into a finite number of pieces such that, via some projection, each piece can be shown to fail to touch the point. If this is not possible, then the point must touch the set. It is this strategy that we seek to generalize to the definition of arbitrary products of arbitrary topological spaces.
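To make this strategy concrete, here is a small numeric sketch in Python. The specific disc, point, and the choice to split along the line \(y = x\) are all mine, invented to mirror the argument above rather than to reproduce the figures exactly.

```python
import math

# The closed unit disc D centered at the origin, and a point p = (0.8, 0.8),
# which lies outside D (its distance from the origin is about 1.13 > 1).
p = (0.8, 0.8)
r = 1.0  # radius of D

def touches_interval(x: float, lo: float, hi: float) -> bool:
    # In one dimension, a point touches a closed interval iff it lies in it.
    return lo <= x <= hi

# The shadow of the whole disc on either axis is [-1, 1], and both coordinates
# of p land inside it, so the raw projections cannot refute touching.
assert touches_interval(p[0], -r, r) and touches_interval(p[1], -r, r)

# Split D along the line y = x into half-discs A (where y >= x) and B (y <= x).
# A's shadow on the x axis is [-1, 1/sqrt(2)], which p's x-coordinate misses...
assert not touches_interval(p[0], -r, r / math.sqrt(2))
# ...and B's shadow on the y axis is [-1, 1/sqrt(2)], which p's y-coordinate misses.
assert not touches_interval(p[1], -r, r / math.sqrt(2))

# By the union axiom, p touches neither A nor B, hence it does not touch their
# union, which is the whole disc D.
```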

From Products of Sets to Products of Topological Spaces

In the last geometric example, we came up with a way to use projections and one-dimensional geometry to reason about two-dimensional touching. In particular, we noticed that our two-dimensional notion of non-touching corresponded with non-touching of projections of chunks of the two-dimensional set. This is the key to understanding what a topological product is. Once again, it’s not a set, it’s an interface, but that interface has more structure than the one for mere set-theoretic products.

Suppose you have a set \(A\) and a set \(B\), and furthermore you have a notion of touching \(\mathrel{\delta}_A\) and \(\mathrel{\delta}_B\) for each, so they form topological spaces. We already know that we can produce some set \(P\) that implements the product \(A \times B\) set-theoretically, some representation \(\braket{a,b}\) that denotes the pairing of an element of \(A\) with an element of \(B\), and projections \(\pi_1\) and \(\pi_2\) that uniquely project the coordinates back out of the pair representation.

Now, our problem is to answer the question “what topological structure (i.e. touching relationship) applied to \(P\) makes it a product of \((A,\mathrel{\delta}_A)\) and \((B,\mathrel{\delta}_B)\)?” Well, if we want to incorporate the topological structure of \(A\) and \(B\) like our geometric example, then we need the topology on \(P\) to ensure that \(\pi_1\) and \(\pi_2\) are continuous functions. But that’s too easy: just give \(P\) the discrete topology, the one where touching coincides with elementhood (\(x \mathrel{\delta}_P X\) if and only if \(x \in X\), so the least amount of touching possible), or equivalently where all subsets are open (so the most open sets). That would make the two functions continuous, trivially. Looking from the other side, consider the trivial topology: the one where every point touches every non-empty set (so the most touching), or equivalently where the only open sets are the empty set and the entire space (so the fewest open sets). Our projections \(\pi_1\) and \(\pi_2\) would not be continuous unless \((A,\mathrel{\delta}_A)\) and \((B,\mathrel{\delta}_B)\) were also trivial spaces. So we want something in between but canonical.

Here is what we want for \(P\): the topology with the most touching possible for which both \(\pi_1\) and \(\pi_2\) are continuous. This can also be called the coarsest such topology, because coarseness in topology is defined in terms of open sets: it is the topology with the fewest open sets. As it happens, this topology \(\mathrel{\delta}_P\) will have the property that \(p \mathrel{\not\delta}_P X\) if and only if \(X\) can be broken up into a finite number of pieces \(X_1,\dots,X_n\) such that for each \(X_i\) we can use \(\pi_1\) or \(\pi_2\) to prove that \(p \mathrel{\not\delta}_P X_i\).

Conveniently enough, for pairs we can always succeed with just two pieces: union together all the pieces \(X_i\) that can be deemed non-touching by \(\pi_1\), and union together all the pieces that can be deemed non-touching by \(\pi_2\).

To Infinity And Beyond!

Ok, so we came up with an API-based characterization of the product topology for pairs: first come up with a set-based notion of products, and then, given a notion of touching for each coordinate, induce a notion of touching for the entire product space \(P\). What’s wild is that this definition works for arbitrary products, even infinite ones. Furthermore, it helps explain why the definition of infinitary product topologies in terms of open sets has the finiteness constraint!

Instead of subscripts \(1\) and \(2\), consider an infinite number of subscripts \(j \in \mathbb{N}\). So we have an infinite product space, with one projection \(\pi_j\) for each coordinate. We no longer try to use bracket notation \(\braket{\dots}\) to denote these sequences because we don’t have enough ink. So given an infinite collection of topological spaces \(\mathcal{A}_j = (A_j,\mathrel{\delta}_j)\), and a set-theoretic product \(P = \Pi_{j\in \mathbb{N}} A_j\), what is the corresponding product topology \(\mathcal{P} = \Pi_{j\in \mathbb{N}} \mathcal{A}_j\)? Well, it’s the direct analogue of our geometric example: given a product of sets \(\Pi_{j\in \mathbb{N}} A_j\), it’s the coarsest touching relation \(\mathrel{\delta}_P\) over that set—coarsest in the sense of satisfying the most touching relationships, and thereby the fewest non-touching relationships—for which all (infinitely many) \(\pi_j\) are continuous. Technically one must prove that such a touching relation always exists, which it does. And just like before, you get the property that \(x \mathrel{\not\delta}_P X\) if and only if \(X\) can be broken up into a finite number of pieces, each of which can be proven non-touching by some projection \(\pi_j\). The wild thing is that you now have an infinite number of projections \(\pi_j\) to work with, but because of the axioms of touching, you are still constrained to breaking \(X\) into a finite number of pieces for purposes of refutation.
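To gather the pieces in one place, that defining property can be stated symbolically (in this post’s notation) as: \[x \mathrel{\not\delta}_P X \iff \exists\, X_1, \dots, X_n.\; X = X_1 \cup \dots \cup X_n \text{ and for each } i \text{ there is some } j \in \mathbb{N} \text{ with } \pi_j(x) \mathrel{\not\delta}_j \pi_j(X_i).\]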

If you violate that restriction and allow as many pieces as there are projections (i.e. infinitely many), then you end up with fewer touching relationships on \(P\) than your projections \(\pi_j\) would let you get away with, and moreover I think this gives you the dreaded box topology, but why the heck would you do that?!?

As our geometric example suggested earlier, if you have a finite number of projections, then “as many pieces as projections” is both finite, and sufficient for determining non-touching according to the product topology (whose definition allows any finite number of pieces, including more pieces than there are projections). As such, for finite products, the box topology and the product topology coincide!

A Kind Note for Archers

If you are a category theory aficionado and have gotten this far without breaking your reading device, I commend you! Many of the ideas that I describe above, about APIs, interconvertibility, and such, are most compactly and most generally described using the concept of categorical products, or more generally limits, and friends. And indeed, if you already knew about category theory, I could transmit sufficient conditions for you to know what a topological product is in one sentence: it’s any pair of a set and continuous projections that satisfies the universal property of a categorical product. The possibility of this compactness might make the very non-compact mass of words above feel a bit frustrating.

However I don’t think the above sentence could convey, viscerally, all of the different intuitions that I wanted to convey to a general audience: thinking with APIs, being constrained to looking through projections, and the focus on breaking sets into finite pieces. I guess you could say that I like to appeal to categories to make sure I got “the right answer”, but I compile them away to explain concepts in a way that tickles pre-mathematical intuitions and demands fewer prerequisites.

Sunday, 12 January 2025

What's So "Natural" About Natural Transformations?!?

I’m probably the wrong person to listen to when it comes to anything about that extension of abstract algebra we call category theory. I’ve been trying to learn about it for literally decades, and much of that time has involved an uncomfortable feeling that something fundamental about the material is just bouncing off of me. Contrast this with many of my colleagues who find great comfort in conceiving of mathematical concepts in the terminology of this area, and feel unsure whether they understand some concept before it has been rephrased in category-theoretic terms. Unfortunately these explanations (which scroll past me on social media and group conversations and Q&A at seminars, and...and...) don’t seem to be at the right level or angle to scaffold me up further into the topic. I experience this regularly and in some ways I am intensely jealous. Something about this doesn’t click for me: many things don’t click.

On the other hand, this feeling of being lost has been my friend and companion for many years about many topics, some of which I feel far more comfortable with now, so I continue to seek the right metaphor, the right story, that resonates with me and helps me continue to make forward progress, and teach concepts in a way that seems to work for others besides myself. Hence the story you are about to embark on.

I understand enough to know that category theory is useful as an organizing principle, and I must admit that those aspects of it that have breached my blood-brain barrier have been useful for conceptualizing in the abstract. When explaining mathematical concepts, you will not see me drawing commuting diagrams to explain concepts (except when they are literally rudimentary category-theoretic concepts), but you will see me verbally or formulaically appeal to categorical concepts like “universal properties”, albeit in more concrete terms than that. So I persist, years later, in making progress, and perhaps some of my recent sense of progress will help someone else.

Given the above, take the below with the appropriate grains of the appropriate salts!

My Problem

During one of my bouts of “let’s try to learn some category theory again”, I was poring over Eilenberg and Mac Lane’s General Theory Of Natural Equivalences, the classic paper that makes the case for natural equivalence as an important and hitherto invisible concept. Along the way, the paper introduces natural transformations, which are the basis for defining natural equivalences.

The paper starts with an observation about vector spaces and constructions on vector spaces: schematic methods for producing one kind of vector space from another. They observe some peculiar and compelling relationships, not between the spaces themselves, but rather between the constructions themselves.

A discussion of the “simultaneous” or “natural” character of the isomorphism \(L \simeq T(T(L))\) clearly involves a simultaneous consideration of all spaces L and all transformations \(\lambda\) connecting them; this entails a simultaneous consideration of the conjugate spaces \(T(L)\) and the induced transformations \(T(\lambda)\) connecting them [emphasis mine]. 1

The relationship is captured in the form of schematic relationships that appeal to algebraic structure-preserving functions among the constructed spaces. I was unfamiliar with the concepts, but I could at least see how the authors recognized an interesting pattern of relationships among kinds of vector spaces, how in certain circumstances those relationships were highly constrained, forced to satisfy a kind of uniformity across pairs of vector spaces related to one another by this notion of “constructions”. They argue that such constrained relationships should be recognized, and show that fundamental to this recognition is how they can be represented using the language of vector space algebra, albeit clumsily and somewhat implicitly. This relationship goes beyond vector spaces: it arises among many varieties of mathematical concepts. But it seems to be this specific application that motivated the idea of “naturality”, whose formalization called for the definition of functors, which in turn required the definition of categories. 2 As I understand it, capturing naturality was the focus, and the rest were necessary components. I have seen it rumoured that Mac Lane said as much, but have yet to track down the source.

So imagine my frustration that naturality, the most important of these concepts, is the one I understand the least. What is being transformed? How exactly is that transformation “natural”? I can see, by example, that there is a pattern to the relationships that are represented by natural transformations and natural equivalences, and by the continued effectiveness of its use that there is something worth articulating here. But I have never felt a visceral grasp for naturality.

Often this naturality is explained as...well natural (pun intended), and that you need to see many examples to get a feel for it. In programming languages theory, my research area, many folks will say to me that “natural transformations are like parametric polymorphic functions.” Indeed, parametric polymorphic functions (don’t worry if you don’t know what these are) are an important concept in PL, and they can be modeled as natural transformations. So this is an important exemplar. But this is still like telling me “A tree is like a Douglas Fir”: the example is the cue to understand the concept, but in an essential way this feels backwards and unfortunate. Indeed many concepts from nature yield messy taxonomies that can only be understood in terms of history and the web of instances, but we’re talking about pure mathematics here!

Another common effort to help explain natural transformations relates them to the concept of homotopy in algebraic topology. When I sat in on Steve Awodey’s (awesome) category theory class in my post-doc days, he too followed the introduction of the standard definition with the homotopy-looking characterization. Having previously learned about homotopies, I found this definition somewhat more comforting than the original, and working through its implications was quite satisfying (and heavily influenced what follows). I could get a sense of “continuous deformation-ishness” from this characterization, but I still could not see/feel the full force of “simultaneity” hidden in the definition.

I’ve wanted a story that I could find more compelling, and couched in terms of the abstraction itself, rather than PL theory or topology. Alongside good examples, I hope for this to help me better understand and explain naturality, rather than simply get used to it.

I think I finally found such a story, one that tickles my happy place. But it forced me to do some violence to the common presentation of categories, functors, and natural transformations. The result is an equivalent formalization of the concepts that better suits my present mental model. This kind of move is rather audacious, and quite frankly may lead to dead ends or problematic views that ultimately cause more conceptual harm than clarity. On the other hand, another perspective may draw out new or different connections, so I offer it for your chewing enjoyment.

So what follows is my own current personal answer to the title: “what is so natural about natural transformations?”. The following will be quite abstract, because it focuses simply on the abstract definitions and abstract properties of category-theoretic concepts. Perhaps in revision I’ll add examples, because that may help explain how I connect the concrete problems addressed by this abstraction to the abstraction itself. So this won’t make much sense if you don’t already have some background in the elementary basics of categories.

What’s a Category?

Despite the importance of natural transformations, the order in which the concepts are taught tends to follow the dependency order: categories, then functors, then natural transformations. The early concepts are needed to define the later ones. So let’s start with a definition of “category” and go from there. As you’ll see, my definition of a category is probably not yours, but it’s equivalent. Furthermore, at its core, it’s not actually mine: Saunders Mac Lane gives it as a brief aside in his famous book Categories for the Working Mathematician. But you’ll see that I’ve changed the terminology and notation for my own personal story-telling purposes: I find that the nuances of informal words used formally heavily influence my thinking about formalities (weak Sapir-Whorf anyone?), so rethinking the words we use for concepts helps me internalize them. 3 I apologize for the confusion this will inevitably cause to some readers. Just remember: my reason for going into a different definition of categories from the norm is that it leads quite directly to my answer to the question posed up-front. In fact, following this thread is how I came to discover this answer. Ok, buckle up!

Informally speaking, a category is:

  1. a collection of “stuff”;

  2. one binary relation among stuff; and

  3. three operators on stuff.

I will call an item of stuff an element: a “real” category theorist would call them “arrows” or “morphisms”. I’m not using those names because those words don’t resonate with me or my inner story about what categories are, though I can see how they served Eilenberg and Mac Lane’s original examples well. History has demonstrated how much more widely these concepts can be applied, so I’m going for that level of generality.

Our one (binary) relation among elements is identity. Given two elements, or more technically expressions that describe elements, they are either identical, meaning they both describe the same element or not, i.e. \(a = b\) or \(a \neq b\). The significance of identity is that given any proposition \(\Phi\) in the language of category theory, if \(a = b\) then you can replace every (free) instance of \(a\) with \(b\) because the two expressions are simply different names for the same element. This is kind of like aliasing when programming with pointers in an imperative language.

Now for the operations. All three operations take elements as inputs and (maybe) produce elements as output. The first two operations are unary, and total. Associated with every element \(a\) is a unique (in the sense of identity) left unit element \(L(a)\) and a unique right unit element \(R(a)\) (or “left element” and “right element” for short). Since these elements are unique, we represent leftness and rightness with operators rather than relations. For reference, actual category theorists would call \(L(a)\) the domain of \(a\), and \(R(a)\) the codomain of \(a\), and the domain and codomain would not be elements...err...arrows, but instead some other distinct kind of stuff called “objects”. As you’ll see, I’ve banned other stuff from my island. You can insert the dog-with-frisbee meme here: Elements?? No Objects!! ONLY ELEMENTS.

Elements, identity, and left and right elements suffice to introduce some of the properties required of \(R(a)\) and \(L(a)\). First, if element \(a\) is the left element \(L(b)\) or right element \(R(b)\) of some element \(b\), then \(L(a) = a\) and \(R(a) = a\). It’s worth noting that \(b\) could actually be \(a\), so if \(L(a) = a\) then we can deduce \(R(a) = a\) as well. That means, for example, that \[{L(R(L(L(R(L(R(R(a)))))))) = R(a)},\] for any \(a\). These “self-adjacent” elements serve the purpose that objects do in most texts. They are usually called “identity arrows”, but for reasons that unfold below, I call them unit elements.

The third, and last, operation is binary and partial. To every two elements \(a\) and \(b\) such that \(R(a) = L(b)\), that is, for which the right element of \(a\) is identical to the left element of \(b\), we associate a unique composite element \(C(a,b)\). As is typical, we sometimes use an infix notation \(a \mathbin{C}b\), especially when chaining compositions together. But I’ll use both prefix and infix notation interchangeably.

This raises an important (to me) conceptual-terminological point. I have strongly mixed feelings about the use of the word “composite” here, but I don’t see the possibility of a better word. In particular, the composite element \(C(a,b)\) is not composed of \(a\) and \(b\), in the sense that there are not little \(a\)’s and little \(b\)’s hiding inside of \(C(a,b)\). Rather, \(C(a,b)\) is “just” another name for some element of the category, and it’s that name that is composed of names for \(a\) and \(b\), not the element itself. As we’ll see, every element can have many names, and the names are the things that have “compositional” structure.

My view of category theory is that it foregrounds a common and critical notion of compositional surrogative reasoning, in the sense that we reason about mathematical objects (or phenomena) by manipulating (textual) stand-ins for those actual objects. The first sense in which this is true is the difference between expressions in the language of category theory, and the elements of the category (the elements and the relationships represented by operators and relations) themselves. We are playing shadow puppets, using a highly expressive “API”, but don’t confuse the API with its implementation!

Thanks to the composition operator, one can reason about properties of some notion (like the element \(C(a,b)\)) in terms of relationships between other notions (like \(a\) and \(b\)), even if there is no literal component-wise structure in the formalism. This is why I emphasize that \(C(a,b)\) is associated with \(a\) and \(b\), not comprised of them. There may be other couples, triples, or quadruples of elements associated with that same element. Nonetheless, our elements are “atomic” in the sense that they have no discernible internal structure. The only direct analysis of elements is whether they are identical or not. So how does category theory connect to concrete mathematics? Well, that’s a story for another day, but I view categories much like I view relational database indexes: they’re a map, not the territory.

Despite everything I said about my not-real-composition perspective on composition, I still find it useful to think of these properties in real-composition terms. For instance, if we visually think of elements as Lego blocks, laid on one side, then the left (right) element can be viewed as describing the connection interface of the left (right) side of an element. Then we can imagine hooking the right of \(a\) to the left of \(b\) as possible when their descriptions are compatible, leaving the left of \(a\) still visible and available, and the right of \(b\) still visible and available. However, one must keep in mind that our notion of composition is not real hooking together of blocks: you can imagine \(C(a,b)\) as being a distinct larger monolithic Lego block that has no seams, but acts as though you had hooked an \(a\) and \(b\) together. So the composite is an element that looks “as if” it had been constructed by linking an \(a\) and \(b\), though one did not. What a strange Lego set, where you never compose, since you already have all the aggregates. This perspective guides how I interpret the properties of \(C\).

The first two properties completely determine the left and right elements of composites: \(L(C(a,b)) = L(a)\) and \(R(C(a,b)) = R(b)\). Given a bag of elements, each of which has known left and right elements, these properties constrain which associated composites are plausible.

We have more relationships that connect left and right elements to composites. The names left and right come from the relationship between these elements and composition, partly because of the above property, but also because of the following ones: \(C(L(a),a) = a\) and \(C(a,R(a)) = a\). So to use terms from algebra \(L(a)\) is a left unit of \(a\) with respect to \(C\), and \(R(a)\) is a right unit of \(a\) with respect to composition. These properties are what motivates the name unit elements because of their status as left or right units of some other element(s) with respect to composition. This further emphasizes the sense in which the element \(L(a)\) describes the left interface of \(a\) (and does nothing more), and \(R(a)\) describes the right interface of \(a\) and nothing more. My image is of an infinitesimally thin Lego block, so thin that its left interface and its right interface can’t help but be described by the same element...namely itself! And hooking it onto the left of something for which it is a left unit makes no perceptible difference, and same for the right. Stepping back for a moment, what fascinates me here is that some elements, namely the unit elements, can be used as both subjects of composition as well as descriptors for all and only those pairs of elements that yield well-formed compositions. I find this parsimony and economy of concepts viscerally satisfying, at least in the abstract.

Finally, the associations among elements that are described by composition carry a strong coherence property, which can be described succinctly, but I will spell it out in a way that better matches my thinking. Suppose we have three elements, \(a\), \(b\), and \(c\), and suppose that \(R(a) = L(b)\) and also \(R(b) = L(c)\). We assume nothing about \(L(a)\) or \(R(c)\) except that such elements must exist. Now, the above two equalities imply the existence of some composite element \(d = C(a,b)\) associated with \(a\) and \(b\), and some other composite element \(e = C(b,c)\) associated with \(b\) and \(c\). Now, because of the earlier composition properties, \(L(e) = L(C(b,c)) = L(b)\), so there must be an element \(f = C(a,e)\), and since \(R(d) = R(C(a,b)) = R(b) = L(c)\), there must be an element \(g = C(d,c)\). The property we demand is that \(f = g\): these two distinct orders of composition yield identical associated elements. This property implies that given any sequence of elements \((a, b, c, \dots, z)\) such that the right element of one is identical to the left element of the subsequent element, then sequentially composing elements according to any strategy of compositions (e.g., first-to-last, last-to-first, middle-out, random) yields a unique associated element \(a \mathbin{C}b \mathbin{C}\cdots \mathbin{C}z\). We could concisely say that composition is “associative”: \(C(C(a,b),c) = C(a,C(b,c))\), but I like to draw this out and dwell on it. It’s a strong coherence property with respect to identity.

All of the above makes a category! Those who are used to category theory have surely noticed a few peculiarities of this definition. First, there’s the different naming to spark different intuitions. Terms like “arrows” or “morphisms” have a connotation of going from one place to another, or of performing a transformation, whereas “elements” have a connotation of simply “being” and having no internal structure. Elements don’t “do” anything, or contain anything: they merely observe external relationships with other elements. Second, we’ve gotten rid of “objects” as classifiers of compositions. Unit elements can do that just fine, and lead to a very different feel.
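To ground this elements-only “API” in something concrete, here is a small Python sketch of one possible implementation, where the elements are functions between finite sets. The encoding (a triple of domain, graph, and codomain) is entirely my own choice, made for illustration; it is one implementation of the interface, not the interface itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Elem:
    dom: frozenset   # determines the left unit L
    graph: tuple     # sorted tuple of (input, output) pairs
    cod: frozenset   # determines the right unit R

def unit(s: frozenset) -> Elem:
    # The unit element for a finite set: the identity function on it.
    return Elem(s, tuple((x, x) for x in sorted(s)), s)

def L(a: Elem) -> Elem:
    return unit(a.dom)

def R(a: Elem) -> Elem:
    return unit(a.cod)

def C(a: Elem, b: Elem) -> Elem:
    # Composition is defined only when R(a) = L(b), i.e. a's codomain is b's domain.
    assert R(a) == L(b)
    fa, fb = dict(a.graph), dict(b.graph)
    return Elem(a.dom, tuple((x, fb[fa[x]]) for x in sorted(a.dom)), b.cod)

# The unit laws from above hold for this encoding:
A, B = frozenset({1, 2, 3}), frozenset({10, 20})
f = Elem(A, ((1, 10), (2, 10), (3, 20)), B)
assert C(L(f), f) == f and C(f, R(f)) == f
```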

As mentioned, Mac Lane notes this definition, but goes with the traditional one: 4

Since the objects of a metacategory correspond exactly to its identity arrows, it is technically possible to dispense altogether with the objects and deal only with the arrows (CWM, p. 9).

Indeed objects give a nice connection to sets and arrows to functions in categories associated with set-theoretic structures. But for understanding category theory in a way that is removed from its set-facing origins, and the mental biases that it brings (especially for an unabashed fan of set theory) I like to suppress those associations, and I feel that objects are a cognitive distractor: what fundamentally matters is the relationships among elements, in light of the structure mandated by the properties of left and right unit elements and composition.

In my mind, category theory is fundamentally about abstract interfaces among abstract elements, and one ought not worry about the idea, or even existence, of truly composite structures (the implementations). Focus on external relationships as the essence of categorical structure, rather than the internal composition of entities via construction of “objects”: that’s what set theory is for!

What’s a Functor?

We have now re-imagined categories as bags of inter-related elements with fundamental composition properties, described entirely in terms of elements. Let’s move on to our first kind of relationship between categories. Suppose we have two categories, \(\mathcal{C}\) and \(\mathcal{D}\). Each is a bag of elements, each has a notion of identity among elements within the category: there is simply no notion of identity among elements in distinct categories! As far as we’re concerned, the question doesn’t even make sense! Instead, we develop relationships among elements of distinct categories by adding more structure to our repertoire. For precision, we index the operations and relations of each category to emphasize that these relations and operations are well-defined only within the relevant category. For example, \(=_{\mathcal{C}}\) is the identity relation of category \(\mathcal{C}\). Relationships among categories are described by additional structure that is constrained by the details of the category-specific relationships.

A functor \(F\) describes a binary association between the elements of two categories. So to every element \(c\) of category \(\mathcal{C}\), functor \(F\) associates an element \(d\) of category \(\mathcal{D}\). Thus you can say that \(F\) has a total functional relation structure...it’s a total function, but only in the sense that a unique \(d\) is associated with each and every \(c\). I do not think of this as a procedure that operates on things, just a relationship.

We demand more from our functors than totality or functionality: a functor must also engage with the structure that the two categories possess: \(=_{\mathcal{C}}, L_{\mathcal{C}}, R_{\mathcal{C}},\) and \(C_{\mathcal{C}}\) for \(\mathcal{C}\), and \(=_{\mathcal{D}}, L_{\mathcal{D}}, R_{\mathcal{D}}\), and \(C_{\mathcal{D}}\) for \(\mathcal{D}\). First, \(F(R_{\mathcal{C}}(c)) =_{\mathcal{D}} R_{\mathcal{D}}(F(c))\) and \(F(L_{\mathcal{C}}(c)) =_{\mathcal{D}} L_{\mathcal{D}}(F(c)):\) left-elements and right-elements are preserved. Thankfully, because of the inherent properties of left and right elements, we also get that unit elements of \(\mathcal{C}\) must map to unit elements of \(\mathcal{D}\), since \(R_{\mathcal{C}}(c) =_{\mathcal{C}} c\) implies \(R_{\mathcal{D}}(F(c)) =_{\mathcal{D}} F(c)\). The second (and last!) property is that if \(R_{\mathcal{C}}(c_1) =_{\mathcal{C}} L_{\mathcal{C}}(c_2)\), then \(F(C_{\mathcal{C}}(c_1,c_2)) =_{\mathcal{D}} C_{\mathcal{D}}(F(c_1),F(c_2))\): associated composite elements are preserved.
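Continuing the illustrative Python sketch from the categories section, a functor is nothing more than a mapping on elements that preserves these relationships. Here is a hypothetical checker (the names and signature are mine) for the two laws over a finite sample of elements, assuming the L, R, and C operations of the source and target categories are passed in as functions.

```python
def preserves_structure(F, elems, src, tgt) -> bool:
    # src and tgt are (L, R, C) triples of the source and target categories.
    L1, R1, C1 = src
    L2, R2, C2 = tgt
    # Law 1: left and right elements are preserved.
    for a in elems:
        if F(L1(a)) != L2(F(a)) or F(R1(a)) != R2(F(a)):
            return False
    # Law 2: composites of composable pairs are preserved.
    for a in elems:
        for b in elems:
            if R1(a) == L1(b) and F(C1(a, b)) != C2(F(a), F(b)):
                return False
    return True
```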

From a purely technical standpoint, I much prefer this definition of functors to the standard definition, which requires thinking about mappings of two different kinds of things, and some unpleasant interactions and notational oddities that ensue. If you know, you know!

What’s a Natural Transformation, and What’s So “Natural” About It?

Ok, we’ve got categories, and we’ve got functors: it’s time for the main event! So what is a natural transformation? As with the last sections, I am simply going to give an abstract definition, without concern about its utility, only that it be equivalent to the standard one. Here’s where things get different.

Suppose you have two categories, \(\mathcal{C}\) and \(\mathcal{D}\), and you have two functors, \(F\) and \(G\), both of which describe relationships between \(\mathcal{C}\) and \(\mathcal{D}\), in that they describe structure-preserving mappings from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\). These are the core constituents of a natural transformation.

By my definition, a natural transformation is a fundamentally different kind of mapping from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\), constrained by the two functors. This sounds rather different from the usual presentation as being a relationship between functors, but only in the sense that the importance of the functor relationship is a bit more implicit. Nonetheless the existence of a natural transformation implies something about the two relevant functors, and the two relevant categories.

A natural transformation \(\eta\) subject to \(F\) on the left and \(G\) on the right is a total mapping from elements of \(\mathcal{C}\) to elements of \(\mathcal{D}\), much like each of \(F\) and \(G\) is, but with different properties. First, if \(c\) is an element of \(\mathcal{C}\) then \(L_{\mathcal{D}}(\eta(c)) = F(L_{\mathcal{C}}(c))\) and \(R_{\mathcal{D}}(\eta(c)) = G(R_{\mathcal{C}}(c))\): the two functors determine the left and right “interfaces” of the mapping for any given element.

Now here’s the punchline. Suppose \(c\) is an element of \(\mathcal{C}\), and that it can be described by the sequential composition of \(n\) elements: \[c = c_1 \mathbin{\mathbin{C}_{\mathcal{C}}} \cdots \mathbin{\mathbin{C}_{\mathcal{C}}} c_{i-1} \mathbin{\mathbin{C}_{\mathcal{C}}} c_i \mathbin{\mathbin{C}_{\mathcal{C}}} c_{i+1} \mathbin{\mathbin{C}_{\mathcal{C}}} \cdots \mathbin{\mathbin{C}_{\mathcal{C}}} c_n.\] Then: \[\eta(c) = F(c_1) \mathbin{\mathbin{C}_{\mathcal{D}}} \cdots \mathbin{\mathbin{C}_{\mathcal{D}}} F(c_{i-1}) \mathbin{\mathbin{C}_{\mathcal{D}}} \eta(c_i) \mathbin{\mathbin{C}_{\mathcal{D}}} G(c_{i+1}) \mathbin{\mathbin{C}_{\mathcal{D}}} \cdots \mathbin{\mathbin{C}_{\mathcal{D}}} G(c_n),\] for any element \(c_i\) in the original sequence. This notation means that we can describe \(\eta(c)\) by mapping some element \(c_i\) of the sequence using \(\eta\) (“recursively”), mapping all elements to the left of it in the sequence (if any) using \(F\), mapping all elements to the right of it in the sequence (if any) using \(G\), and composing the results. The treatment of left and right elements by \(F\), \(\eta\), and \(G\) respectively guarantees that such a composition exists, so it’s the identicality that is the key. This property must hold for all possible composition-based descriptions of all elements of category \(\mathcal{C}\).

This property deserves some pondering to consider how wild and constraining it is! Consider our element \(c_i\). If \(c_i\) is described by another sequence of compositions in \(\mathcal{C}\), then \(\eta(c_i)\) in turn can be described by \(F\)-ing some elements of the sequence, \(\eta\)-ing one element of the sequence, and \(G\)-ing the rest of the elements. And we can replay this again. And again. And again! That is to say, the interaction that \(\eta\)’s mapping has with composition and the two functors is of a “fractal” or “self-similar” nature: we can keep “zooming in” to composition sequences and re-describing the aggregate association of \(\eta\) using a single application of \(\eta\) to “cross” from \(F\)’s actions to \(G\)’s. So this property touches every \(L_{\mathcal{C}}\), \(R_{\mathcal{C}}\), and \(C_{\mathcal{C}}\) relationship of \(\mathcal{C}\), and it both constrains and depends on the structure of \(\eta\)’s image in \(\mathcal{D}\). These properties are global and pervasive and simultaneous.

The naturality of natural transformations is often described via a form of “simultaneous local coherence” properties stated only in terms of the analogue of unit elements (so-called “objects”). With some reasoning, we can show that a natural transformation of the kind I describe above can be fully determined by its treatment of unit elements, using a few “local” relationships about rotating which unit element gets mapped by \(\eta\). That characterization is quite compact, which gives a sense of parsimony. But to me it obscures the global implications of a natural transformation and how global the constraints fundamentally are, how “simultaneous” and “fractal” they are. That description is equivalent to this one, but I think it understates the significance of the simultaneity, by focusing the mind on unit elements, and providing a necessary and sufficient condition that doesn’t quite convey how all elements and their properties are brought to bear, in lock-step, on the concept.

Naturality touches every element of the source category, and every description of every element of the source category. I think the path equation in particular underscores the extent to which a natural transformation is like a self-supporting arch: all of the pieces must fit together like clockwork to support the property. There can exist multiple natural transformations subject to two functors, but each one manifests a category-wide equilibrium that, at least intuitively speaking, would be difficult to imagine perturbing. Arches are beautiful and fascinating naturally-occurring phenomena. Their balance defies our intuitions and inspires humans to recreate that beauty in their own inventions. The same seems to be true for natural transformations: their structure was at first discovered, rather than explicitly developed, but now they are devised with full awareness by informed minds.

A reasonable question to ask: is this global formulation of natural transformations useful? To be honest I don’t know. The original motivations for natural transformations (natural equivalence) were fundamentally based on abstracting constructions on sets, and this same facet manifests in discussions of parametric polymorphic functions. Perhaps parametric polymorphic functions can be reconceived as a different form of “liftings-plus-embeddings” applied to arbitrary functions, subject to two (functorial) type constructors, rather than as a “generic function” applied to individual types (i.e., sets of values, independent of all possible mappings from them). This conception might make parametric reasoning properties more explicit. But I haven’t given this much thought (yet, I hope). I can only speculate at this point.

Thanks to Yanze Li for typo fixes!


  1. In this quote, “transformation” refers to structure-preserving mappings (linear transformations) between vector spaces, not natural transformations.↩︎

  2. Well actually, Eilenberg and Mac Lane have a 1942 article that identifies the concept of naturality, where they focus on group theory. My bad!↩︎

  3. Lucky for me I’m not alone. Here are the statisticians Bruno de Finetti and L. J. Savage on this matter: “Every progress in scientific thought involves struggle against the distortion of outlook generated by the habits of ordinary language, the deficiencies and captiousness of which must be denounced, combatted, and overcome, case by case.”↩︎

  4. Freyd and Scedrov also begin their book Categories, Allegories with an analogous definition of categories, though on a cursory look, they seem to revert to a standard characterization shortly thereafter (thanks to Andrej Bauer for that information!).↩︎

Monday, 1 July 2024

A Proof of Proof by Infinite Descent

WARNING: the following contains a whole lot of pedantry about proving theorems at a level of detail such that you could likely convince a computer proof assistant of their correctness. I teach a course where students learn to write such proofs without computer assistance, because doing so can profoundly affect how one perceives the possibilities for formal definition and proof. 1 Sometimes the default reasoning principles provided by a proof assistant can restrict one’s perspective of what reasoning principles are acceptable, how and why they might be judged acceptable, and how to devise and validate new ones.

More relevant to the present post is that even the literature and pedagogy of writing paper proofs is rife with defaults. I spend a lot of time teaching about proof by induction, and I find that my students, most of whom have learned something about proving stuff in the past, often arrive constrained by a very regimented approach to induction, involving mandatory “base cases” and “induction cases” and these nebulous “induction hypotheses” which seem to descend from the sky when invoked by arcane ceremony. I have some strong feelings about this, mostly due to my own struggles with gaining comfort with induction as a student, so I try to help students evade those struggles.

Taken at face value, this post is about some other proof principle, proof by infinite descent, but at its heart it’s really about induction. Ideally by the end of this post, I will have challenged, but also shed light on, the nature of inductive proof cases and induction hypotheses. As a bonus, you might see how induction can be used to explain another proof principle that intuitively seems valid, but for which it may not be obvious how to justify it formally.

Two Proofs That \(\sqrt{2}\) is Not Rational

In an excellent blog post about the difference between proof of negation and proof by contradiction, Andrej Bauer gives as an example a proof that \(\sqrt{2}\) is not a rational number. The article is worth a read, but I would note in particular that in a technical sense, his proof is not a proof that it is an irrational number, just that it is not a rational number. I imagine that one could also prove that \(\sqrt{-1}\) is not rational, but that had better not count as a proof that it is irrational! However, if we also prove that \(\sqrt{2}\) is a real number, which could be done using Dedekind cuts for instance, then we can deduce that \(\sqrt{2}\) is indeed irrational. And to no one’s surprise, we can prove that \(\sqrt{-1}\) is not a real number.

Theorem: \(\sqrt{2}\) is not rational.
Proof. Suppose \(\sqrt{2}\) were equal to a fraction \(\frac{a}{b}\) with \(a\) and \(b\) relatively prime. Then we would get \(a^2 = 2b^2\), hence \(a^2\) is even and so is \(a\). Write \(a = 2c\) and plug it back in to get \(2c^2 = b^2\), from which we conclude that \(b\) is even as well. This is a contradiction since \(a\) and \(b\) were assumed to be relatively prime. QED.

This is a proof of the negation of some proposition, i.e., that assuming the truth of the proposition implies a contradiction. Let’s consider some of its details.

First, the proof assumes that \(\sqrt{2}\) is equal to a fraction \(\frac{a}{b}\), which implicitly assumes that \(a\) and \(b\) are integers and that \(b\) is not 0. We could limit the two numbers to be naturals, by which I mean non-negative integers, applying the convention that \(\sqrt{2}\) denotes specifically the positive root. But since there exists a negative number that, when squared, equals 2, and since generalizing the claim plays into my overall story, let’s make no commitment to whether \(a\) and \(b\) are positive or negative.

Next, the proof uses the fact that every rational number can be represented by a fraction \(\frac{a}{b}\) such that \(a\) and \(b\) are relatively prime, in other words that no positive integer other than 1 divides both of them evenly, which is a precise way of saying that \(\frac{a}{b}\) is in reduced form. This is a fact that needs to be proved separately and prior to this proof. We could instead integrate this reasoning directly into the proof: omit the relative primality assumption, and then let \(a' = a/\gcd(a,b)\) and \(b' = b/\gcd(a,b)\), where \(\gcd(a,b)\) is the greatest common divisor of the two. It then follows from some proof effort that \(a'\) and \(b'\) are relatively prime and \(\frac{a'}{b'} = \frac{a}{b}\), thus completing the general proof. However, we’re still using an auxiliary fact that can be proven by induction: that \(a\) and \(b\) have a unique greatest common divisor. More on that below.
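The reduction step itself is mechanical, and a small Python sketch may make it concrete (the function name reduce_fraction is mine; this only illustrates the arithmetic, not the proof obligations):

    from math import gcd

    def reduce_fraction(a, b):
        # Given integers a and b with b != 0, return (a', b') with a' and b'
        # relatively prime and a'/b' equal to a/b.
        g = gcd(a, b)   # math.gcd ignores signs and is positive whenever b != 0
        return a // g, b // g

    assert reduce_fraction(6, 4) == (3, 2)
    assert reduce_fraction(-6, 4) == (-3, 2)

The “proof effort” alluded to above corresponds to showing that the pair this function returns really is relatively prime and denotes the same rational.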

From these assumptions and facts, the proof deduces that both \(a\) and \(b\) are even, which means that \(2\) divides both of them, which contradicts the assumption that among positive integers only 1 can do this. So assuming that \(\sqrt{2}\) is rational leads to contradiction.

Now, to drive home the topic of this post, here’s a more annoying proof of the same fact:

Proof. Suppose \(\sqrt{2}\) were equal to a fraction \(\frac{a}{b}\). Then \(a\) can’t be \(0\) because \(\left(\frac{0}{b}\right)^2 = 0 \neq 2\). Then we would get \(a^2 = 2b^2\), hence \(a^2\) is even and so is \(a\). Write \(a = 2c\) and plug it back in to get \(2c^2 = b^2\), from which we conclude that \(b\) is even as well. Write \(b = 2d\) and plug it back in to get \(c^2 = 2d^2\). This yields a contradiction because we could divide our integers by \(2\) forever, which is impossible. QED.

On the upside, this proof gets to omit the assumption that \(a\) and \(b\) are relatively prime, which in turn means we don’t need to consult Wikipedia to remind ourselves what the heck “relatively prime” means anyway. But on the other hand...what just happened?!? Intuitively this makes sense, but is this legit?

While Andrej’s proof is a direct proof of negation, this annoying one is an example of proof by infinite descent, which according to Wikipedia is a particular kind of proof by contradiction, and relies on something called the well-ordering principle. By the end of this post, I hope to convince you that: a) it is not a kind of proof by contradiction: rather it’s a technique for proving negation, so this claim deserves some finger-wagging from Andrej; b) it does not rely on the well-ordering principle, and a good thing too, because if it did, then the annoying proof would be broken, but it’s not.

We should be able to ascertain from a formal statement of the principle that it is simply a proof of negation. Andrej’s post elaborates on the source of confusion. And what is the problem with the claim that proof by infinite descent relies on well-ordering, whatever that is? To explain, let’s first pinpoint a more fundamental and general property, called well-foundedness, and a reasoning principle, well-founded induction, upon which the principle of infinite descent can be based and even generalized, and use those to explain the confusion. Along the way, we’ll produce a different explanation of proof by infinite descent in terms of induction: an explanation that suits my induction-happy brain better. Also, we’ll get the opportunity to investigate induction itself: in particular, I want to demonstrate that the common “base case/induction case” recipe for induction is a useful but limiting view of how induction can be applied.

Well-foundedness and Well-ordering

As you might guess from the above two proofs, they depend on some notion of “getting smaller”, and in the case of natural numbers the default notion is the less-than relation \(<\), but less-than-ness is just an example of a more general phenomenon. So first, let’s isolate the property of \(<\) on natural numbers that really powers induction. A binary relation \(\sqsubset\) is called well-founded if all leftward chains of the form \(\dots \sqsubset x_3 \sqsubset x_2 \sqsubset x_1\) must be finite in length, so there cannot exist any infinite descending chains. The \(<\) relation on natural numbers has this property: if you start from any \(n\) and keep producing smaller numbers, you will always have to stop, at \(0\) at the latest or earlier by choice. On the other hand, \(<\) extended to integers does not have this property because you can go forever down the negative numbers, e.g., \(\dots -16 < -8 < -4 < -2\). Luckily, \(<\) isn’t the only binary relation on integers! Here’s a different one that is well-founded: if \(z_1,z_2 \in \mathbb{Z}\), say \(z_1 \mathbin{\mathbin{\lvert<\rvert}}z_2\) if and only if \(\lvert z_1\rvert < \lvert z_2\rvert\), where \(\lvert z\rvert\) denotes the absolute value of \(z\) and \(<\) is restricted to natural numbers. Then \(2 \mathbin{\mathbin{\lvert<\rvert}}3\) but also \(-2 \mathbin{\mathbin{\lvert<\rvert}}3\), \(2 \mathbin{\mathbin{\lvert<\rvert}}{-3}\), and \(-2 \mathbin{\mathbin{\lvert<\rvert}}{-3}\): the sign of the integer is ignored! Notice, though, that \(2\) and \(-2\) are not related by \(\mathbin{\mathbin{\lvert<\rvert}}\) in either direction. Since every leftward chain of the \(\mathbin{\mathbin{\lvert<\rvert}}\) relation works its way toward \(0\), the center of the number line, we can see that all such chains must be finite.
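To make \(\mathbin{\lvert<\rvert}\) concrete, here is a small Python sketch (the names abs_lt and descend_forever are mine, purely for illustration); the loop at the end shows that any attempt to descend along \(\mathbin{\lvert<\rvert}\) halts:

    import random

    def abs_lt(z1, z2):
        # z1 |<| z2 iff |z1| < |z2|; the signs are ignored.
        return abs(z1) < abs(z2)

    assert abs_lt(2, 3) and abs_lt(-2, 3) and abs_lt(2, -3) and abs_lt(-2, -3)
    # 2 and -2 are distinct yet unrelated in either direction.
    assert not abs_lt(2, -2) and not abs_lt(-2, 2)

    def descend_forever(z):
        # Repeatedly pick an arbitrary |<|-smaller integer. Every run halts,
        # because |z| strictly decreases and nothing is |<|-below 0.
        chain = [z]
        while abs(z) > 0:
            z = random.choice(range(-abs(z) + 1, abs(z)))
            chain.append(z)
        return chain

    print(descend_forever(-8))   # e.g. [-8, 5, -2, 1, 0]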

Since “well-ordering” and “well-founded” both start with “well-”, you might suspect that there’s a relationship between them, and you’d be right. Well-orderedness is described in a variety of ways in the literature, but for our purposes the following is the most relevant formulation: a well-founded relation \(\sqsubset\) is a well-ordering if and only if it relates every distinct pair of elements, i.e. \(x_1 \neq x_2\) implies either \(x_1 \sqsubset x_2\) or \(x_2 \sqsubset x_1\). So based on what we’ve observed, \(<\) on the natural numbers is a well-ordering, but \(\mathbin{\mathbin{\lvert<\rvert}}\) is not (remember \(2\) and \(-2\)?).

To tie up a loose terminological end, consider the “-ordering” and “-founded” suffixes. Well-orderings must be transitive: if \(x_1 \sqsubset x_2\) and \(x_2 \sqsubset x_3\) then \(x_1 \sqsubset x_3\). The consequent must hold because if it didn’t, then either \(x_1 = x_3\), in which case \(x_1 \sqsubset x_2 \sqsubset x_1\) is a two-element cycle, or totality would force \(x_3 \sqsubset x_1\), which introduces a three-element cycle \(\dots \sqsubset x_3 \sqsubset x_1 \sqsubset x_2 \sqsubset x_3\); either cycle yields an infinite descending chain and breaks well-foundedness. Then, given some well-ordering \(\sqsubset\), the relation \(\sqsubseteq\), obtained by extending \(\sqsubset\) to be reflexive, is a total order, i.e., a partial order (reflexive, transitive, antisymmetric relation) that is also total. On the other hand, a well-founded relation does not have to be transitive, let alone total, so extending one in this manner is not even guaranteed to get you a partial order, let alone a total one. Hence the suffix “-founded”.

Proof by Infinite Descent, Generalized

Now that we have some grasp of well-foundedness, and why well-orderedness is a red herring, let’s use it to state the Principle of Proof by Infinite Descent precisely and generally.

Theorem: Principle of Infinite Descent
Let \(X\) be a set and \(\sqsubset\) a well-founded relation on \(X\). Furthermore let \(\Phi(x)\) be some property of elements \(x\) of \(X\). Then if for every \(x\) in \(X\), \(\Phi(x)\) implies \(\Phi(y)\) for some \(y \sqsubset x\), it follows that \(\Phi(x)\) does not hold for any \(x\) in \(X\).

Stated in formal logical notation: \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\]

The premise of the principle formalizes the idea that “no matter where you are (whichever \(x\)), you can still make a ‘descending’ step,” and the conclusion is that “well, if so, then your property can’t apply to any \(x\): it leads to contradiction.” Arguably a better name for the principle would be Proof by No Infinite Descent (thanks to Andrej for this observation!).

Our proof that \(\sqrt{2}\) is not rational implicitly applies this principle: Let \(X\) be the integers \(\mathbb{Z}\); for integers \(a,b\), let \(a \sqsubset b\) be \(a \mathbin{\mathbin{\lvert<\rvert}}b\); and let the property \(\Phi(a)\) be “\(a^2 = 2b^2\) for some integer \(b\)” (formally, \(\exists b\in \mathbb{Z}.\,a^2 = 2b^2\)). Assuming \(a \neq 0\) and taking \(b\), the proof uses division by \(2\) to produce a \(c \mathbin{\mathbin{\lvert<\rvert}}a\) for which some \(d\) satisfies the equation. Having satisfied the premises of the principle, we use it to deduce that no \(a\) satisfies the property (though admittedly the prose proof shrouds this use of the principle in squishy “forever is impossible” terminology).
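The descent step of that argument can be written down as a little Python sketch (descend_step is my name for it; it is an illustration, not part of the proof): given a pair witnessing the property with \(a \neq 0\), it produces a \(\mathbin{\lvert<\rvert}\)-smaller witness.

    def descend_step(a, b):
        # Assumes a != 0 and a*a == 2*b*b. Under that assumption a and b are
        # both even, so halving is exact.
        assert a != 0 and a * a == 2 * b * b
        c, d = a // 2, b // 2
        assert c * c == 2 * d * d     # the property persists...
        assert abs(c) < abs(a)        # ...at a strictly |<|-smaller first component
        return c, d

If any integer pair satisfied \(a^2 = 2b^2\) with \(a \neq 0\), iterating descend_step would shrink \(\lvert a\rvert\) forever, which is exactly the infinite descent the principle rules out; so no such pair exists, and the function can never actually be called with its precondition satisfied.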

The intuition for the principle is pretty straightforward: if the premise holds, then starting from any element that satisfies the property we could build an infinite descending \(\sqsubset\)-chain of elements that satisfy it. But there are no such infinite chains, because well-foundedness says that all descending chains must be finite, so a contradiction must follow.

But let’s buttress this intuition with rigour. We can prove this principle as a consequence of:

The Principle of Well-founded Induction: Let \(X\) be a set and \(\sqsubset\) a well-founded relation on \(X\). Furthermore let \(\Psi(x)\) be some property of elements \(x\) of \(X\). Then if for every \(x\) in \(X\), \(\Psi(y)\) holding for every \(y \sqsubset x\) suffices to prove \(\Psi(x)\), it follows that \(\Psi(x)\) holds for every \(x\) in \(X\).

Stated in formal logical notation: \[(\forall x\in X.\, (\forall y \in X.\, y \sqsubset x \implies \Psi(y)) \implies \Psi(x)) \implies \forall x\in X.\, \Psi(x).\]

This principle is quite abstract, since it’s stated in terms of an arbitrary well-founded relation. To clarify, let’s instantiate it with particular well-founded relations that lead to principles with which we are familiar. Since we already observed that \(<\) is a well-founded relation on natural numbers (because all well-orderings are also well-founded), we can specialize well-founded induction to get what is often called strong mathematical induction or course-of-values induction on natural numbers: \[(\forall n\in \mathbb{N}.\, (\forall m \in \mathbb{N}.\, m < n \implies \Psi(m)) \implies \Psi(n)) \implies \forall n\in \mathbb{N}.\, \Psi(n).\]

The general structure of a proof by course-of-values induction is to assume some natural number \(n\), then prove that assuming that \(\Psi(m)\) holds for every number less than \(n\) suffices to prove \(\Psi(n)\). Achieving that suffices to prove that \(\Psi\) holds for every natural number \(n\). Observe that the assumption “\(\Psi(m)\) holds for every number less than \(n\)” is what gets called an “induction hypothesis”. In the common style of prose proofs, the “induction hypothesis” often gets invoked in the middle of an argument, and for a specific value, rather than explicitly assumed at the beginning of the argument in full generality. To me this always seemed like some magical thing that got pulled out of the air in the middle of the proof, leaving me unsure about what kind of thing induction even was. But in fact the induction hypothesis is just an assumption that you make as you attempt to prove \(\Psi(n)\), and its justification is hidden in the proof of the principle of well-founded induction. Surprise: you can prove induction principles correct, and devise and prove correct new ones: they need not be taken as axioms!

Though students have often seen strong mathematical induction, they are more familiar with plain ole’ mathematical induction, where the “induction hypothesis” is to assume that a property holds for the natural number one less than the current one. We can get this by using the well-founded relation \(n_1 <_1 n_2\), which holds if and only if \(n_2 = n_1 + 1\). This relation is neither total nor an order, since it only ever relates abutting natural numbers. But, curiously enough, we can still plug it into the Principle of Proof by Infinite Descent if we want, and get something meaningful, albeit more restrictive than what we get using \(<\).
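To make the specialization concrete, plugging \(<_1\) in for \(\sqsubset\) in the Principle of Well-founded Induction gives: \[(\forall n\in \mathbb{N}.\, (\forall m \in \mathbb{N}.\, n = m + 1 \implies \Psi(m)) \implies \Psi(n)) \implies \forall n\in \mathbb{N}.\, \Psi(n).\] When \(n = 0\) there is no \(m\) with \(n = m + 1\), so the inner premise holds vacuously and \(\Psi(0)\) must be proven outright; when \(n = m + 1\), the inner premise amounts to assuming \(\Psi(m)\). That is the familiar base-case/inductive-case shape, derived rather than decreed.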

Advertisement: I first learned about this perspective on induction from Glynn Winskel’s excellent textbook, The Formal Semantics of Programming Languages. I highly recommend it.

What even is a “base case”?

A common structure used in proofs that proceed by either mathematical induction or course-of-values induction, or at least in teaching such proofs, is to assume \(n\), assume the induction hypothesis, and then proceed by cases on whether \(n = 0\) or whether \(n > 0\). The first case gets called the “base case” and is proven outright, and the second case gets called the “inductive case” which exploits “the induction hypothesis” to complete that case. When using induction more generally, as is common in PL theory, this notion of base cases and induction cases gets emphasized as well. However, it’s worth looking back at the principle of well-founded induction and observing that nothing about that theorem forces a particular case analysis, or restricts the availability of the premise that gets called “the induction hypothesis”. The funny thing is that the “base case” also has a perfectly fine induction hypothesis, but typically it is useless because most of the time the proof in question is not equipped to conjure an \(m < n\) with which to exploit it.

This template for applying the proof by course-of-values induction is so common that often the split into base case vs. inductive case, where the base case does not appeal to induction, gets baked right into the explanation of the proof principle, as if it were part of the approach. I find this categorical specialization to be unfortunate for two reasons.

The first reason is that sometimes a proof needs a case split, but this particular case split is the wrong one: in that it does not help push the proof through. My favourite demonstration of this phenomenon is a proof that any two positive natural numbers have a greatest common divisor. This proof could be extended to integers, and thus apply to Andrej’s proof as discussed above, but doing so would deviate more from my main point than I would like. The proof involves a helper lemma followed by a main theorem.

Lemma. For all positive natural numbers \(a\) and for all positive natural numbers \(b\), if \(a \geq b\) then there exists a positive natural number \(c\) such that \(xc = a\) for some positive natural \(x\) and \(yc = b\) for some positive natural \(y\), and furthermore if \(d\) is a positive natural number that satisfies these properties, then \(c \geq d\).

This lemma can be proven by course-of-values induction. Then the statement of the main theorem has the same structure, but does not assume \(a \geq b\). It can be proven by splitting cases on whether \(a \geq b\) or \(b \geq a\) and then applying the lemma with the two numbers appropriately ordered (either order will work if the numbers are equal).

I’m only going to give a sketch of the lemma’s proof, since it suffices to make the relevant point. It proceeds by course-of-values induction on \(a\), where the relevant property of \(a\) is the substatement “for all positive natural numbers \(b\) ...” (now you know why I explicitly separated the two variable quantifications in the prose statement of the lemma). After assuming the induction hypothesis, assuming \(b\), and assuming \(a \geq b\), we proceed by cases on whether \(a = b\) or \(a > b\). If \(a = b\) then we can show that the greatest common divisor is the number itself, so \(a\). If \(a > b\), then we can show that the greatest common divisor of \(a-b\) and \(b\) is also the greatest common divisor of \(a\) and \(b\). However, to appropriately apply the induction hypothesis, we must first determine which of \(a-b\) and \(b\) is larger: since \(b\) is positive, we know that \(a-b < a\), and we previously assumed \(b < a\), so either can play the role of “smaller \(a\)”, but to make sufficient progress with the induction hypothesis, we need “next \(b\)” to be at most “smaller \(a\)”, so we split cases on whether \(a-b < b\) or not, arbitrarily letting the alternate case also handle \(a-b = b\).
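For those who prefer to see the recursion, here is a minimal Python sketch of the computation lurking behind that case analysis (the name gcd_sub is mine; it assumes \(a \geq b \geq 1\), mirroring the lemma’s hypotheses):

    def gcd_sub(a, b):
        # Assumes a >= b >= 1.
        if a == b:
            return a                  # no appeal to an "induction hypothesis"
        # Here a > b, and gcd(a, b) = gcd(a - b, b).
        if a - b < b:
            return gcd_sub(b, a - b)  # reorder so the first argument is >= the second
        return gcd_sub(a - b, b)

    assert gcd_sub(12, 8) == 4

Each recursive call’s first argument is strictly smaller than \(a\), which is exactly what licenses the appeal to the induction hypothesis in the proof sketch.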

The interesting part of the above proof is that two cases are considered, and only one of them appeals to the induction hypothesis. But the case that does not is not an \(a=0\) case as per the default template. In fact \(a = 0\) could never apply since we assumed that \(a\) is positive.

For exactness’ sake, it’s worth pointing out that one can perfectly well devise a principle of course-of-values induction that applies to just the positive integers, such that \(a = 1\) would be viewed as the “base case”. If we were to separate out that case, we could deduce that \(b = 1\), because it must be a positive integer and it must be less than or equal to \(a\). So in practice, the \(a = 1\) case can get handled using the exact same reasoning as the other \(a = b\) cases, which happen to circumscribe those situations where no appeal to an “induction hypothesis” is required.

So the lesson to be drawn from my first gripe is that when performing induction, if we reconceive of “base cases” to mean those cases that do not appeal to an induction hypothesis, then such cases need not coincide with the cases that involve the minimal elements of the well-founded relation (or the unique minimum element of a well-ordered relation). For many proofs the default heuristic helps, but that default is not a universal law of induction.

No base case? No problem!

To reiterate, my first problem with how course-of-values induction is taught is that the default base-case/inductive-case split may not be the most fruitful one. The second problem I have is that you may not even need such a split. Usually it is assumed that there must be some “base case” that does not get to exploit the induction hypothesis, for how could the hypothesis ever apply there? But in some proofs you can always exploit the induction hypothesis. This is exactly the case in the following proof of the Principle of Infinite Descent, which uses the Principle of Well-founded Induction.

Theorem: Principle of Infinite Descent \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\]

Proof. Suppose \((\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y))\). Now we prove \(\forall x\in X.\, \neg\Phi(x)\) by well-founded induction on \(x\in X\). We will use as our predicate \(\Psi(x) := \neg\Phi(x)\), which is synonymous with \(\Phi(x) \implies \bot\), i.e. \(\Phi(x)\) implies contradiction. Suppose \(x \in X\), and suppose \(\forall y \in X.\, y \sqsubset x \implies \neg\Phi(y)\). At this point it suffices to show \(\neg\Phi(x)\), for which it suffices to assume \(\Phi(x)\) and deduce a contradiction. So suppose \(\Phi(x)\). Now, applying our first assumption to this particular \(x\), we learn \(\Phi(x) \implies \exists y \in X.\, y \sqsubset x \land \Phi(y)\), and combining this with our most recent assumption, we learn that \(\exists y \in X.\, y \sqsubset x \land \Phi(y)\), so consider such a \(y \in X\). Since \(y \sqsubset x\), we can use the induction hypothesis to prove \(\neg\Phi(y)\). But we also have \(\Phi(y)\), to which we can apply \(\neg\Phi(y)\) to deduce a contradiction. QED.

This proof of proof by infinite descent used well-founded induction, but did not need to consider distinct cases related to \(x \in X\). There is no “base case” here, in the sense that no part of the proof must be separately completed without the help of an induction hypothesis. Now, when specialized to natural numbers, one could introduce such a split, breaking out the \(0\) case and arguing directly that there is no smaller number, without reference at all to \(\Phi(x)\). Or more nebulously, we could split on “minimal elements” versus “non-minimal elements”, but to what end?

Part way back around, without proof by contradiction

The following is a bonus: thanks to Andrej who made the observation that inspired it.

We’ve proven that if \(\sqsubset\) is well-founded, then the principle of infinite descent applies. What about the reverse? The answer is “it depends.” In particular, if you prefer to work without proof by contradiction (i.e. constructively), then you can only get part way there. But with some use of proof by contradiction, you can get all the way there. In particular, we can prove that for an arbitrary binary relation \(R\) on a set \(X\), if it satisfies the statement of the principle of infinite descent, then it has no infinite descending chains. Wait, isn’t that enough? Well, no, not in a world where you want to work constructively, without proof by contradiction. I sneakily defined well-foundedness as the assertion that all chains are finite, which is a stronger statement than the assertion that there can be no infinite descending chains, so long as you avoid proof by contradiction. The positive conception is nice because it works in both contexts, whereas “no infinite descending chains”, which is often taken as the definition of well-foundedness in classical logic, does not.

Ok, on to the proof. First, let’s get precise about descending chains. Define a descending chain of a binary relation \(R\) to be a partial function \(f : \mathbb{N} \rightharpoonup X\) such that for all natural numbers \(n\), if \(f(n+1)\) is defined, then so is \(f(n)\), and furthermore \(f(n+1) \mathrel{R} f(n)\). Then an infinite descending chain is a descending chain that is defined for every natural number \(n\).
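As a throwaway illustration (representing partiality by returning None, with names of my own choosing), the defining condition of a descending chain can be checked up to any finite bound in Python:

    from typing import Callable, Optional

    def is_descending_chain(f: Callable[[int], Optional[int]],
                            R: Callable[[int, int], bool],
                            bound: int) -> bool:
        # For n = 0 .. bound-1: if f(n+1) is defined, then f(n) must be
        # defined and f(n+1) R f(n).
        for n in range(bound):
            if f(n + 1) is not None:
                if f(n) is None or not R(f(n + 1), f(n)):
                    return False
        return True

    # An infinite descending chain for < on the integers: ..., -8, -4, -2.
    assert is_descending_chain(lambda n: -(2 ** (n + 1)), lambda x, y: x < y, 100)

An infinite descending chain is one defined at every natural number; a finite check like this can only validate the chain conditions up to a bound, which is all we need for illustration.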

Now, we can prove that if \(R\) is an arbitrary binary relation (rather than a well-founded one) that nonetheless satisfies the letter of the principle of infinite descent, then \(R\) has no infinite descending chains.

Proof. Suppose \(R \subseteq X\times X\), and suppose that for any property \(\Phi(x)\): \[(\forall x\in X.\, \Phi(x) \implies \exists y \in X.\, y \mathrel{R} x \land \Phi(y)) \implies \forall x\in X.\, \neg\Phi(x).\] Then it suffices to assume \(f : \mathbb{N} \to X\) is an infinite descending chain and then deduce a contradiction. To do so, it further suffices to let \(\Phi(x) := \exists n\in\mathbb{N}.\,f(n)=x\) and prove the premise of the proof-by-infinite-descent-shaped assumption, because from that we can deduce that \(\forall x\in X.\,\neg\exists n\in \mathbb{N}.\,f(n)=x\) (i.e. \(f\) is totally undefined, the empty partial function), which combined with \(x = f(0)\) and \(n = 0\) suffices to deduce a contradiction.

Let’s prove the sufficient condition: suppose \(x \in X\) and \(\Phi(x)\), i.e. that \(f(n) = x\) for some \(n \in \mathbb{N}\). Then \(\Phi(f(n+1))\) holds, and by assumption \(f(n+1) \mathrel{R} f(n)\), which can be combined to prove the existential conclusion. QED.

Coda

According to John Stillwell’s Elements of Algebra (Section 2.10), Blaise Pascal was “the first to use induction in the ’base case, induction step’ format” in a paper from 1654. But our modern understanding of induction has its origins in Richard Dedekind’s seminal 1888 book “Was sind und was sollen die Zahlen?” (sometimes translated “What are the numbers and what are they for?”). Many defaults—be they notation, techniques, or explanations—in mathematics have historical origins that precede a more sophisticated understanding of phenomena, and sometimes an impedance mismatch between the historical and the modern can limit our understanding (the treatment of variables and functions by mathematical texts is one of my standard whipping horses, but that’s a topic for another day). Those perspectives persist because they were useful, and are still useful today, but it also helps to consider how they might best be reconciled with the refined conceptions we’ve gained since their inception.

As for the use of proof by infinite descent versus a proof like Andrej’s that argues directly from minimality, one could reasonably argue that explicitly calling out the existence of a minimal case is more intuitively graspable for some arguments. It’s not clear to me whether arguing from the perspective of “there must be one or more minimal cases” versus “you can’t go down this way forever” is more illuminating in general, especially once you have come to understand proof by infinite descent. Furthermore, in the case of general well-foundedness, there may not be a single smallest case, but rather a (possibly infinite) number of minimal cases, in which case calling all of them out abstractly may be less compelling than calling out a singular minimum as was the case (pun intended) for the natural numbers. Which argument is most enlightening probably depends on personal taste.

Acknowledgments

Many thanks to Andrej Bauer for feedback on this post! Thanks to folks on Hacker News who noted that the annoying proof did not address \(a=0\).


  1. For those interested in precise paper proofs, I highly recommend Leslie Lamport’s talk and paper on the topic.↩︎