<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Posts on stylewarning's screed</title><link>http://www.stylewarning.com/posts/</link><description>Recent content in Posts on stylewarning's screed</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><lastBuildDate>Thu, 02 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="http://www.stylewarning.com/posts/index.xml" rel="self" type="application/rss+xml"/><item><title>Idiomatic Lisp and the nbody benchmark</title><link>http://www.stylewarning.com/posts/nbody/</link><pubDate>Thu, 02 Apr 2026 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/nbody/</guid><description>&lt;p>When talking to Lisp programmers, you often hear something like, &amp;ldquo;adapt Lisp to your problem, not your problem to Lisp.&amp;rdquo; The basic idea is this: if Lisp doesn&amp;rsquo;t let you easily write a solution to your problem because it lacks some fundamental constructs that make expressing solutions easy, then add them to Lisp first, then write your solution.&lt;/p>
&lt;p>That sounds all well and good in the abstract, and maybe we could even come up with some toy examples—say, defining HTTP request routing logic in a nice DSL. But where&amp;rsquo;s a real example of this that&amp;rsquo;s not artificial or overengineered?&lt;/p>
&lt;p>Recently, on Twitter, I butted into the middle of an exchange between &lt;a href="https://x.com/Ngnghm">@Ngnghm&lt;/a> (a famous Lisp programmer) and &lt;a href="https://x.com/korulang">@korulang&lt;/a> (an account dedicated to a new language called Koru) about Lisp. I&amp;rsquo;m oversimplifying, but it went something like this:&lt;/p>
&lt;ul>
&lt;li>Lisp is slow.&lt;/li>
&lt;li>No it&amp;rsquo;s not!&lt;/li>
&lt;li>Yes it is!&lt;/li>
&lt;li>No it&amp;rsquo;s not!&lt;/li>
&lt;li>Then prove it!&lt;/li>
&lt;/ul>
&lt;p>Now, there&amp;rsquo;s plenty of evidence online that Common Lisp has reasonably good compilers that produce reasonably good machine code, and so the question became more nuanced: Can Lisp be realistically competitive with C without ending up being a mess of unidiomatic code?&lt;/p>
&lt;p>Our interlocutor @korulang proposed a benchmark, the &amp;ldquo;nbody&amp;rdquo; benchmark from the &lt;a href="https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/nbody.html#nbody">Computer Language Benchmarks Game&lt;/a>. This was of particular interest to them, because they used it as an object of study for their Koru language. To quote &lt;a href="https://www.korulang.org/blog/idiomatic-kernels-match-specialized-c">their blog post&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.&lt;/p>
&lt;p>The result was stronger than that.&lt;/p>
&lt;p>Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” — the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math: [&amp;hellip;]&lt;/p>
&lt;/blockquote>
&lt;p>and they proceeded to show Koru being 14% faster than C and 106% faster than Lisp.&lt;/p>
&lt;p>Now, putting aside that some of the code and blog post were written with LLMs, there are many questions that are left unanswered here, since computer architecture and operating system matter a lot (where did these benchmarks run?). Moreover, the author buries the lede a little bit and proceeds to show how we might write &amp;ldquo;unidiomatic&amp;rdquo; C to match the performance of Koru.&lt;/p>
&lt;p>I&amp;rsquo;m not concerned about nitpicking their approach or rigorously evaluating their claims, but I would like to dwell on this common refrain: &amp;ldquo;idiomatic&amp;rdquo;. What is that supposed to mean?&lt;/p>
&lt;p>&amp;ldquo;Idiomatic code&amp;rdquo; in the context of programming means something like &amp;ldquo;representative of a fluent computer programmer&amp;rdquo; and &amp;ldquo;aligned with the peculiar characteristics of the language&amp;rdquo;. Idiomatic code in a particular language shouldn&amp;rsquo;t stand out amongst other code in that language, and it should, in some sense, portray the identity of the language itself.&lt;/p>
&lt;p>Idiomatic C is the C that uses terse names, simple loops, and unsafe arithmetic.&lt;/p>
&lt;p>Idiomatic Haskell is the Haskell that uses short functions, higher-order abstractions, immutable data structures, and safe constructs.&lt;/p>
&lt;p>What about idiomatic Lisp? Well, here&amp;rsquo;s the rub. A fluent Lisp programmer doesn&amp;rsquo;t reach for one paradigmatic toolbox; they weave in and out of imperative, functional, object-oriented, etc. styles without much of a second thought. There&amp;rsquo;s a sort of &amp;ldquo;meta&amp;rdquo; characteristic to Lisp programming: you&amp;rsquo;re programming the language almost as much as you&amp;rsquo;re programming the program.&lt;/p>
&lt;p>Yes, Lisp has loops, but &amp;ldquo;loopy code&amp;rdquo; isn&amp;rsquo;t intrinsically &amp;ldquo;Lispy code&amp;rdquo;. Yes, Lisp has objects, but &amp;ldquo;OOPy code&amp;rdquo; isn&amp;rsquo;t intrinsically &amp;ldquo;Lispy code&amp;rdquo;. In my opinion, what makes code &amp;ldquo;Lispy&amp;rdquo; is whether or not the programmer used Lisp&amp;rsquo;s metaprogramming and/or built-in multi-paradigm facilities to a reasonable degree to make the solution to their problem efficient and easy to understand in some global sense. For some problems, that may be &amp;ldquo;loopy&amp;rdquo; or &amp;ldquo;OOPy&amp;rdquo; or something else. It&amp;rsquo;s finding a Pareto-efficient syntactic and semantic combination offered by the language, or perhaps one of the programmer&amp;rsquo;s own creation.&lt;/p>
&lt;p>So we get back to the @korulang benchmark challenge. Looking at &lt;a href="https://github.com/korulang/koru/tree/main/tests/regression/900_EXAMPLES_SHOWCASE/910_LANGUAGE_SHOOTOUT/nbody">their repository&lt;/a>:&lt;/p>
&lt;ul>
&lt;li>&lt;code>nbody.c&lt;/code> looks like idiomatic C;&lt;/li>
&lt;li>&lt;code>nbody.hs&lt;/code> looks like wildly unidiomatic Haskell, but the problem is, the idiomatic version would probably be slower;&lt;/li>
&lt;li>&lt;code>nbody.lisp&lt;/code> looks reasonable, though loopy, and could easily be improved; and&lt;/li>
&lt;li>The Koru solution &lt;code>kernel_fused.kz&lt;/code> looks idiomatic, as far as I can tell for not knowing anything about Koru.&lt;/li>
&lt;/ul>
&lt;p>I hesitate to say &lt;code>nbody.lisp&lt;/code> is idiomatic. It&amp;rsquo;s &lt;em>reasonable&lt;/em>, it&amp;rsquo;s &lt;em>straightforward&lt;/em> to any imperative-minded programmer, but it&amp;rsquo;s not Lispy. That doesn&amp;rsquo;t make it good or bad, but it does lead to the grand question:&lt;/p>
&lt;p>&lt;strong>Can we use Common Lisp to express a solution to the nbody benchmark in a way that reads more naturally than a direct-from-C port?&lt;/strong>&lt;/p>
&lt;p>I would say that, at face value, Koru&amp;rsquo;s solution is closer to what reads naturally relative to the problem itself. Here are the essential bits.&lt;/p>
&lt;pre tabindex="0">&lt;code>~std.kernel:shape(Body) {
x: f64, y: f64, z: f64,
vx: f64, vy: f64, vz: f64,
mass: f64,
}
~std.kernel:init(Body) {
{ x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
{ x: 4.84143144246472090e+00, y: -1.16032004402742839e+00, z: -1.03622044471123109e-01, vx: 1.66007664274403694e-03 * DAYS_PER_YEAR, vy: 7.69901118419740425e-03 * DAYS_PER_YEAR, vz: -6.90460016972063023e-05 * DAYS_PER_YEAR, mass: 9.54791938424326609e-04 * SOLAR_MASS },
{ x: 8.34336671824457987e+00, y: 4.12479856412430479e+00, z: -4.03523417114321381e-01, vx: -2.76742510726862411e-03 * DAYS_PER_YEAR, vy: 4.99852801234917238e-03 * DAYS_PER_YEAR, vz: 2.30417297573763929e-05 * DAYS_PER_YEAR, mass: 2.85885980666130812e-04 * SOLAR_MASS },
{ x: 1.28943695621391310e+01, y: -1.51111514016986312e+01, z: -2.23307578892655734e-01, vx: 2.96460137564761618e-03 * DAYS_PER_YEAR, vy: 2.37847173959480950e-03 * DAYS_PER_YEAR, vz: -2.96589568540237556e-05 * DAYS_PER_YEAR, mass: 4.36624404335156298e-05 * SOLAR_MASS },
{ x: 1.53796971148509165e+01, y: -2.59193146099879641e+01, z: 1.79258772950371181e-01, vx: 2.68067772490389322e-03 * DAYS_PER_YEAR, vy: 1.62824170038242295e-03 * DAYS_PER_YEAR, vz: -9.51592254519715870e-05 * DAYS_PER_YEAR, mass: 5.15138902046611451e-05 * SOLAR_MASS },
}
| kernel k |&amp;gt;
std.kernel:step(0..iterations)
|&amp;gt; std.kernel:pairwise {
const dx = k.x - k.other.x;
const dy = k.y - k.other.y;
const dz = k.z - k.other.z;
const dsq = dx*dx + dy*dy + dz*dz;
const mag = DT / (dsq * @sqrt(dsq));
k.vx -= dx * k.other.mass * mag;
k.vy -= dy * k.other.mass * mag;
k.vz -= dz * k.other.mass * mag;
k.other.vx += dx * k.mass * mag;
k.other.vy += dy * k.mass * mag;
k.other.vz += dz * k.mass * mag;
}
|&amp;gt; std.kernel:self {
k.x += DT * k.vx;
k.y += DT * k.vy;
k.z += DT * k.vz;
}
| computed c |&amp;gt;
capture({ energy: @as(f64, 0) })
| as acc |&amp;gt;
for(0..5)
| each i |&amp;gt;
captured { energy: acc.energy + 0.5*c[i].mass*(c[i].vx*c[i].vx+c[i].vy*c[i].vy+c[i].vz*c[i].vz) }
|&amp;gt; for(i+1..5)
| each j |&amp;gt;
captured { energy: acc.energy - c[i].mass*c[j].mass / @sqrt((c[i].x-c[j].x)*(c[i].x-c[j].x)+(c[i].y-c[j].y)*(c[i].y-c[j].y)+(c[i].z-c[j].z)*(c[i].z-c[j].z)) }
| captured final |&amp;gt;
std.io:print.blk {
{{ final.energy:d:.9 }}
}
&lt;/code>&lt;/pre>&lt;p>Can we achieve something similar in Lisp?&lt;/p>
&lt;p>First, let&amp;rsquo;s make a baseline. I&amp;rsquo;m running Ubuntu Noble with an &amp;ldquo;AMD RYZEN AI MAX+ PRO 395&amp;rdquo; with a clock speed that varies between 0.6–5 GHz. I am also using SBCL 2.6.3 and gcc 13.3. Using &lt;code>nbody.lisp&lt;/code> as a starting point, &lt;a href="https://raw.githubusercontent.com/stylewarning/lisp-random/refs/heads/master/nbody/conventional.lisp">I modified it&lt;/a> for a few easy wins. I&amp;rsquo;ll call this version &lt;code>nbody-lisp-conventional&lt;/code>. A quick benchmark reveals that the loopy Lisp code is only about 20% slower than the C code compiled with &lt;code>gcc -O3 -ffast-math -march=native&lt;/code>.&lt;/p>
&lt;pre tabindex="0">&lt;code>$ ./nbody-lisp-conventional 50000000
-0.169286396
timing: 2000 ms
$ ./nbody-c 50000000
-0.169286396
timing: 1662 ms
&lt;/code>&lt;/pre>&lt;p>As a Lisp programmer, I&amp;rsquo;m not surprised that it&amp;rsquo;s a little slower. The number of person-years that have gone into C compilers to optimize idiomatic C code makes the development effort behind SBCL, the most popular open-source Lisp compiler, look like a rounding error.&lt;/p>
&lt;p>Now that we have a baseline, our goal is to come up with a nicer Lisp program that also improves the timing.&lt;/p>
&lt;p>Our approach will be simple. We will create a &lt;code>library.lisp&lt;/code> that contains new language constructs of a similar ilk to Koru, and we will use them to implement the nbody benchmark in &lt;code>impl.lisp&lt;/code>. Some rules:&lt;/p>
&lt;ul>
&lt;li>No compile-time precomputation or caching. I can&amp;rsquo;t just compute the answer at compile time, or cache a sub-computation that makes the full one trivial.&lt;/li>
&lt;li>No fundamental algorithm changes. I can&amp;rsquo;t use a different integrator, for example.&lt;/li>
&lt;li>Using assembly is allowed, but it must only make use of the facilities offered by the Lisp compiler (i.e., no external tools), and the implementation of nbody itself must be understandable without knowing assembly. In other words, it should be sufficiently hidden, and in principle easily substitutable with portable code.&lt;/li>
&lt;li>Library code must be in principle useful for other similar tasks. It should not be hyper-specialized to this specific problem instance, but instead be useful for this general class of problems.&lt;/li>
&lt;/ul>
&lt;p>The third rule is more restrictive than it looks. It means we can&amp;rsquo;t just have a &lt;code>solve-nbody&lt;/code> function that dispatches to assembly.&lt;/p>
&lt;p>To accomplish the above, we define a kernel DSL. The DSL allows us to express how the elements of a composite structure are transformed, while maintaining just enough invariants for them to be handled efficiently. These kernels are then compiled into code more efficient than ordinary loopy Lisp allows for.&lt;/p>
&lt;p>Our attention will be focused on a proof-of-concept library of functionality for writing particle simulators. The operators we define are:&lt;/p>
&lt;ul>
&lt;li>&lt;code>define-kernel-shape&lt;/code>: Define the data to be transformed by each kernel. This would be the data to characterize the static and dynamic properties of a particle in motion, as well as the number of particles under consideration.&lt;/li>
&lt;li>&lt;code>define-kernel-step&lt;/code>: Define a kernel as a sequence of existing ones.&lt;/li>
&lt;li>&lt;code>define-self-kernel&lt;/code>: Define a read-write kernel that operates on each element independently, without access to other elements (i.e., a &lt;em>map&lt;/em> operation).&lt;/li>
&lt;li>&lt;code>define-pairwise-kernel&lt;/code>: Define a read-write kernel that operates on all pairs of elements, reduced by symmetry (i.e., &lt;code>(i,j)&lt;/code> and &lt;code>(j,i)&lt;/code> are considered only once).&lt;/li>
&lt;li>&lt;code>define-reduction-kernel&lt;/code>: Define a read-only kernel that does reduction of a sequence into a single value (i.e., a &lt;em>reduce&lt;/em> operation).&lt;/li>
&lt;/ul>
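&lt;p>To make the self and pairwise semantics concrete, here is the iteration pattern in an illustrative Python sketch (my own, not the library&amp;rsquo;s code): the pairwise kernel visits each unordered pair exactly once, so a single update can write to both elements.&lt;/p>

```python
def pairwise(elements, update):
    # Visit each unordered pair {i, j} exactly once, mirroring the
    # symmetry reduction described for define-pairwise-kernel.
    n = len(elements)
    for i in range(n):
        for j in range(i + 1, n):
            update(elements[i], elements[j])

def self_map(elements, update):
    # The "self" kernel pattern: update each element independently.
    for e in elements:
        update(e)
```

&lt;p>For five bodies, the pairwise kernel fires 10 times per step rather than 20, which is exactly the saving the symmetry reduction buys in the force calculation.&lt;/p>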
&lt;p>This collection of five operators forms a miniature, re-usable language. These broadly recapitulate those of Koru, and allow us to write something that looks like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-lisp" data-lang="lisp">(defconstant +solar-mass+ (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">4d0&lt;/span> pi pi))
(defconstant +days-per-year+ &lt;span style="color:#ae81ff">365.24d0&lt;/span>)
(defconstant +dt+ &lt;span style="color:#ae81ff">0.01d0&lt;/span>)
(define-kernel-shape body &lt;span style="color:#ae81ff">5&lt;/span>
x y z vx vy vz mass)
(defparameter *system*
(make-body-system
(&lt;span style="color:#a6e22e">list&lt;/span> &lt;span style="color:#e6db74">:x&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span> &lt;span style="color:#e6db74">:y&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span> &lt;span style="color:#e6db74">:z&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span>
&lt;span style="color:#e6db74">:vx&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span> &lt;span style="color:#e6db74">:vy&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span> &lt;span style="color:#e6db74">:vz&lt;/span> &lt;span style="color:#ae81ff">0d0&lt;/span>
&lt;span style="color:#e6db74">:mass&lt;/span> +solar-mass+)
&lt;span style="color:#f92672">...&lt;/span>))
(define-pairwise-kernel advance-forces (s body dt)
(&lt;span style="color:#66d9ef">let*&lt;/span> ((dx (&lt;span style="color:#a6e22e">-&lt;/span> i.x j.x))
(dy (&lt;span style="color:#a6e22e">-&lt;/span> i.y j.y))
(dz (&lt;span style="color:#a6e22e">-&lt;/span> i.z j.z))
(dsq (&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> dx dx) (&lt;span style="color:#a6e22e">*&lt;/span> dy dy)) (&lt;span style="color:#a6e22e">*&lt;/span> dz dz)))
(mag (&lt;span style="color:#a6e22e">/&lt;/span> dt (&lt;span style="color:#a6e22e">*&lt;/span> dsq (&lt;span style="color:#a6e22e">sqrt&lt;/span> dsq)))))
(&lt;span style="color:#66d9ef">let&lt;/span> ((dm-j (&lt;span style="color:#a6e22e">*&lt;/span> mag j.mass))
(dm-i (&lt;span style="color:#a6e22e">*&lt;/span> mag i.mass)))
(decf i.vx (&lt;span style="color:#a6e22e">*&lt;/span> dx dm-j))
(decf i.vy (&lt;span style="color:#a6e22e">*&lt;/span> dy dm-j))
(decf i.vz (&lt;span style="color:#a6e22e">*&lt;/span> dz dm-j))
(incf j.vx (&lt;span style="color:#a6e22e">*&lt;/span> dx dm-i))
(incf j.vy (&lt;span style="color:#a6e22e">*&lt;/span> dy dm-i))
(incf j.vz (&lt;span style="color:#a6e22e">*&lt;/span> dz dm-i)))))
(define-self-kernel advance-positions (s body dt)
(incf self.x (&lt;span style="color:#a6e22e">*&lt;/span> dt self.vx))
(incf self.y (&lt;span style="color:#a6e22e">*&lt;/span> dt self.vy))
(incf self.z (&lt;span style="color:#a6e22e">*&lt;/span> dt self.vz)))
(define-reduction-kernel (energy e &lt;span style="color:#ae81ff">0d0&lt;/span>) (s body)
(&lt;span style="color:#e6db74">:self&lt;/span>
(&lt;span style="color:#a6e22e">+&lt;/span> e (&lt;span style="color:#a6e22e">*&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">0.5d0&lt;/span> self.mass)
(&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> self.vx self.vx) (&lt;span style="color:#a6e22e">*&lt;/span> self.vy self.vy))
(&lt;span style="color:#a6e22e">*&lt;/span> self.vz self.vz)))))
(&lt;span style="color:#e6db74">:pair&lt;/span>
(&lt;span style="color:#66d9ef">let*&lt;/span> ((dx (&lt;span style="color:#a6e22e">-&lt;/span> i.x j.x))
(dy (&lt;span style="color:#a6e22e">-&lt;/span> i.y j.y))
(dz (&lt;span style="color:#a6e22e">-&lt;/span> i.z j.z)))
(&lt;span style="color:#a6e22e">-&lt;/span> e (&lt;span style="color:#a6e22e">/&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> i.mass j.mass)
(&lt;span style="color:#a6e22e">sqrt&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> dx dx) (&lt;span style="color:#a6e22e">*&lt;/span> dy dy))
(&lt;span style="color:#a6e22e">*&lt;/span> dz dz))))))))
(define-kernel-step run-simulation (system body n &lt;span style="color:#e6db74">:params&lt;/span> ((dt &lt;span style="color:#66d9ef">double-float&lt;/span>)))
(advance-forces dt)
(advance-positions dt))
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Well, in fact, this isn&amp;rsquo;t an idealized approximation; it&amp;rsquo;s almost exactly &lt;a href="https://raw.githubusercontent.com/stylewarning/lisp-random/refs/heads/master/nbody/impl.lisp">how it turned out&lt;/a>. Given that this is a proof of concept, we sometimes have to write some Lisp things a little funny. For example, you&amp;rsquo;ll notice we write:&lt;/p>
&lt;pre tabindex="0">&lt;code>(+ (+ (* dx dx) (* dy dy)) (* dz dz))
&lt;/code>&lt;/pre>&lt;p>instead of the far more readable&lt;/p>
&lt;pre tabindex="0">&lt;code>(+ (* dx dx) (* dy dy) (* dz dz))
&lt;/code>&lt;/pre>&lt;p>Both are completely valid and both can be used. So why the former? It&amp;rsquo;s a consequence of a limitation in a little feature I built: auto-vectorization. The vectorizer walks the mathematical expressions and replaces them with fast SIMD variants. Here&amp;rsquo;s a little fragment showing this rewrite rule:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-lisp" data-lang="lisp">&lt;span style="color:#f92672">...&lt;/span>
(case (&lt;span style="color:#a6e22e">car&lt;/span> expr)
&lt;span style="color:#75715e">;; (+ a (* b c)) -&amp;gt; fmadd(a,b,c)&lt;/span>
((&lt;span style="color:#a6e22e">+&lt;/span>)
(&lt;span style="color:#66d9ef">let&lt;/span> ((args (&lt;span style="color:#a6e22e">cdr&lt;/span> expr)))
(cond
((and (&lt;span style="color:#a6e22e">=&lt;/span> (&lt;span style="color:#a6e22e">length&lt;/span> args) &lt;span style="color:#ae81ff">2&lt;/span>) (mul-p (&lt;span style="color:#a6e22e">second&lt;/span> args)))
&lt;span style="color:#f92672">`&lt;/span>(%%fmadd-pd &lt;span style="color:#f92672">,&lt;/span>(xf (&lt;span style="color:#a6e22e">first&lt;/span> args))
&lt;span style="color:#f92672">,&lt;/span>(xf (&lt;span style="color:#a6e22e">second&lt;/span> (&lt;span style="color:#a6e22e">second&lt;/span> args)))
&lt;span style="color:#f92672">,&lt;/span>(xf (&lt;span style="color:#a6e22e">third&lt;/span> (&lt;span style="color:#a6e22e">second&lt;/span> args)))))
&lt;span style="color:#f92672">...&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>The implementation of these kernel macros in &lt;a href="https://raw.githubusercontent.com/stylewarning/lisp-random/refs/heads/master/nbody/library.lisp">&lt;code>library.lisp&lt;/code>&lt;/a> weighs in at just under 700 lines, and includes optional x64 SIMD auto-vectorization.&lt;/p>
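&lt;p>The shape of that rewrite rule is easy to model outside of Lisp. Here&amp;rsquo;s an illustrative Python sketch (mine, not the library&amp;rsquo;s code) of the same bottom-up pattern match, which also shows why the nested binary &lt;code>+&lt;/code> matters: each two-argument sum whose second argument is a product becomes a fused multiply-add, and the flat three-argument sum would not match the pattern.&lt;/p>

```python
def mul_p(expr):
    # True when EXPR is a multiplication node.
    return isinstance(expr, tuple) and expr[0] == '*'

def rewrite(expr):
    # Rewrite (+ a (* b c)) into ('fmadd', a, b, c), recursing bottom-up
    # so inner sums are fused before the enclosing one is inspected.
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [rewrite(a) for a in args]
    if op == '+' and len(args) == 2 and mul_p(args[1]):
        _, b, c = args[1]
        return ('fmadd', args[0], b, c)
    return (op, *args)
```

&lt;p>Running it on the nested sum from above turns &lt;code>(+ (+ (* dx dx) (* dy dy)) (* dz dz))&lt;/code> into a chain of two fused multiply-adds, which is precisely the code shape the binary-&lt;code>+&lt;/code> convention is there to expose.&lt;/p>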
&lt;p>Now, for the nail-biting moment: how does it compare? I made a &lt;a href="https://raw.githubusercontent.com/stylewarning/lisp-random/refs/heads/master/nbody/Makefile">Makefile&lt;/a> that compares the idiomatic C against the loopy Lisp against our kernel DSL Lisp, reporting the median of three runs. Running this on my computer gives:&lt;/p>
&lt;pre tabindex="0">&lt;code>$ make bench
=== C (gcc -O3 -ffast-math) ===
-0.169286396
runs: 1657 1664 1653 ms
median: 1657 ms
=== Lisp (SBCL, conventional loops) ===
-0.169286396
runs: 1991 2009 2005 ms
median: 2005 ms
=== Lisp (SBCL, kernel syntax) ===
-0.169286396
runs: 1651 1651 1652 ms
median: 1651 ms
&lt;/code>&lt;/pre>&lt;p>So, in fact, we have matched the performance of C almost exactly. Furthermore, the generated code is still not as lean as it could be. Not to put too fine a point on it, but, &lt;strong>&amp;lt;100 lines of Lisp&lt;/strong>, supported by&lt;/p>
&lt;ul>
&lt;li>700 lines of library code and about 4 hours of my time; and&lt;/li>
&lt;li>500k lines of its host compiler &lt;code>sbcl&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>has performance parity and greater readability/reusability than &amp;lt;100 lines of C&lt;/strong>, supported by&lt;/p>
&lt;ul>
&lt;li>~5,000k lines of just the C part of its host compiler &lt;code>gcc&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>None of this is to make an argument that Lisp is &amp;ldquo;better&amp;rdquo;, or that there isn&amp;rsquo;t merit to &lt;em>avoiding&lt;/em> custom DSLs in certain circumstances, or that the world doesn&amp;rsquo;t have room for more custom home-grown compilers and parsers, but I think this is the clearest possible, quasi-realistic demonstration that idiomatic Lisp can be as fast as idiomatic C without tremendous work, whilst netting additional benefits unique to Lisp.&lt;/p>
&lt;p>All code is available &lt;a href="https://github.com/stylewarning/lisp-random/tree/master/nbody">here&lt;/a>.&lt;/p></description></item><item><title>Beating Bellard's formula</title><link>http://www.stylewarning.com/posts/beating-bellard/</link><pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/beating-bellard/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;p>Fabrice Bellard came up with a computationally efficient formula for
calculating the &lt;em>n&lt;/em>th hexadecimal digit of $\pi$ without calculating any of
the previous &lt;em>n&lt;/em>−1. It&amp;rsquo;s called
&lt;a href="https://en.wikipedia.org/wiki/Bellard%27s_formula">Bellard&amp;rsquo;s formula&lt;/a>.
It wasn&amp;rsquo;t the first of its kind, but in terms of computational efficiency,
it was a substantial improvement over the original, elegant
&lt;a href="https://en.wikipedia.org/wiki/Bailey%E2%80%93Borwein%E2%80%93Plouffe_formula">Bailey-Borwein-Plouffe formula&lt;/a>.
Due to the trio&amp;rsquo;s discovery, these formulas are often called &lt;em>BBP-type formulas&lt;/em>.&lt;/p>
&lt;p>Over the years, numerous BBP-type formulas have been discovered. In fact,
&lt;a href="https://www.davidhbailey.com/dhbpapers/bbp-formulas.pdf">Bailey&lt;/a> gives us
a recipe to search for them using integer-relation algorithms. In
simple terms, we can guess formulas and run a computation to check, with
high confidence, whether a candidate equals $\pi$. If we do find one, we
can take it as a conjecture to prove formally.&lt;/p>
&lt;p>Like Bellard and many others, I ran a variant of Bailey&amp;rsquo;s recipe, effectively
doing a brute-force search, highly optimized and in parallel. The search yielded
another formula that is computationally more efficient than Bellard&amp;rsquo;s formula. The
identity is as follows:&lt;/p>
&lt;p>$$
\pi = \sum_{k=0}^{\infty} \frac{1}{4096^k} \left( \frac{1}{6k+1}
- \frac{2^{-5}}{6k+3}
+ \frac{2^{-8}}{6k+5}
+ \frac{2}{8k+1}
- \frac{2^{-5}}{8k+5}
+ \frac{2^{-1}}{12k+3}
- \frac{2^{-4}}{12k+7}
- \frac{2^{-8}}{12k+11} \right).
$$&lt;/p>
&lt;p>It converges at a rate of 12 bits per term. We will prove convergence, and then
prove the identity itself (with a little computer assistance). As it turns out,
an equivalent form of this formula was already discovered, which we will discuss
as well. Finally, we&amp;rsquo;ll show a very simple implementation in Common Lisp.&lt;/p>
&lt;h2 id="proof-of-convergence">Proof of convergence&lt;/h2>
&lt;p>Write the series as $S := \sum_{k=0}^{\infty} 4096^{-k}R(k)$. Since
$R(k)\in O(1/k)$, convergence is dominated by the geometric term $4096^{-k}$:&lt;/p>
&lt;p>$$
\lim_{k \to \infty} \left\vert \frac{R(k+1)}{4096^{k+1}} \middle/ \frac{R(k)}{4096^{k}} \right\vert = \frac{1}{4096}.
$$&lt;/p>
&lt;p>By the ratio test, the series converges absolutely. Since $4096 = 2^{12}$,
each additional term contributes exactly 12 bits of precision.&lt;/p>
&lt;p>Bellard&amp;rsquo;s formula converges at 10 bits per term and requires the evaluation
of 7 fractions. The above converges at 12 bits per term and requires the
evaluation of 8 fractions. So while we require about 17% fewer terms
($1 - \tfrac{10}{12}$), each term requires about 14% more arithmetic
($\tfrac{8}{7}$). Net, this formula is approximately 5% more efficient
($1 - \tfrac{10}{12}\cdot\tfrac{8}{7} \approx 0.048$).&lt;/p>
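&lt;p>A quick numerical sanity check (an illustrative Python sketch, not part of the proof) confirms both the limit and the convergence rate:&lt;/p>

```python
import math

def term(k):
    # One summand R(k) of the series S.
    return (1/(6*k + 1) - 2**-5/(6*k + 3) + 2**-8/(6*k + 5)
            + 2/(8*k + 1) - 2**-5/(8*k + 5)
            + 2**-1/(12*k + 3) - 2**-4/(12*k + 7) - 2**-8/(12*k + 11))

def partial(n):
    # Partial sum of the first n terms of S = sum_k 4096^(-k) R(k).
    return sum(term(k) / 4096**k for k in range(n))
```

&lt;p>With only five terms the partial sum already agrees with $\pi$ to full double precision, consistent with 12 bits gained per term.&lt;/p>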
&lt;h2 id="proof-of-identity-via-a-definite-integral">Proof of identity via a definite integral&lt;/h2>
&lt;p>Consider $1/(nk+j) = \int_{0}^{1} x^{nk+j-1} dx$. For positive integers $n$ and $b$, we get&lt;/p>
&lt;p>$$
\sum_{k=0}^{\infty} \frac{1}{b^k}\cdot\frac{1}{nk+j} = \sum_{k=0}^{\infty} \int_{0}^{1} \left(\frac{x^n}{b}\right)^k x^{j-1} dx.
$$&lt;/p>
&lt;p>We can swap the sum and integral via the Lebesgue dominated convergence theorem, since the
power series $\sum (x^n/b)^k$ converges uniformly for $x \in [0, 1]$ and $b &amp;gt; 1$. Using this
and summing the geometric series gives:&lt;/p>
&lt;p>$$
\int_{0}^{1} x^{j-1} \sum_{k=0}^{\infty} \left(\frac{x^n}{b}\right)^k dx = \int_{0}^{1} \frac{x^{j-1}}{1 - x^n/b} dx.
$$&lt;/p>
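&lt;p>This sum-to-integral correspondence is easy to check numerically. The sketch below (illustrative Python, with a simple Simpson&amp;rsquo;s rule standing in for symbolic integration) compares the two sides for several of the $(n, j)$ pairs appearing in $S$:&lt;/p>

```python
def series_side(n, j, b, terms=20):
    # Left side: sum over k of b^(-k) / (n*k + j).
    return sum(1 / (b**k * (n*k + j)) for k in range(terms))

def integral_side(n, j, b, steps=2000):
    # Right side: Simpson's rule for the integral over [0, 1]
    # of x^(j-1) / (1 - x^n / b).  `steps` must be even.
    h = 1.0 / steps
    def f(x):
        return x**(j - 1) / (1 - x**n / b)
    total = f(0.0) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3
```

&lt;p>The integrand is smooth on $[0, 1]$ when $b &gt; 1$, so a modest step count already agrees with the series to many digits.&lt;/p>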
&lt;p>We now apply this to $S$ termwise with $b=4096=2^{12}$:&lt;/p>
&lt;p>$$
S = \int_0^1 \left( \frac{x^{0}}{1 - \frac{x^6}{2^{12}}}
- 2^{-5} \frac{x^{2}}{1 - \frac{x^6}{2^{12}}}
+ 2^{-8} \frac{x^{4}}{1 - \frac{x^6}{2^{12}}}
+ 2 \frac{x^{0}}{1 - \frac{x^8}{2^{12}}}
- 2^{-5} \frac{x^{4}}{1 - \frac{x^8}{2^{12}}}
+ 2^{-1} \frac{x^{2}}{1 - \frac{x^{12}}{2^{12}}}
- 2^{-4} \frac{x^{6}}{1 - \frac{x^{12}}{2^{12}}}
- 2^{-8} \frac{x^{10}}{1 - \frac{x^{12}}{2^{12}}}
\right) dx.
$$&lt;/p>
&lt;p>At this point, you could try to algebra your way through, expanding, using the
substitution $x=2u$, etc. ultimately yielding a nice denominator
$(u^2\pm 2u+2)(u^6-64)(u^{12}-1)$. Maybe compute some residues. Or, just CAS your
way through.&lt;/p>
&lt;pre tabindex="0">&lt;code>% fricas
FriCAS Computer Algebra System
Version: FriCAS 2025.12.23git built with sbcl 2.5.2.1852-1f3beec71
Timestamp: Wed Mar 4 12:41:38 EST 2026
-----------------------------------------------------------------------------
Issue )copyright to view copyright notices.
Issue )summary for a summary of useful system commands.
Issue )quit to leave FriCAS and return to shell.
-----------------------------------------------------------------------------
(1) -&amp;gt; f := (1/(1 - x^6/4096))
- (1/32)*x^2/(1 - x^6/4096)
+ (1/256)*x^4/(1 - x^6/4096)
+ 2*1/(1 - x^8/4096)
- (1/32)*x^4/(1 - x^8/4096)
+ (1/2)*x^2/(1 - x^12/4096)
- (1/16)*x^6/(1 - x^12/4096)
- (1/256)*x^10/(1 - x^12/4096);
Type: Fraction(Polynomial(Fraction(Integer)))
(2) -&amp;gt; normalize(integrate(f, x = 0..1))
3 1 11 19 1
(2) 2 atan(-) - 2 atan(-) + 2 atan(--) + 2 atan(--) + 2 atan(-)
2 2 24 48 4
Type: Expression(Fraction(Integer))
&lt;/code>&lt;/pre>&lt;p>So now we just need to show the arctans all collapse to $\pi$. Recall the identity&lt;/p>
&lt;p>$$
\tan^{-1} a \pm \tan^{-1} b = \tan^{-1}\left(\frac{a\pm b}{1\mp ab}\right).
$$&lt;/p>
&lt;p>The sum of the first four terms can be calculated easily in Common Lisp:&lt;/p>
&lt;pre tabindex="0">&lt;code>% sbcl --no-inform
* (defun combine (a b) (/ (+ a b) (- 1 (* a b))))
COMBINE
* (reduce #'combine '(3/2 -1/2 11/24 19/48))
4
&lt;/code>&lt;/pre>&lt;p>So we have $2\big(\tan^{-1}4 + \tan^{-1}(1/4)\big)$, and with our final elementary trig
identity $\tan^{-1} (a/b) = \pi/2 - \tan^{-1} (b/a)$, we find $S = \pi$.&lt;/p>
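&lt;p>If you&amp;rsquo;d rather not take the Lisp REPL&amp;rsquo;s word for it, the same reduction can be replayed with exact rationals; here&amp;rsquo;s the equivalent check in Python (illustration only):&lt;/p>

```python
from fractions import Fraction
from functools import reduce

def combine(a, b):
    # arctan addition: atan(a) + atan(b) = atan((a + b)/(1 - a*b)), mod pi.
    return (a + b) / (1 - a*b)

# The first four arctan arguments from the FriCAS output.
args = [Fraction(3, 2), Fraction(-1, 2), Fraction(11, 24), Fraction(19, 48)]
result = reduce(combine, args)  # exactly 4
```

&lt;p>A floating-point check of the full five-term sum against $\pi$ confirms that no branch of the arctan identity was crossed along the way.&lt;/p>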
&lt;h2 id="a-new-discovery">A new discovery?&lt;/h2>
&lt;p>Of course, I was excited to find this formula, but after some internet spelunking, it turns out
it had already been discovered by &lt;a href="https://web.archive.org/web/20181225155904if_/http://gery.huvent.pagesperso-orange.fr:80/index_explorer_net.htm">Géry Huvent&lt;/a>
and &lt;a href="http://www.pi314.net/eng/hypergse6.php">Boris Gourévitch&lt;/a>, perhaps independently. Gourévitch
doesn&amp;rsquo;t credit Huvent as he does with other formulas, but he does say
&amp;ldquo;[&amp;hellip;] furthermore, we can obtain BBP formula [&amp;hellip;] by using what Gery Huvent calls
the denomination tables [&amp;hellip;].&amp;rdquo; Daisuke Takahashi cites Huvent&amp;rsquo;s website in
&lt;a href="https://tsukuba.repo.nii.ac.jp/record/2001720/files/RJ_51-1-177.pdf">this 2019 paper&lt;/a> published in
&lt;em>The Ramanujan Journal&lt;/em>. In all cases, they write the formula in the following way:&lt;/p>
&lt;p>$$
\frac{1}{128} \sum _{k=0}^{\infty} \frac{1}{2^{12k}}\left(
\frac{768}{24 k+3}+\frac{512}{24k+4}+\frac{128}{24 k+6}-\frac{16}{24 k+12}-\frac{16}{24 k+14}-\frac{12}{24
k+15}+\frac{2}{24 k+20}-\frac{1}{24 k+22}\right),
$$&lt;/p>
&lt;p>which is structurally equivalent to $S$.&lt;/p>
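&lt;p>The equivalence is easy to check: dividing each numerator by 128 and reducing the denominators (e.g. $\frac{1}{128}\cdot\frac{768}{24k+3} = \frac{2}{8k+1}$) recovers the eight fractions of $S$ exactly, term by term. An illustrative Python sketch confirms the partial sums behave identically:&lt;/p>

```python
import math

def huvent_term(k):
    # One summand of the Huvent/Gourevitch form.
    return (768/(24*k + 3) + 512/(24*k + 4) + 128/(24*k + 6)
            - 16/(24*k + 12) - 16/(24*k + 14) - 12/(24*k + 15)
            + 2/(24*k + 20) - 1/(24*k + 22))

def huvent_partial(n):
    # Partial sum of the first n terms, scaled by the leading 1/128.
    return sum(huvent_term(k) / 2**(12*k) for k in range(n)) / 128
```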
&lt;p>Despite having been known already, this formula doesn&amp;rsquo;t appear to be &lt;em>well&lt;/em> known. As such, I
hope this blog post brings more attention to it.&lt;/p>
&lt;h2 id="simple-implementation">Simple implementation&lt;/h2>
&lt;p>Here is a simple implementation of digit extraction using BBP-type formulas in
Common Lisp:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-lisp" data-lang="lisp">(defun %pow2-mod (exponent modulus)
(cond
((&lt;span style="color:#a6e22e">=&lt;/span> modulus &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#ae81ff">0&lt;/span>)
((&lt;span style="color:#a6e22e">zerop&lt;/span> exponent) &lt;span style="color:#ae81ff">1&lt;/span>)
(&lt;span style="color:#66d9ef">t&lt;/span>
(&lt;span style="color:#66d9ef">let&lt;/span> ((result &lt;span style="color:#ae81ff">1&lt;/span>)
(base (&lt;span style="color:#a6e22e">mod&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> modulus))
(e exponent))
(loop &lt;span style="color:#e6db74">:while&lt;/span> (&lt;span style="color:#a6e22e">plusp&lt;/span> e) &lt;span style="color:#e6db74">:do&lt;/span>
(when (&lt;span style="color:#a6e22e">oddp&lt;/span> e)
(setf result (&lt;span style="color:#a6e22e">mod&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> result base) modulus)))
(setf base (&lt;span style="color:#a6e22e">mod&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> base base) modulus)
e (&lt;span style="color:#a6e22e">ash&lt;/span> e &lt;span style="color:#ae81ff">-1&lt;/span>)))
result))))
(defun %scaled-frac-of-power-two (exponent denom)
(cond
((&lt;span style="color:#a6e22e">&amp;gt;=&lt;/span> exponent &lt;span style="color:#ae81ff">0&lt;/span>)
(&lt;span style="color:#66d9ef">let&lt;/span> ((residue (%pow2-mod exponent denom)))
(&lt;span style="color:#a6e22e">floor&lt;/span> (&lt;span style="color:#a6e22e">ash&lt;/span> residue *precision-bits*) denom)))
(&lt;span style="color:#66d9ef">t&lt;/span>
(&lt;span style="color:#66d9ef">let&lt;/span> ((effective-bits (&lt;span style="color:#a6e22e">+&lt;/span> *precision-bits* exponent)))
(&lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#a6e22e">minusp&lt;/span> effective-bits)
&lt;span style="color:#ae81ff">0&lt;/span>
(&lt;span style="color:#a6e22e">floor&lt;/span> (&lt;span style="color:#a6e22e">ash&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> effective-bits) denom))))))
(defun %series-scaled-frac (bit-index bbp-series k-step global-shift alternating-p)
&lt;span style="color:#75715e">;; A series is a list of series terms. A series term is a quadruple&lt;/span>
&lt;span style="color:#75715e">;; (SIGN SHIFT DENOM-MULTIPLIER DENOM-OFFSET) representing the summand&lt;/span>
&lt;span style="color:#75715e">;; SIGN * 2^SHIFT / (DENOM_MULTIPLIER * k + DENOM_OFFSET).&lt;/span>
(&lt;span style="color:#66d9ef">let*&lt;/span> ((modulus (&lt;span style="color:#a6e22e">ash&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> *precision-bits*))
(max-shift (loop &lt;span style="color:#e6db74">:for&lt;/span> term &lt;span style="color:#e6db74">:in&lt;/span> bbp-series &lt;span style="color:#e6db74">:maximize&lt;/span> (&lt;span style="color:#a6e22e">second&lt;/span> term)))
(k-max (&lt;span style="color:#a6e22e">max&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> (&lt;span style="color:#a6e22e">ceiling&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> bit-index &lt;span style="color:#75715e">; conservative bound&lt;/span>
global-shift
max-shift
*precision-bits*
*guard-bits*)
k-step))))
(loop &lt;span style="color:#e6db74">:with&lt;/span> acc &lt;span style="color:#e6db74">:=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;span style="color:#e6db74">:for&lt;/span> k &lt;span style="color:#e6db74">:from&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#e6db74">:to&lt;/span> k-max &lt;span style="color:#e6db74">:do&lt;/span>
(&lt;span style="color:#66d9ef">let&lt;/span> ((k-sign (&lt;span style="color:#66d9ef">if&lt;/span> (and alternating-p (&lt;span style="color:#a6e22e">oddp&lt;/span> k)) &lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>))
(k-factor (&lt;span style="color:#a6e22e">*&lt;/span> k-step k)))
(dolist (term bbp-series)
(destructuring-bind (term-sign shift den-mul den-add) term
(&lt;span style="color:#66d9ef">let*&lt;/span> ((denom (&lt;span style="color:#a6e22e">+&lt;/span> den-add (&lt;span style="color:#a6e22e">*&lt;/span> den-mul k)))
(exponent (&lt;span style="color:#a6e22e">+&lt;/span> bit-index global-shift shift (&lt;span style="color:#a6e22e">-&lt;/span> k-factor)))
(piece (%scaled-frac-of-power-two exponent denom))
(signed (&lt;span style="color:#a6e22e">*&lt;/span> k-sign term-sign)))
(when (&lt;span style="color:#a6e22e">plusp&lt;/span> piece)
(setf acc (&lt;span style="color:#a6e22e">mod&lt;/span> (&lt;span style="color:#a6e22e">+&lt;/span> acc (&lt;span style="color:#a6e22e">*&lt;/span> signed piece)) modulus)))))))
&lt;span style="color:#e6db74">:finally&lt;/span> (return acc))))
(defun %nth-hex-from-series (n terms k-step global-shift alternating-p)
(&lt;span style="color:#66d9ef">let*&lt;/span> ((bit-index (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> n)))
(&lt;span style="color:#a6e22e">ldb&lt;/span> (&lt;span style="color:#a6e22e">byte&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> (&lt;span style="color:#a6e22e">-&lt;/span> *precision-bits* &lt;span style="color:#ae81ff">4&lt;/span>))
(%series-scaled-frac bit-index
terms
k-step
global-shift
alternating-p))))
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This implementation uses Lisp&amp;rsquo;s arbitrary precision integer arithmetic.
A &amp;ldquo;real&amp;rdquo; implementation would use more efficient arithmetic, but
this will suffice for some basic testing. Now we can write functions
to use the Bellard formula and the new formula:&lt;/p>
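&lt;p>(As an aside: readers who want to sanity-check the helper arithmetic outside of Lisp can use a rough Python transcription of the two helpers above. This is an illustrative sketch, not code from this post; &lt;code>PRECISION_BITS&lt;/code> stands in for the &lt;code>*precision-bits*&lt;/code> special variable, here fixed at 64.)&lt;/p>

```python
# Illustrative Python transcription of %POW2-MOD and
# %SCALED-FRAC-OF-POWER-TWO (not from the post).
# PRECISION_BITS stands in for the *precision-bits* special variable.
PRECISION_BITS = 64

def pow2_mod(exponent, modulus):
    """2^exponent mod modulus by square-and-multiply.
    (Python's built-in pow(2, exponent, modulus) is equivalent.)"""
    if modulus == 1:
        return 0
    result, base, e = 1, 2 % modulus, exponent
    while e > 0:
        if e & 1:
            result = (result * base) % modulus
        base = (base * base) % modulus
        e >>= 1
    return result

def scaled_frac_of_power_two(exponent, denom):
    """First PRECISION_BITS fractional bits of 2^exponent / denom."""
    if exponent >= 0:
        return (pow2_mod(exponent, denom) << PRECISION_BITS) // denom
    effective_bits = PRECISION_BITS + exponent
    if effective_bits < 0:
        return 0
    return (1 << effective_bits) // denom
```

&lt;p>Back to the Lisp:&lt;/p>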
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-lisp" data-lang="lisp">(defparameter +bellard-terms+
&lt;span style="color:#f92672">&amp;#39;&lt;/span>((&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>)
(&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">8&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">2&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">7&lt;/span>)
(&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">9&lt;/span>)))
(defun bellard-nth-hex (n)
(%nth-hex-from-series (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> n) +bellard-terms+ &lt;span style="color:#ae81ff">10&lt;/span> &lt;span style="color:#ae81ff">-6&lt;/span> &lt;span style="color:#66d9ef">t&lt;/span>))
(defparameter +new-terms+
&lt;span style="color:#f92672">&amp;#39;&lt;/span>((&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">-5&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>)
(&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">-8&lt;/span> &lt;span style="color:#ae81ff">6&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span>)
(&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#ae81ff">8&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">-5&lt;/span> &lt;span style="color:#ae81ff">8&lt;/span> &lt;span style="color:#ae81ff">5&lt;/span>)
(&lt;span style="color:#ae81ff">+1&lt;/span> &lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">12&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">-4&lt;/span> &lt;span style="color:#ae81ff">12&lt;/span> &lt;span style="color:#ae81ff">7&lt;/span>)
(&lt;span style="color:#ae81ff">-1&lt;/span> &lt;span style="color:#ae81ff">-8&lt;/span> &lt;span style="color:#ae81ff">12&lt;/span> &lt;span style="color:#ae81ff">11&lt;/span>)))
(defun new-nth-hex (n)
(%nth-hex-from-series (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span> n) +new-terms+ &lt;span style="color:#ae81ff">12&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span>))
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Let&amp;rsquo;s make sure they agree for the first 1000 hex digits:&lt;/p>
&lt;pre tabindex="0">&lt;code>CL-USER&amp;gt; (loop :for i :below 1000
:always (= (bellard-nth-hex i) (new-nth-hex i)))
T
&lt;/code>&lt;/pre>&lt;p>And now let&amp;rsquo;s look at timing comparisons. Here&amp;rsquo;s a little driver:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-lisp" data-lang="lisp">(defun compare-timings (n)
(&lt;span style="color:#66d9ef">flet&lt;/span> ((time-it (f n)
(sb-ext:gc &lt;span style="color:#e6db74">:full&lt;/span> &lt;span style="color:#66d9ef">t&lt;/span>)
(&lt;span style="color:#66d9ef">let&lt;/span> ((start (&lt;span style="color:#a6e22e">get-internal-real-time&lt;/span>)))
(&lt;span style="color:#a6e22e">funcall&lt;/span> f n)
(&lt;span style="color:#a6e22e">-&lt;/span> (&lt;span style="color:#a6e22e">get-internal-real-time&lt;/span>) start))))
(loop &lt;span style="color:#e6db74">:repeat&lt;/span> n
&lt;span style="color:#e6db74">:for&lt;/span> index &lt;span style="color:#e6db74">:=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span> &lt;span style="color:#e6db74">:then&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">10&lt;/span> index)
&lt;span style="color:#e6db74">:for&lt;/span> bellard &lt;span style="color:#e6db74">:=&lt;/span> (time-it &lt;span style="color:#a6e22e">#&amp;#39;&lt;/span>bellard-nth-hex index)
&lt;span style="color:#e6db74">:for&lt;/span> new &lt;span style="color:#e6db74">:=&lt;/span> (time-it &lt;span style="color:#a6e22e">#&amp;#39;&lt;/span>new-nth-hex index)
&lt;span style="color:#e6db74">:do&lt;/span> (&lt;span style="color:#a6e22e">format&lt;/span> &lt;span style="color:#66d9ef">t&lt;/span> &lt;span style="color:#e6db74">&amp;#34;~v,&amp;#39; D: new is ~A% faster than bellard~%&amp;#34;&lt;/span> n index
(&lt;span style="color:#a6e22e">round&lt;/span> (&lt;span style="color:#a6e22e">*&lt;/span> &lt;span style="color:#ae81ff">100&lt;/span> (&lt;span style="color:#a6e22e">-&lt;/span> bellard new)) bellard)))))
&lt;/code>&lt;/pre>&lt;/div>&lt;p>And the results of timing up to the one millionth hexadecimal digit:&lt;/p>
&lt;pre tabindex="0">&lt;code>CL-USER&amp;gt; (compare-timings 7)
1 : new is 81% faster than bellard
10 : new is 7% faster than bellard
100 : new is 6% faster than bellard
1000 : new is 5% faster than bellard
10000 : new is 4% faster than bellard
100000 : new is 3% faster than bellard
1000000: new is 4% faster than bellard
&lt;/code>&lt;/pre>&lt;p>As predicted, though this is an imperfect test, the new formula is consistently faster across a few orders of magnitude.&lt;/p></description></item><item><title>The Clifford group as a permutation group</title><link>http://www.stylewarning.com/posts/clifford-permutation/</link><pubDate>Sun, 06 Jul 2025 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/clifford-permutation/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;p>The Clifford group is an important mathematical group that is
foundational to the field of quantum error correction and quantum
benchmarking. I&amp;rsquo;ve long been interested in computing with the Clifford
group.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>I wrote a paper called &lt;em>The Computational Structure of the Clifford
Groups&lt;/em> which gives an implementation perspective of the math. The
paper can be found on &lt;a href="https://www.european-lisp-symposium.org/static/proceedings/2018.pdf">page
44&lt;/a>
of the proceedings of the 2018 European Lisp Symposium.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Much of that paper is implemented in
&lt;a href="https://github.com/quil-lang/quilc/tree/master/src/clifford">QUILC&lt;/a>. I
helped write the Common Lisp code for manipulating Clifford group
elements specifically for the purpose of randomized benchmarking.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The Clifford group is a foundational tool in QUILC&amp;rsquo;s &lt;a href="https://coalton-lang.github.io/20220906-quantum-compiler/">discrete
compiler&lt;/a>,
which was implemented in Coalton.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>I wrote a fast Clifford circuit simulator in the main implementation
of the quantum virtual machine, called the &lt;a href="https://github.com/quil-lang/qvm/blob/master/src/stabilizer-qvm.lisp">stabilizer
QVM&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>In lieu of actually defining the Clifford group (the &lt;em>Computational
Structure&lt;/em> paper does that fine if you&amp;rsquo;re interested), I&amp;rsquo;ll tell you
one interesting fact about it: it is, in some sense, the largest
collection of quantum operations you can fit together such that adding
&lt;em>any other&lt;/em> would make the collection computationally universal.&lt;/p>
&lt;p>One of the greatest discoveries of the 20th century was the
tractability of studying finite groups on a computer. Specifically,
two breakthrough algorithms laid the foundation for the field of
computational group theory:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The &lt;em>Todd&amp;ndash;Coxeter algorithm&lt;/em> solves coset enumeration of finitely
presented groups (i.e., groups that are specified by a set of
symbols and rewrite equations). It was discovered in the 1930s, and
it&amp;rsquo;s even feasible to do by hand in some cases.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>For a permutation group $G$ generated by $\ell$ permutations of $n$
points, the &lt;em>Schreier&amp;ndash;Sims algorithm&lt;/em> produces a data structure for
$G$ in something like $O\big(n^2(\log\vert G\vert)^3+\ell n\log\vert
G\vert\big)$ time and $O(n^2\log\vert G\vert + \ell n)$ space. This
data structure lets us compute the size of $G$, determine whether
any permutation is an element of $G$, manufacture uniformly random
elements of $G$, and numerous other things.&lt;/p>
&lt;/li>
&lt;/ul>
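&lt;p>To make these objects concrete, here is a small Python sketch (an illustration, not from the post) that answers the same questions of group order and membership by brute-force closure. It is emphatically &lt;em>not&lt;/em> Schreier&amp;ndash;Sims (it enumerates every element, which is hopeless for large groups), but it shows the kind of data the real algorithm computes cleverly:&lt;/p>

```python
def compose(p, q):
    # (p . q)(i) = p[q[i]]; permutations are tuples on points 0..n-1.
    return tuple(p[i] for i in q)

def closure(generators):
    """Brute-force closure of a generating set: keep composing known
    elements with generators until nothing new appears. Feasible only
    for tiny groups; Schreier-Sims avoids this blowup entirely."""
    n = len(generators[0])
    identity = tuple(range(n))
    elements = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in elements:
                elements.add(h)
                frontier.append(h)
    return elements

# S_4 from a transposition and a 4-cycle.
s4 = closure([(1, 0, 2, 3), (1, 2, 3, 0)])
print(len(s4))             # 24, the order of S_4
print((3, 2, 1, 0) in s4)  # True: membership is a set lookup
```

&lt;p>A Schreier&amp;ndash;Sims base and strong generating set answers these same queries without ever listing the whole group.&lt;/p>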
&lt;p>These algorithms led to a revolution, resulting in advanced
mathematical software like GAP&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> which can be used to do truly
breathtaking group-theoretic calculations.&lt;/p>
&lt;p>Permutation groups remain the most practical and widely used algebraic
structure that admits feasible computation. My Common Lisp library
&lt;a href="https://github.com/stylewarning/cl-permutation">CL-PERMUTATION&lt;/a>
handles them nicely.&lt;/p>
&lt;p>The Clifford group, though, is usually constructively specified as a
matrix group generated by matrix and tensor products of a set of
generators. For example, let&lt;/p>
&lt;p>$$
\Gamma := \left\{
\begin{pmatrix}
1 &amp;amp; 0 \\
0 &amp;amp; 1
\end{pmatrix},
\begin{pmatrix}
1 &amp;amp; 0 \\
0 &amp;amp; i
\end{pmatrix},
\frac{1}{\sqrt{2}}
\begin{pmatrix}
1 &amp;amp; 1 \\
1 &amp;amp; -1
\end{pmatrix},
\begin{pmatrix}
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 \\
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 \\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 \\
0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0
\end{pmatrix}
\right\}.
$$&lt;/p>
&lt;p>Then the Clifford group on $n$ qubits is constructed as&lt;/p>
&lt;p>$$
C_n = \langle
g\in\Gamma^{\otimes n} \mid \dim g = 2^n
\rangle.
$$&lt;/p>
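&lt;p>To see the generators concretely, here is a short Python check (an illustration, not from the post) that the elements of $\Gamma$, namely the identity, the phase gate, the Hadamard gate, and CNOT, really are unitary, using nothing but nested lists:&lt;/p>

```python
# The four generators of Gamma as plain nested lists of complex numbers
# (illustration only; any linear-algebra library would do).
I2   = [[1, 0], [0, 1]]                            # identity
S    = [[1, 0], [0, 1j]]                           # phase gate diag(1, i)
H    = [[2**-0.5, 2**-0.5], [2**-0.5, -2**-0.5]]   # Hadamard
CNOT = [[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 0]]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dagger(a):
    n = len(a)
    return [[a[j][i].conjugate() for j in range(n)] for i in range(n)]

def is_unitary(a, tol=1e-12):
    n = len(a)
    prod = matmul(a, dagger(a))    # should be the identity matrix
    return all(abs(prod[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))

print(all(is_unitary(g) for g in (I2, S, H, CNOT)))  # True
```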
&lt;p>Such a construction is opaque at best, and admits little more than
mechanical simulation, or treating the group as a black box.&lt;/p>
&lt;p>Sometime around 2018, I had a discussion with my then-colleague &lt;a href="https://math.berkeley.edu/~hadfield/">Charles
Hadfield&lt;/a> about how we might be
able to leverage computational group theory to study the Clifford
group. Lazily, I had already Google&amp;rsquo;d around and wasn&amp;rsquo;t coming up with
anything. About a day later, Charles came back to me with an answer
that led me to have the biggest mathematical facepalm.&lt;/p>
&lt;p>The usual definition, which I avoided giving above, is full of
mathematical jargon about normalizers and quotients. Much simpler (and
of course essentially equivalent) is to see an element of the Clifford
group, say $g\in C_n$, as one that conjugates Pauli group ($P_n$)
elements to Pauli group elements. That is to say, $g\in C_n
\Leftrightarrow gP_ng^{-1} = P_n$.&lt;/p>
&lt;p>But this is none other than a permutation! It works as follows:&lt;/p>
&lt;ol>
&lt;li>Label each element of the Pauli group $P_n$ from $1$ to $\vert
P_n\vert=4^{n+1}$. Call them $(p_1, \ldots, p_{4^{n+1}})$.&lt;/li>
&lt;li>Compute $p_j = g p_i g^{-1}$ for each $i$.&lt;/li>
&lt;li>The permutation encoding of $g$ is thus the product of the maps
$i\mapsto j$.&lt;/li>
&lt;/ol>
&lt;p>In plain English, enumerate the Paulis and look at where they go under
conjugation. Paired with your favorite Clifford group generators, you
now have an object suitable for the machinery of computational group
theory.&lt;/p>
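&lt;p>The recipe above can be sketched for $n=1$ in a few lines of Python (an illustration, not the QUILC code): label the 16 single-qubit Paulis $i^k P$, hard-code how the Hadamard gate conjugates $I$, $X$, $Y$, and $Z$ (standard stabilizer facts: $HXH=Z$, $HZH=X$, $HYH=-Y$), and read off the permutation:&lt;/p>

```python
# Illustration (not the QUILC code): encode the Hadamard gate as a
# permutation of the 16 single-qubit Paulis i^k * P.
PAULIS = [(k, p) for k in range(4) for p in "IXYZ"]  # (phase exponent, letter)
INDEX = {e: i for i, e in enumerate(PAULIS)}

# Conjugation table for H: H P H^{-1} = i^dk * P'.
# HXH = Z, HZH = X, HYH = -Y (a sign of -1 is i^2), HIH = I.
H_CONJ = {"I": (0, "I"), "X": (0, "Z"), "Y": (2, "Y"), "Z": (0, "X")}

def conj_by_h(element):
    k, p = element
    dk, p2 = H_CONJ[p]
    return ((k + dk) % 4, p2)

# Position i of the permutation holds the label of H p_i H^{-1}.
perm = [INDEX[conj_by_h(e)] for e in PAULIS]

print(sorted(perm) == list(range(16)))                    # True: a bijection
print(all(conj_by_h(conj_by_h(e)) == e for e in PAULIS))  # True: H^2 = I
```

&lt;p>Feeding such permutations for a full set of Clifford generators into Schreier&amp;ndash;Sims machinery is what makes the group amenable to computation.&lt;/p>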
&lt;p>This is of course an expensive representation of the Clifford group,
seeing as elements of $C_n$ need to be represented as arrays of
$4^{n+1}$ machine integers. We can easily reduce this to $2^{2n+1}-2$
by exploiting symmetries: conjugation respects phases, so it suffices
to track the signed non-identity Paulis $\pm p$, of which there are
$2(4^n-1)=2^{2n+1}-2$. In quantum computing, $n$ is not very large
anyway, typically $n&amp;lt;10$.&lt;/p>
&lt;p>This logic is implemented in &lt;a href="https://github.com/quil-lang/quilc/blob/master/src/clifford/perm.lisp">QUILC&lt;/a>. After downloading QUILC and CL-PERMUTATION, one can experiment.&lt;/p>
&lt;pre tabindex="0">&lt;code>$ sbcl
&amp;gt; (ql:quickload &amp;quot;cl-quil&amp;quot;)
&amp;gt; (in-package #:cl-quil.clifford)
&amp;gt; (defvar C4 (clifford-group-as-perm-group 4))
C4
&amp;gt; (perm:group-order C4)
12128668876800
&amp;gt; (perm:random-group-element C4)
#&amp;lt;PERM 277 278 381 382 103 104 445 446 168 167 192 191 469 470
472 471 205 206 166 165 431 432 102 101 368 367 280 279
14 13 142 141 407 408 496 495 230 229 303 304 37 38 77
78 344 343 342 341 64 63 39 40 318 317 231 232 510 509
405 406 127 128 352 351 118 117 30 29 264 263 221 222
455 456 416 415 182 181 183 184 429 430 453 454 207 208
261 262 15 16 119 120 365 366 493 494 247 248 144 143
389 390 79 80 325 326 302 301 55 56 53 54 287 288 327
328 94 93 392 391 158 157 245 246 480 479 209 210 451
452 428 427 185 186 364 363 121 122 18 17 260 259 266
265 28 27 115 116 354 353 180 179 417 418 457 458 219
220 91 92 329 330 290 289 52 51 481 482 244 243 156 155
393 394 388 387 146 145 250 249 492 491 57 58 300 299
324 323 81 82 433 434 164 163 204 203 474 473 11 12 282
281 369 370 99 100 105 106 379 380 276 275 1 2 467 468
193 194 170 169 443 444 316 315 42 41 66 65 340 339 129
130 403 404 507 508 233 234 228 227 497 498 409 410 140
139 345 346 76 75 35 36 306 305 211 212 450 449 425 426
187 188 361 362 123 124 20 19 257 258 268 267 25 26 114
113 356 355 177 178 419 420 459 460 218 217 90 89 331
332 292 291 49 50 483 484 241 242 153 154 395 396 385
386 148 147 252 251 489 490 59 60 297 298 321 322 83 84
435 436 161 162 201 202 476 475 10 9 284 283 371 372 98
97 107 108 378 377 273 274 3 4 466 465 195 196 172 171
442 441 313 314 44 43 68 67 337 338 131 132 402 401 506
505 235 236 225 226 499 500 411 412 137 138 347 348 73
74 34 33 308 307 5 6 271 272 375 376 109 110 439 440
174 173 198 197 463 464 478 477 199 200 160 159 437 438
96 95 374 373 286 285 8 7 136 135 413 414 502 501 224
223 309 310 31 32 71 72 350 349 336 335 70 69 45 46 312
311 237 238 504 503 399 400 133 134 358 357 112 111 24
23 270 269 215 216 461 462 422 421 176 175 189 190 423
424 447 448 213 214 255 256 21 22 125 126 359 360 487
488 253 254 150 149 383 384 85 86 319 320 296 295 61 62
47 48 293 294 333 334 88 87 398 397 152 151 239 240 486
485&amp;gt;
&amp;gt; (perm:to-cycles * :canonicalizep nil)
(#&amp;lt;CYCLE (292 291)*&amp;gt;
#&amp;lt;CYCLE (173 393 174 394)*&amp;gt;
#&amp;lt;CYCLE (411 286 217 193 164 289 331 371)*&amp;gt;
#&amp;lt;CYCLE (122 157 219 170 243 409 374 138)*&amp;gt;
#&amp;lt;CYCLE (412 285 218 194 163 290 332 372)*&amp;gt;
#&amp;lt;CYCLE (121 158 220 169 244 410 373 137)*&amp;gt;
#&amp;lt;CYCLE (459 422 223 316 322 162 330 283)*&amp;gt;
#&amp;lt;CYCLE (92 120 391 439 237 233 403 160)*&amp;gt;
#&amp;lt;CYCLE (460 421 224 315 321 161 329 284)*&amp;gt;
#&amp;lt;CYCLE (91 119 392 440 238 234 404 159)*&amp;gt;
#&amp;lt;CYCLE (461 176 387 375 347 172 155 457)*&amp;gt;
#&amp;lt;CYCLE (59 405 437 312 60 406 438 311)*&amp;gt;
#&amp;lt;CYCLE (462 175 388 376 348 171 156 458)*&amp;gt;
#&amp;lt;CYCLE (54 317 83 453 270 258 449 112)*&amp;gt;
#&amp;lt;CYCLE (467 447 358 338 377 73 455 215)*&amp;gt;
#&amp;lt;CYCLE (53 318 84 454 269 257 450 111)*&amp;gt;
#&amp;lt;CYCLE (468 448 357 337 378 74 456 216)*&amp;gt;
#&amp;lt;CYCLE (32 141 260 426)*&amp;gt;
#&amp;lt;CYCLE (477 359 131 428 72 222 444 400)*&amp;gt;
#&amp;lt;CYCLE (31 142 259 425)*&amp;gt;
#&amp;lt;CYCLE (478 360 132 427 71 221 443 399)*&amp;gt;
#&amp;lt;CYCLE (27 280 178 145 28 279 177 146)*&amp;gt;
#&amp;lt;CYCLE (487 85 207 105 325 476 126 479)*&amp;gt;
#&amp;lt;CYCLE (26 367 225 42 38 229 340 274)*&amp;gt;
#&amp;lt;CYCLE (488 86 208 106 326 475 125 480)*&amp;gt;
#&amp;lt;CYCLE (25 368 226 41 37 230 339 273)*&amp;gt;
#&amp;lt;CYCLE (489 319 435 45 344 465 423 309)*&amp;gt;
#&amp;lt;CYCLE (24 101 389 109 55 231 129 451)*&amp;gt;
#&amp;lt;CYCLE (490 320 436 46 343 466 424 310)*&amp;gt;
#&amp;lt;CYCLE (23 102 390 110 56 232 130 452)*&amp;gt;
#&amp;lt;CYCLE (491 296 484 149 354 43 77 182)*&amp;gt;
#&amp;lt;CYCLE (22 432 335 107 302 396 197 474)*&amp;gt;
#&amp;lt;CYCLE (492 295 483 150 353 44 78 181)*&amp;gt;
#&amp;lt;CYCLE (21 431 336 108 301 395 198 473)*&amp;gt;
#&amp;lt;CYCLE (493 61 127 209 379 34 408 95)*&amp;gt;
#&amp;lt;CYCLE (20 165 52 40 304 386 272 267)*&amp;gt;
#&amp;lt;CYCLE (494 62 128 210 380 33 407 96)*&amp;gt;
#&amp;lt;CYCLE (19 166 51 39 303 385 271 268)*&amp;gt;
#&amp;lt;CYCLE (497 293 49 64 351 313 297 241)*&amp;gt;
#&amp;lt;CYCLE (18 206 100 143 266 124 246 139)*&amp;gt;
#&amp;lt;CYCLE (498 294 50 63 352 314 298 242)*&amp;gt;
#&amp;lt;CYCLE (17 205 99 144 265 123 245 140)*&amp;gt;
#&amp;lt;CYCLE (499 333 98 248 346 196 203 369)*&amp;gt;
#&amp;lt;CYCLE (16 471 255 211 276 113 287 90)*&amp;gt;
#&amp;lt;CYCLE (500 334 97 247 345 195 204 370)*&amp;gt;
#&amp;lt;CYCLE (15 472 256 212 275 114 288 89)*&amp;gt;
#&amp;lt;CYCLE (501 88 262 188 323 201 282 420)*&amp;gt;
#&amp;lt;CYCLE (12 191 433 70 263 361 402 200)*&amp;gt;
#&amp;lt;CYCLE (502 87 261 187 324 202 281 419)*&amp;gt;
#&amp;lt;CYCLE (11 192 434 69 264 362 401 199)*&amp;gt;
#&amp;lt;CYCLE (503 398 464 190 82 430 349 442)*&amp;gt;
#&amp;lt;CYCLE (10 167 481 253 306 147 115 327)*&amp;gt;
#&amp;lt;CYCLE (504 397 463 189 81 429 350 441)*&amp;gt;
#&amp;lt;CYCLE (9 168 482 254 305 148 116 328)*&amp;gt;
#&amp;lt;CYCLE (505 152 179 250 75 416 135 364)*&amp;gt;
#&amp;lt;CYCLE (8 446 134 186 299 153 417 413)*&amp;gt;
#&amp;lt;CYCLE (506 151 180 249 76 415 136 363)*&amp;gt;
#&amp;lt;CYCLE (7 445 133 185 300 154 418 414)*&amp;gt;
#&amp;lt;CYCLE (507 239 228 65 118 93 365 235)*&amp;gt;
#&amp;lt;CYCLE (4 382 307 252 36 495 47 342)*&amp;gt;
#&amp;lt;CYCLE (508 240 227 66 117 94 366 236)*&amp;gt;
#&amp;lt;CYCLE (3 381 308 251 35 496 48 341)*&amp;gt;
#&amp;lt;CYCLE (509 486 384 6 104 80 184 58)*&amp;gt;
#&amp;lt;CYCLE (2 278 355 68 29 14 470 214)*&amp;gt;
#&amp;lt;CYCLE (510 485 383 5 103 79 183 57)*&amp;gt;
#&amp;lt;CYCLE (1 277 356 67 30 13 469 213)*&amp;gt;)
&lt;/code>&lt;/pre>&lt;p>The cool thing is that these operations all happen basically
instantaneously.&lt;/p>
&lt;p>P.S. This insight is easily achieved by asking an LLM. Who
needs mathematicians anymore? :-)&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>Unfortunately, like a lot of mathematical software born in and around the 1980s, it became its own bespoke, awful, underspecified, imperative programming language with a variety of algebraic APIs. The system can be learned about from its &lt;a href="https://www.gap-system.org">website&lt;/a>.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>The best way to advertise a programming language</title><link>http://www.stylewarning.com/posts/write-programs/</link><pubDate>Sat, 05 Jul 2025 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/write-programs/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;p>TL;DR: The best way to advertise your favorite programming language is by
writing programs. The more useful the program is to a wider audience,
the better an advertisement it will be.&lt;/p>
&lt;p>I&amp;rsquo;m a Common Lisp/&lt;a href="https://coalton-lang.github.io">Coalton&lt;/a>
programmer. Lisp has been fashionable at least twice in its history,
and it&amp;rsquo;s no coincidence that those were times when people were
writing Lisp code to actually get work done. It&amp;rsquo;s not fashionable
today, but it does have a not-insignificant following and a stable
ecosystem. Because Lisp isn&amp;rsquo;t fashionable, some of its programmers
have felt an impetus to sell Lisp to the wider audience of
programmers. The Lisp sales pitches range from the reasonable (&amp;ldquo;Lisp
is fast, flexible, and stable.&amp;quot;) to the misleading (&amp;ldquo;Lisp can express
every programming paradigm easily.&amp;quot;) to the outright bizarre (&amp;ldquo;Lisp is
your gateway to a higher intellectual plane.&amp;quot;). I do think there&amp;rsquo;s
still room to make an interesting and convincing pitch for Lisp in
2025&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>, but any pitch will eventually be rebuffed with:&lt;/p>
&lt;blockquote>
&lt;p>If Lisp is so good, why isn&amp;rsquo;t everybody using it? Where are all the
programs written in Lisp?&lt;/p>
&lt;/blockquote>
&lt;p>The common counter-riposte is something like &amp;ldquo;popularity ≠ quality&amp;rdquo;, a
platitude if there ever was one.&lt;/p>
&lt;p>I&amp;rsquo;m saving a deeper discussion about Lisp specifically for a different
post&amp;mdash;one that&amp;rsquo;s still brewing and is at 3,700 words and
counting. But Lisp isn&amp;rsquo;t alone here, and it&amp;rsquo;s not even the most
defended language in certain corners of the internet. Haskell is
another language whose advocacy meets the same fate. Haskell, for all
of its fantastical progress in its 35 years
of existence, for its sizable group of staunch disciples, and for its
amazing compiler, only maintains some 0.4%&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup> of GitHub&amp;rsquo;s active
users.&lt;/p>
&lt;p>This post isn&amp;rsquo;t intended to be about Common Lisp or Haskell
specifically, but they are useful specimens for inquiry. Both Common
Lisp and Haskell have existed for a technological eternity, and I
think it&amp;rsquo;s reasonable to examine the question, &amp;ldquo;where are all the
programs?&amp;rdquo; For this exercise, I&amp;rsquo;ll look at GitHub&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>. I want to
look for all projects which satisfy the following four criteria:&lt;/p>
&lt;ol>
&lt;li>The project should have no fewer than 10% of the stars of the top
repository. For Common Lisp that threshold is 1,000 and for Haskell
it&amp;rsquo;s 3,800.&lt;/li>
&lt;li>The project should represent a software product whose users don&amp;rsquo;t
need to know the language it&amp;rsquo;s written in. That means, among other
things, no libraries.&lt;/li>
&lt;li>The project should represent something realistically useful and not
experimental in nature. That means, among other things, no
operating systems, obscure programming languages, etc.&lt;/li>
&lt;li>No defunct, archived, or archaeological projects. For example,
Reddit 1.0 was written in Lisp but it&amp;rsquo;s not used anywhere anymore.&lt;/li>
&lt;/ol>
&lt;p>Here&amp;rsquo;s what I got for Common Lisp:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Nyxt&lt;/strong>, a keyboard-driven graphical web browser over WebKit.&lt;/li>
&lt;li>&lt;strong>pgloader&lt;/strong>, a PostgreSQL migration tool.&lt;/li>
&lt;li>&lt;strong>Trial&lt;/strong>, a game engine in Common Lisp. (This breaks my Rule #2;
however, because Trial was used to ship the video game
&lt;a href="https://store.steampowered.com/app/1261430/Kandria/">Kandria&lt;/a> on
Steam, I&amp;rsquo;ll use it as a proxy for Kandria.)&lt;/li>
&lt;li>&lt;strong>Maxima&lt;/strong>, a computer algebra system. (Maxima isn&amp;rsquo;t actually on
GitHub, so consider it listed solely on my grace to make this list
look a little less pitiful.)&lt;/li>
&lt;/ol>
&lt;p>Here&amp;rsquo;s what I got for Haskell:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>pandoc&lt;/strong>, the universal markup converter.&lt;/li>
&lt;li>&lt;strong>ShellCheck&lt;/strong>, a static analyzer for shell scripts.&lt;/li>
&lt;li>&lt;strong>PostgREST&lt;/strong>, a REST API for PostgreSQL databases.&lt;/li>
&lt;li>&lt;strong>hadolint&lt;/strong>, a Dockerfile linter.&lt;/li>
&lt;li>&lt;strong>PureScript&lt;/strong>/&lt;strong>Elm&lt;/strong>, programming languages adjacent to
JavaScript. (These barely skate by my Rule #2.)&lt;/li>
&lt;li>&lt;strong>Unison&lt;/strong>/&lt;strong>Carp&lt;/strong>, more programming languages. (These barely skate by my
Rule #3.)&lt;/li>
&lt;li>&lt;strong>xmonad&lt;/strong>/&lt;strong>kmonad&lt;/strong>, a window/keyboard manager.&lt;/li>
&lt;li>&lt;strong>duckling&lt;/strong>, an engine that parses text into structured data.&lt;/li>
&lt;/ol>
&lt;p>For the sake of comparison, for Python:&lt;/p>
&lt;ul>
&lt;li>The top repository has 362,000 stars.&lt;/li>
&lt;li>&lt;strong>youtube-dl&lt;/strong>/&lt;strong>yt-dlp&lt;/strong> is the top &amp;ldquo;normie&amp;rdquo; program. This is
extremely important because such a program could have been written
in &lt;em>any&lt;/em> language &lt;em>easily&lt;/em>.&lt;/li>
&lt;li>There are 11 pages of results that exceed my self-imposed 10%
threshold of 36,200 stars.&lt;/li>
&lt;/ul>
&lt;p>And for Zig, whose market share is a mere rounding error of Python&amp;rsquo;s,
almost all of the projects in the top listing are actual programs
people can use, such as &lt;strong>Bun&lt;/strong>, &lt;strong>Ghostty&lt;/strong>, &lt;strong>Tigerbeetle&lt;/strong>,
&lt;strong>Lightpanda&lt;/strong>, and &lt;strong>dockerc&lt;/strong>. The first page of Zig&amp;rsquo;s results alone
has far more programs than the whole of Common Lisp&amp;rsquo;s
corpus under consideration. In contrast to Haskell, it&amp;rsquo;s not full of linters and
compilers, which appear to be a Haskell programmer&amp;rsquo;s favorite.&lt;/p>
&lt;p>GitHub and stars are an imperfect indicator. Why shouldn&amp;rsquo;t my terminal
program for simulating Conway&amp;rsquo;s Game of Life be counted among the
corpus of Lisp programs? It&amp;rsquo;s neither discoverable nor
useful&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>. It was a proof-of-concept hack that was never brought
to a logical conclusion. I feel similar sentiments toward the
multitude of other extremely niche programs.&lt;/p>
&lt;p>All of this is to say that there really is a dearth of programs that
one can easily find in languages like Lisp or Haskell, which means
Lisp and Haskell are relegated to being spoken about exclusively in
terms of their hypothetical&amp;mdash;or perhaps historical or
mathematical&amp;mdash;benefits. The idea of a &lt;em>practical&lt;/em> benefit is not one
that&amp;rsquo;s simply ergonomic or realizable&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup>, but one that was
observed or is gained in actual practice.&lt;/p>
&lt;p>Why would real programs used by other people even be a good
advertisement? Shipping software means you had to cross the finish
line. It means that the entirety of the software development process
had to be realized, not just the intellectually stimulating 80%
part. Even bigger than this, though, is a sense of practicing what you
preach. If you&amp;rsquo;re on Twitter slamming software written in Go and
advocating for Haskell instead, but having no interesting or useful
Haskell programs to show, then why would anybody believe you? If
Haskell were so good, why aren&amp;rsquo;t there more programs written in it?&lt;/p>
&lt;p>If you want to convince other people to use your favorite programming
language, you should first convince yourself to write programs in it.&lt;/p>
&lt;p>Language evangelism, adoption, and popularity are complex. It&amp;rsquo;s
difficult to compete with marketing, Google budgets, institutionalized
education, and so on. Writing programs won&amp;rsquo;t guarantee you&amp;rsquo;ll rope
anybody into writing in your favorite language. But, all else equal,
as an individual interested in promoting your language, it&amp;rsquo;s probably
your best shot.&lt;/p>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>Such a pitch would have to include a lot of things people care about in 2025, such as the developer experience, performance, and the truly can&amp;rsquo;t-be-done-elsewhere things illuminated by superlative examples.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>Data taken from &lt;a href="https://www.benfrederickson.com/ranking-programming-languages-by-github-users/">Ben Frederickson&amp;rsquo;s blog&lt;/a>.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>Two points about GitHub. First, it&amp;rsquo;s indisputably the most popular place for open-source projects, but it&amp;rsquo;s not the only place. Sourceforge, of all places, still hosts software that&amp;rsquo;s used today. Second, if the code isn&amp;rsquo;t open source, who cares what language it&amp;rsquo;s written in? The only times the choice of a programming language for a closed source project matters is if (1) you want some evidence that the software is secure (e.g., software written in C probably has vulnerabilities), (2) you want a job working on such a project (e.g., trading software in OCaml at Jane Street), or (3) you want the warm and fuzzies that a company put their chips in on it.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>Compare to &lt;a href="https://golly.sourceforge.io">Golly&lt;/a>, which is an amazing program that&amp;rsquo;s also distributed on both Apple&amp;rsquo;s App Store and Google&amp;rsquo;s Play Store.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>Annoyingly, &amp;ldquo;practical&amp;rdquo; also means &amp;ldquo;feasible&amp;rdquo; or &amp;ldquo;possible&amp;rdquo;, and theoreticians lean into this definition too much.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>A tutorial quantum interpreter in 150 lines of Lisp</title><link>http://www.stylewarning.com/posts/quantum-interpreter/</link><pubDate>Sun, 16 Jul 2023 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/quantum-interpreter/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;p>&lt;em>Simulating a universal, gate-based quantum computer on a classical
computer has many uses and benefits. The top benefit is the ability to
inspect the amplitudes of the system&amp;rsquo;s state directly. However, while
the mathematics is very well understood, implementing a
general-purpose simulator has largely been folk knowledge. In this
tutorial, we show how to build an interpreter for a general-purpose
quantum programming language called $\mathscr{L}$, capable of
executing most kinds of quantum circuits found in the literature. It is
presented economically, allowing its implementation to take fewer than
150 lines of self-contained Common Lisp code. The language
$\mathscr{L}$ is very simple to extend, making the interpreter ripe
for testing different kinds of behavior, such as noise models.&lt;/em>&lt;/p>
&lt;div>
&lt;hr>
&lt;h2>Contents&lt;/h2>
&lt;nav id="TableOfContents">
&lt;ol>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#a-note-about-common-lisp">A note about Common Lisp&lt;/a>&lt;/li>
&lt;li>&lt;a href="#a-note-to-experienced-quantum-computing-practitioners">A note to experienced quantum computing practitioners&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#the-language-mathscrl">The Language $\mathscr{L}$&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-quantum-state">The Quantum State&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#where-does-one-qubit-live">Where does one qubit live?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#many-qubits">Many qubits&lt;/a>&lt;/li>
&lt;li>&lt;a href="#bit-string-notation-and-a-general-quantum-state">Bit-String notation and a general quantum state&lt;/a>&lt;/li>
&lt;li>&lt;a href="#evolving-the-quantum-state">Evolving the quantum state&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#measurement">Measurement&lt;/a>&lt;/li>
&lt;li>&lt;a href="#gates">Gates&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#gates-as-matrices">Gates as matrices&lt;/a>&lt;/li>
&lt;li>&lt;a href="#gates-on-multi-qubit-machines">Gates on multi-qubit machines&lt;/a>&lt;/li>
&lt;li>&lt;a href="#single-qubit-gates-and-gates-on-adjacent-qubits">Single-qubit gates and gates on adjacent qubits&lt;/a>&lt;/li>
&lt;li>&lt;a href="#multi-qubit-gates-on-non-adjacent-qubits">Multi-qubit gates on non-adjacent qubits&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#an-interpreter">An interpreter&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#the-driver-loop">The driver loop&lt;/a>&lt;/li>
&lt;li>&lt;a href="#efficiency">Efficiency&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#examples">Examples&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#bell-state">Bell state&lt;/a>&lt;/li>
&lt;li>&lt;a href="#greenberger--horne--zeilinger-state">Greenberger&amp;ndash;Horne&amp;ndash;Zeilinger state&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-quantum-fourier-transform">The quantum Fourier transform&lt;/a>&lt;/li>
&lt;li>&lt;a href="#example-transcript">Example transcript&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#source-code">Source code&lt;/a>&lt;/li>
&lt;li>&lt;a href="#ports-in-other-languages">Ports in other languages&lt;/a>&lt;/li>
&lt;/ol>
&lt;/nav>
&lt;hr>
&lt;/div>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Simulating the workings of an ideal quantum computer has many
important applications, such as algorithms research and quantum
program debugging. A variety of quantum computer simulators exist,
both free and commercial. However, while the concept of the simulation
of quantum computers is generally well understood at a high level, the
devil is in the details when it comes to implementation.&lt;/p>
&lt;p>Quantum computer simulators found in the wild often have many
limitations. The most prevalent limitation is the number of qubits an
operator can act on. Usually, one-qubit gates and controlled
one-qubit&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> gates are allowed, but nothing more. While these
together are sufficient for universal quantum computation, they leave
much to be desired when studying quantum algorithms.&lt;/p>
&lt;p>In this post, we present an implementation of a fully general quantum
programming language interpreter, allowing measurement as well as
arbitrary unitary operators on an arbitrary number of arbitrarily
indexed qubits. The implementation weighs in at under 150 lines&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>
of code in Common Lisp, though the ideas make implementation simple in
other languages as well. All of the code from this tutorial can be
found on
&lt;a href="https://github.com/stylewarning/quantum-interpreter">GitHub&lt;/a>.&lt;/p>
&lt;p>This tutorial is aimed at a quantum computing beginner who has some
familiarity with the fundamentals of linear algebra and computer
programming. Beyond those subjects, this tutorial is relatively
self-contained. We also aim this tutorial at practitioners of quantum
computing who are interested in the brass tacks of simulation, with
all of the details filled out. To such practitioners, the bulk of this
document will be easy to skim, since we recapitulate topics such as
qubits and unitary operators.&lt;/p>
&lt;h3 id="a-note-about-common-lisp">A note about Common Lisp&lt;/h3>
&lt;p>We use Common Lisp, because it is an excellent platform for both
exploratory and high-performance computing. One of the fastest and
most flexible quantum simulators out there, the &lt;a href="https://github.com/quil-lang/qvm">Quantum Virtual
Machine&lt;/a>, is written entirely in
Common Lisp.&lt;/p>
&lt;p>We wrote this article so that it would be easy to follow along with a
Common Lisp implementation. The code has no dependencies, and should
work in any ANSI-compliant implementation (I hope).&lt;/p>
&lt;p>With that said, this article was also written with portability in
mind. Since no especially Lisp-like features are used, the code should
be easy to port to Python or even C. At minimum, your language should
support complex numbers and arrays.&lt;/p>
&lt;h3 id="a-note-to-experienced-quantum-computing-practitioners">A note to experienced quantum computing practitioners&lt;/h3>
&lt;p>&lt;em>This section is written for experienced practitioners of quantum
computing who happened upon this post, and can be skipped.&lt;/em>&lt;/p>
&lt;p>In this post, we opt to simulate a quantum circuit the &amp;ldquo;Schrodinger&amp;rdquo;
way, that is, by evolving a wavefunction explicitly. For a circuit of
width $n$, we walk through the mathematics of how to interpret a
$k$-qubit gate $g \in \mathsf{SU}(2^k)$ for $k\le n$, specified to act
on a $k$-tuple of numbered qubits&amp;mdash;corresponding to each qubit&amp;rsquo;s
position in the tensor product which forms the Hilbert space of the
system&amp;mdash;as a full operator $g'\in\mathsf{SU}(2^n)$. We do this by
providing an explicit construction of the matrix in the computational
basis of the system.&lt;/p>
&lt;p>An alternative approach would have been to describe the action of a
$g$ on an $n$-qubit wavefunction by way of careful manipulation of
indexes, i.e., to effectively permute and partition our wavefunction
into $2^{n-k}$ groups of $2^k$-dimensional vectors corresponding to
the subsystem of qubits being operated on. The major benefit of this
approach is efficiency.&lt;/p>
&lt;p>As a first introduction for a computer science graduate, I find this
explanation lacking in two ways:&lt;/p>
&lt;ol>
&lt;li>It under-emphasizes that a gate like $\mathsf{CNOT}$, typically
written as a $4\times 4$ matrix $\mathsf{I}\oplus\mathsf{X}$, in a
quantum circuit truly is a linear operator on the Hilbert space of
the entire system. &amp;ldquo;It&amp;rsquo;s just linear algebra; here&amp;rsquo;s the matrix and
here&amp;rsquo;s the vector&amp;rdquo; is a point I want to drive home.&lt;/li>
&lt;li>It requires significant labor to explain and to prove the
correctness of the method to a reader without significant experience in
tensor algebra, contractions, Einstein notation, and so on.&lt;/li>
&lt;/ol>
&lt;p>The approach of this post can be used as a basis to follow up with
more efficient techniques, without relinquishing a strong mathematical
foundation. We are very careful to not be hand-wavy, and to not
conflate the different vector spaces at play. We hope that you&amp;rsquo;ll find
this approach agreeable, even if it sacrifices some efficiency.&lt;/p>
&lt;h2 id="the-language-mathscrl">The Language $\mathscr{L}$&lt;/h2>
&lt;p>We wish to construct an interpreter for a small quantum programming
language named $\mathscr{L}$. This language supports
both of the fundamental operations of a quantum computer: gates and
measurements.&lt;/p>
&lt;p>A &lt;strong>gate&lt;/strong> is an operation that modifies a quantum state. (What a
quantum state is exactly we will delve into later.) Because quantum
states are large compared to the physical resources used to construct
them, gates represent the &amp;ldquo;powerful&amp;rdquo; operations of a quantum
computer.&lt;/p>
&lt;p>A &lt;strong>measurement&lt;/strong> is an observation and collapse of the quantum state,
producing one bit (i.e., $0$ or $1$) of classical information per
qubit. Measurements represent the &lt;em>only&lt;/em> way in which one can extract
information from our simulated quantum computer, and indeed, in most
programming models for real quantum computers.&lt;/p>
&lt;p>In some sense, one might think of the language $\mathscr{L}$ as the
simplest non-trivial quantum programming language. A program in
$\mathscr{L}$ is just a sequence of gates and measurements. The syntax
is as follows:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:left">Non-Terminal&lt;/th>
&lt;th style="text-align:right">&lt;/th>
&lt;th style="text-align:left">Defintion&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:left">program&lt;/td>
&lt;td style="text-align:right">:=&lt;/td>
&lt;td style="text-align:left">&lt;code>(&lt;/code> &lt;em>instruction&lt;/em>* &lt;code>)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">instruction&lt;/td>
&lt;td style="text-align:right">:=&lt;/td>
&lt;td style="text-align:left">&lt;code>(&lt;/code> &lt;code>GATE&lt;/code> &lt;em>matrix&lt;/em> &lt;em>qubit&lt;/em>+ &lt;code>)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">&lt;/td>
&lt;td style="text-align:right">|&lt;/td>
&lt;td style="text-align:left">&lt;code>(&lt;/code> &lt;code>MEASURE&lt;/code> &lt;code>)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">matrix&lt;/td>
&lt;td style="text-align:right">:=&lt;/td>
&lt;td style="text-align:left">&lt;em>a complex matrix&lt;/em> &lt;code>#2A(&lt;/code> &amp;hellip; &lt;code>)&lt;/code>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:left">qubit&lt;/td>
&lt;td style="text-align:right">:=&lt;/td>
&lt;td style="text-align:left">&lt;em>a non-negative integer&lt;/em>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Spaces and newlines are ignored, except to delimit the tokens of our
language.&lt;/p>
&lt;p>We borrow Common Lisp&amp;rsquo;s two-dimensional array syntax for the syntax of
matrices. In Common Lisp, the matrix $\left(\begin{smallmatrix}1 &amp;amp;
2\\3 &amp;amp; 4\end{smallmatrix}\right)$ is written &lt;code>#2A((1 2) (3 4))&lt;/code>. We
also borrow the syntax for complex numbers: $1-2i$ is written &lt;code>#C(1 -2)&lt;/code>.&lt;/p>
&lt;p>An example program might be one to construct and subsequently measure
two qubits labeled &lt;code>2&lt;/code> and &lt;code>5&lt;/code> in a Bell state configuration:&lt;/p>
&lt;pre tabindex="0">&lt;code>(
(GATE #2A((0.70710677 0.70710677) (0.70710677 -0.70710677)) 2)
(GATE #2A((1 0 0 0) (0 1 0 0) (0 0 0 1) (0 0 1 0)) 2 5)
(MEASURE)
)
&lt;/code>&lt;/pre>&lt;p>We will model the semantics of $\mathscr{L}$ operationally, by way of an &lt;strong>abstract machine&lt;/strong>. The abstract machine for $\mathscr{L}$ is called $M_n$, where $n$ is a positive but fixed&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup> number of qubits. The state of the machine $M_n$ is the pair $(v, b)$ where $v$ is a quantum state, and $b$ is an $n$-bit measurement register.&lt;/p>
&lt;p>The quantum state is an element of the set&lt;/p>
&lt;p>$$\{v\in\mathbb{C}^{2^n}\mid \Vert v\Vert=1\}.$$&lt;/p>
&lt;p>In other words, $v$ is a unit vector of dimension $2^n$ over the
complex numbers. We will discuss this from first principles in the
&lt;a href="#the-quantum-state">next section&lt;/a>.&lt;/p>
&lt;p>The measurement register is an element of the set $\{0,1\}^n$, i.e.,
a sequence of $n$ bits, which we realize as a non-negative
integer. The $k$th least-significant bit of this integer represents
the last observation of the qubit numbered as $k$. We will &lt;a href="#measurement">discuss
this in detail&lt;/a> as well.&lt;/p>
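To make the register convention concrete, here is a small helper in Python (the post later notes the code ports easily to Python; this helper is an illustration of the bit layout, not part of the interpreter). Reading qubit $k$'s last observation is a shift and a mask.

```python
def register_bit(register, k):
    # The measurement register is a plain non-negative integer; the
    # k-th least-significant bit holds the last observation of qubit k.
    return (register >> k) & 1

# For example, register 0b101 records qubits 0 and 2 as last observed
# in state 1, and qubit 1 as last observed in state 0.
```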
&lt;p>In Common Lisp, it suffices to create a structure &lt;code>machine&lt;/code> which holds these two pieces of state.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defstruct machine
quantum-state
measurement-register)
&lt;/code>&lt;/pre>&lt;p>Typically, the machine is initialized with each classical bit in the
measurement register $0$, and each qubit starting in the
zero-state. (However, for the purposes of algorithm study or
debugging, the machine may be initialized with any valid state.)&lt;/p>
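For readers porting as they go, a direct Python rendering of the same two-field machine might look like the following (a sketch under my own naming, not the post's code):

```python
from dataclasses import dataclass

@dataclass
class Machine:
    quantum_state: list        # 2^n complex amplitudes
    measurement_register: int  # n-bit register as a non-negative integer

def make_machine(n):
    # All classical bits start at 0, and all qubits start in the
    # zero-state: amplitude 1 on |0...0> and 0 everywhere else.
    state = [0j] * (2 ** n)
    state[0] = 1 + 0j
    return Machine(quantum_state=state, measurement_register=0)
```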
&lt;p>The precise way in which the language $\mathscr{L}$ is interpreted on
$M_n$ is what we describe in this tutorial. Before that, however, we
find it most important to describe what &lt;em>exactly&lt;/em> a quantum state is,
and how to represent it on a computer.&lt;/p>
&lt;h2 id="the-quantum-state">The Quantum State&lt;/h2>
&lt;h3 id="where-does-one-qubit-live">Where does one qubit live?&lt;/h3>
&lt;p>Quantum computers are usually just a collection of interacting computational elements called &lt;strong>qubits&lt;/strong>. A single qubit has two distinguished states: $\ket{0}$ and $\ket{1}$. If the qubit has a name like $q$, then we label the states $\ket{0}_q$ and $\ket{1}_q$.&lt;/p>
&lt;p>The funny notation is called &lt;strong>Dirac notation&lt;/strong> or &lt;strong>braket notation&lt;/strong>. It happens to be a convenient notation for doing calculations in quantum mechanics, and we just use it for consistency with other texts. The &lt;strong>ket&lt;/strong> $\ket{\cdots}$, as a physicist would call it, doesn&amp;rsquo;t add any special significance, except to denote that the quantity is a vector. One can actually put &lt;em>anything&lt;/em> inside the brackets. In usual linear algebra, one often writes $\mathbf{e}_i$ to denote a basis vector, where in quantum mechanics, one just writes the subscript in a ket $\ket{i}$, dropping the $\mathbf{e}$ entirely. If the notation throws you off, and you&amp;rsquo;d like to think in more traditional written linear algebra notation, you can always replace $\ket{x}$ with $\vec x$, and you&amp;rsquo;ll be safe.&lt;/p>
&lt;p>These distinguished states $\ket{0}$ and $\ket{1}$ are understood to be orthonormal basis vectors in a vector space whose scalars are complex numbers $\mathbb{C}$. As such, a qubit can be $\ket{0}$, $\ket{1}$, or a &lt;strong>superposition&lt;/strong> $\alpha\ket 0 + \beta\ket 1$, where $\alpha$ and $\beta$ are complex numbers. The numbers $\alpha$ and $\beta$ are called &lt;strong>probability amplitudes&lt;/strong>, because $\vert\alpha\vert^2$ (resp. $\vert\beta\vert^2$) represent the probability of the qubit being observed in the $\ket 0$ (resp. $\ket 1$) state. Since they represent probabilities, there&amp;rsquo;s an additional constraint, namely that the probabilities add to one: $\vert\alpha\vert^2 + \vert\beta\vert^2=1$.&lt;/p>
&lt;p>To those unfamiliar, it may not be obvious why we&amp;rsquo;ve opted to use the
language of linear algebra. Why do we consider a qubit as being a
linear combination? Why do we suppose that the observable states are
orthonormal vectors? Why can&amp;rsquo;t we simply say that a qubit is just a
pair of complex numbers and move on?&lt;/p>
&lt;p>The reason for this is scientific, and not mathematical. It turns out that the best theory of quantum mechanics we have is one which describes transformations between states as linear. In fact, the evolution of a quantum mechanical system is described by an operation that is not only linear, but also reversible and length-preserving. These conditions&amp;mdash;linearity, reversibility, and length preservation&amp;mdash;give rise to a special class of transformations called &lt;strong>unitary operators&lt;/strong>, which naturally lead us to the discussion of vector spaces over complex numbers&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>We will discuss the nature of these operations in more depth when we consider how to implement gates &lt;a href="#gates">later on&lt;/a>. For now, however, it&amp;rsquo;s sufficient to think of a qubit named $q$ as something that lives in a complex, two-dimensional vector space, which we will call $$B_q := \operatorname{span}_{\mathbb{C}}\{\ket 0_q, \ket 1_q\}.$$ (We will use this $B_q$ notation a few times throughout this tutorial. Remember it!) We also understand that this space is equipped&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup> with a way to calculate lengths of vectors&amp;mdash;the usual norm&lt;/p>
&lt;p>$$
\left\Vert\alpha\ket{0}+\beta\ket{1}\right\Vert = \sqrt{\vert\alpha\vert^2+\vert\beta\vert^2}.
$$&lt;/p>
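As a quick numerical check (in Python, with the amplitudes of the equal superposition $\frac{1}{\sqrt 2}(\ket 0 + \ket 1)$ chosen purely for illustration), the norm constraint is easy to verify:

```python
import math

# Amplitudes of the equal superposition (1/sqrt(2))(|0> + |1>).
alpha = beta = 1 / math.sqrt(2)

# |alpha|^2 and |beta|^2 are the observation probabilities,
prob0 = abs(alpha) ** 2
prob1 = abs(beta) ** 2

# and the norm sqrt(|alpha|^2 + |beta|^2) comes out to 1.
norm = math.sqrt(prob0 + prob1)
```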
&lt;h3 id="many-qubits">Many qubits&lt;/h3>
&lt;p>Roughly speaking, a single qubit can be described by two
probability amplitudes. How do we deal with more qubits?&lt;/p>
&lt;p>Suppose we have two qubits named $X$ and $Y$. As a pair, quantum
mechanics tells us that they can &lt;em>interact&lt;/em>. Practically, what
that means is that their states can be correlated in some way. If
they&amp;rsquo;ve interacted, knowing information about $X$ might give us a clue
about what $Y$ might be. One well-known example of this is the
&lt;em>Bell state&lt;/em>, which can be summarized as follows:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">Qubit $X$&lt;/th>
&lt;th style="text-align:center">Qubit $Y$&lt;/th>
&lt;th style="text-align:center">Prob. Amp.&lt;/th>
&lt;th style="text-align:center">Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">$\ket 0_X$&lt;/td>
&lt;td style="text-align:center">$\ket 0_Y$&lt;/td>
&lt;td style="text-align:center">$1/\sqrt{2}$&lt;/td>
&lt;td style="text-align:center">$50\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 0_X$&lt;/td>
&lt;td style="text-align:center">$\ket 1_Y$&lt;/td>
&lt;td style="text-align:center">$0$&lt;/td>
&lt;td style="text-align:center">$0\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 1_X$&lt;/td>
&lt;td style="text-align:center">$\ket 0_Y$&lt;/td>
&lt;td style="text-align:center">$0$&lt;/td>
&lt;td style="text-align:center">$0\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 1_X$&lt;/td>
&lt;td style="text-align:center">$\ket 1_Y$&lt;/td>
&lt;td style="text-align:center">$1/\sqrt{2}$&lt;/td>
&lt;td style="text-align:center">$50\%$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Here, we have an example of a &lt;strong>non-factorizable state&lt;/strong>; the states of qubits $X$ and $Y$ are correlated. If we know $X$ is in the $\ket 0_X$ state, then we &lt;em>necessarily&lt;/em> know that $Y$ is in the $\ket 0_Y$ state. Such a correlation means it&amp;rsquo;s not possible to express the probabilities independently. It might be tempting to simply think of $X$ as having a $50\%$ probability of being in either basis state, and $Y$ likewise&amp;mdash;facts which are certainly true&amp;mdash;but treating those probabilities as independent would give us a &lt;em>different&lt;/em> distribution of probabilities for the system:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">Qubit $X$&lt;/th>
&lt;th style="text-align:center">Qubit $Y$&lt;/th>
&lt;th style="text-align:center">Probability&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">$\ket 0_X$&lt;/td>
&lt;td style="text-align:center">$\ket 0_Y$&lt;/td>
&lt;td style="text-align:center">$P(X=\ket 0_X)P(Y=\ket 0_Y)=25\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 0_X$&lt;/td>
&lt;td style="text-align:center">$\ket 1_Y$&lt;/td>
&lt;td style="text-align:center">$P(X=\ket 0_X)P(Y=\ket 1_Y)=25\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 1_X$&lt;/td>
&lt;td style="text-align:center">$\ket 0_Y$&lt;/td>
&lt;td style="text-align:center">$P(X=\ket 1_X)P(Y=\ket 0_Y)=25\%$&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">$\ket 1_X$&lt;/td>
&lt;td style="text-align:center">$\ket 1_Y$&lt;/td>
&lt;td style="text-align:center">$P(X=\ket 1_X)P(Y=\ket 1_Y)=25\%$&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>This state is called &lt;strong>factorizable&lt;/strong> because we can express each
probability as a product of probabilities pertaining to the original
qubits, i.e., each probability has a form that looks like
$P(X)P(Y)$. Note that here, knowing something about $X$ gives us &lt;em>no&lt;/em>
information about $Y$, since they&amp;rsquo;re completely independent. With that
said, it should be emphasized that factorizable states &lt;em>are&lt;/em> perfectly
valid states, but they don&amp;rsquo;t represent the entirety of possible
states.&lt;/p>
&lt;p>If qubits $X$ and $Y$ live in the linear spaces $B_X$ and $B_Y$ respectively, then the composite space is written $B_X\otimes B_Y$. This is called a &lt;strong>tensor product&lt;/strong>, which is a way to combine &lt;em>spaces&lt;/em> with the above structure. Formally, if we have an $m$-dimensional vector space $V:=\operatorname{span}\{v_1,\ldots,v_m\}$ and an $n$-dimensional vector space $W:=\operatorname{span}\{w_1,\ldots,w_n\}$, then their tensor product $T:=V\otimes W$ will be an $mn$-dimensional vector space $\operatorname{span}\{t_1,\ldots,t_{mn}\}$, where each $t_i$ is a formal combination of basis vectors from $V$ and $W$. (There are of course $mn$ different combinations of $v$&amp;rsquo;s and $w$&amp;rsquo;s.) To give an example without all the abstraction, consider $V$ with a basis $\{\vec x, \vec y, \vec z\}$ and $W$ with a basis $\{\vec p, \vec q\}$. Then $V\otimes W$ will have a basis&lt;/p>
&lt;p>$$
\left\{
\begin{array}{lll}
\vec x\otimes\vec p, &amp;amp; \vec y\otimes\vec p, &amp;amp; \vec z\otimes\vec p, \\
\vec x\otimes\vec q, &amp;amp; \vec y\otimes\vec q, &amp;amp; \vec z\otimes\vec q\hphantom{,}
\end{array}
\right\}.
$$&lt;/p>
&lt;p>An example vector in the space $V\otimes W$ might be&lt;/p>
&lt;p>$$
-i(\vec x\otimes\vec p) - 2(\vec y\otimes\vec p) + 3 (\vec z\otimes\vec p) +
\frac{1}{4}(\vec x\otimes\vec q) - \sqrt{5}(\vec y\otimes\vec q) + e^{6\pi}(\vec z\otimes\vec q),
$$&lt;/p>
&lt;p>assuming these vector spaces are over $\mathbb{C}$.&lt;/p>
&lt;p>Intuitively, a tensor product &amp;ldquo;just&amp;rdquo; gives us a way to associate a number with each possible combination of basis vectors. In our case, we need to associate a probability amplitude with each combination of distinguished qubit basis states. We need this ability since&amp;mdash;as we&amp;rsquo;ve established&amp;mdash;we need to consider every possible holistic outcome of a collection of qubits, as opposed to the outcomes of the qubits independently. (The former constitute both factorizable and non-factorizable states, while the latter only include factorizable states.)&lt;/p>
&lt;h3 id="bit-string-notation-and-a-general-quantum-state">Bit-String notation and a general quantum state&lt;/h3>
&lt;p>If we have qubits $X$, $Y$, and $Z$, then they&amp;rsquo;ll live in the space $B_X\otimes B_Y\otimes B_Z$, which we&amp;rsquo;ll call $Q_3$. It will be massively inconvenient to write the basis vectors as, for example, $\ket 0_X\otimes \ket 1_Y\otimes\ket 1_Z$, so we instead use the shorthand $\ket{011}$ when the space has been defined. This is called &lt;strong>bit-string notation&lt;/strong>. A general element $\ket\psi$ of $Q_3$ can be written $$\psi_0\ket{000}+\psi_1\ket{001}+\psi_2\ket{010}+\psi_3\ket{011}+\psi_4\ket{100}+\psi_5\ket{101}+\psi_6\ket{110}+\psi_7\ket{111}.$$ There are two substantial benefits from using bit-string notation. These benefits are much more thoroughly explained in &lt;a href="https://arxiv.org/abs/1711.02086">this paper&lt;/a>&amp;mdash;which was a precursor to this very blog post.&lt;/p>
&lt;p>The first benefit is that the names of the qubits&amp;mdash;$X$, $Y$, and $Z$&amp;mdash;have been abstracted away. They&amp;rsquo;re now just positions in a bit-string, and we can canonically name the qubits according to their position. We record positions &lt;em>from the right starting from zero&lt;/em>, so $X$ is in position $2$, $Y$ is in position $1$, and $Z$ is in position $0$.&lt;/p>
&lt;p>The second benefit is one relevant to how we implement quantum states on a computer. As written, the probability amplitude $\psi_i$ has an index $i$ whose binary expansion matches the bit-string of the basis vector whose scalar component is $\psi_i$. This is no accident. The main outcome of this is that we can use a non-negative integer as a way of specifying a bit-string, which also acts as an index into an array of probability amplitudes. So for instance, the above state can be written further compactly as $$\ket\psi=\sum_{i=0}^7\psi_i\ket i.$$ Here, $\ket i$ refers to the $i$th bit-string in lexicographic (&amp;ldquo;dictionary&amp;rdquo;) order, or equivalently, the binary expansion of $i$ as a bit-string.&lt;/p>
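The index-to-bit-string correspondence takes one line of Python (a hypothetical helper, named here for illustration):

```python
def basis_label(i, n):
    # The i-th lexicographic bit-string of length n, i.e. the binary
    # expansion of i: index 5 in a 3-qubit system names |101>.
    return format(i, "0{}b".format(n))
```

So `basis_label(5, 3)` gives `"101"`, and reading the positions from the right starting at zero recovers each qubit's basis state.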
&lt;p>Since qubits live in a two-dimensional space, then $n$ qubits will live in a $2^n$-dimensional space. With a great deal of work, we&amp;rsquo;ve come to our most general&lt;sup id="fnref:6">&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref">6&lt;/a>&lt;/sup> representation of an $n$-qubit system: $$\sum_{i=0}^{2^n-1}\psi_i\ket i,$$ where $\vert\psi_i\vert^2$ gives us the probability of observing the bit-string $\ket i$, implying $$\sum_{i=0}^{2^n-1}\vert\psi_i\vert^2=1.$$&lt;/p>
&lt;p>On a computer, representing a quantum state for an $n$-qubit system is simple: It&amp;rsquo;s just an array of $2^n$ complex numbers. An index $i$ into the array represents the probability amplitude $\psi_i$, which is the scalar component of $\ket{i}$. So, for instance, the state $\ket{000}$ in a 3-qubit system is represented by an array whose first element is $1$ and the rest $0$. Here is a function to allocate a new quantum state of $n$ qubits, initialized to be in the $\ket{\ldots 000}$ state:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun make-quantum-state (n)
(let ((s (make-array (expt 2 n) :initial-element 0.0d0)))
(setf (aref s 0) 1.0d0)
s))
&lt;/code>&lt;/pre>&lt;p>Sometimes, given a quantum state, or even an operator on a quantum
state, we will want to recover how many qubits the state represents,
or the operator acts on. In both cases, the question reduces to
determining the number of qubits that a dimension represents. Since
our dimensions are always powers of two, we need to compute the
equivalent of a binary logarithm. In Common Lisp, we can compute this
by counting the number of bits it takes to represent an integer, using
&lt;code>integer-length&lt;/code>. The number $2^n$ is always a &lt;code>1&lt;/code> followed by $n$
&lt;code>0&lt;/code>&amp;rsquo;s, so the length of $2^n$ in binary is $n+1$.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun dimension-qubits (d)
(1- (integer-length d)))
&lt;/code>&lt;/pre>&lt;h3 id="evolving-the-quantum-state">Evolving the quantum state&lt;/h3>
&lt;p>Since the quantum state is a vector, the principal way we change it is
through linear operators represented as matrices. As our quantum
program executes, we say that the quantum state
&lt;em>evolves&lt;/em>. Matrix&amp;ndash;vector multiplication is accomplished with
&lt;code>apply-operator&lt;/code> and matrix&amp;ndash;matrix multiplication is accomplished
with &lt;code>compose-operators&lt;/code>. There is nothing special about these
functions; they are the standard textbook algorithms.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun apply-operator (matrix column)
(let* ((matrix-size (array-dimension matrix 0))
(result (make-array matrix-size :initial-element 0.0d0)))
(dotimes (i matrix-size)
(let ((element 0))
(dotimes (j matrix-size)
(incf element (* (aref matrix i j) (aref column j))))
(setf (aref result i) element)))
(replace column result)))
(defun compose-operators (A B)
(destructuring-bind (m n) (array-dimensions A)
(let* ((l (array-dimension B 1))
(result (make-array (list m l) :initial-element 0)))
(dotimes (i m result)
(dotimes (k l)
(dotimes (j n)
(incf (aref result i k)
(* (aref A i j)
(aref B j k)))))))))
&lt;/code>&lt;/pre>&lt;p>These functions will sit at the heart of the interpreter, which will
be elaborated upon in &lt;a href="#gates">the section about gates&lt;/a>.&lt;/p>
&lt;h2 id="measurement">Measurement&lt;/h2>
&lt;p>Already, through the construction of our quantum state, we&amp;rsquo;ve
discussed the idea that the probability amplitudes imply a probability
of observing a state. Measurement then amounts to looking at a quantum
state as a discrete probability distribution and sampling from it.&lt;/p>
&lt;p>Measurement in quantum mechanics is side-effectful; observation of a
quantum state also simultaneously &lt;em>collapses&lt;/em> that state. This means
that when we measure a state to be a bit-string, then the state will
also &lt;em>become&lt;/em> that bit-string, zeroing out every other component in the
process.&lt;/p>
&lt;p>We thus implement the process of measurement in two steps: The
sampling of the state followed by its collapse.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun observe (machine)
(let ((b (sample (machine-quantum-state machine))))
(collapse (machine-quantum-state machine) b)
(setf (machine-measurement-register machine) b)
machine))
&lt;/code>&lt;/pre>&lt;p>Note that we&amp;rsquo;ve recorded our observation into the measurement register. We now proceed to define what we mean by &lt;code>sample&lt;/code> and &lt;code>collapse&lt;/code>.&lt;/p>
&lt;p>How shall we sample? This is a classic problem in computer science. If we have $N$ events $\{0, 1,\ldots,N-1\}$, such that event $e$ has probability $P(e)$, then we can sample as follows. Consider the partial sums defined by the recurrence $S(0)=0$ and $S(k)=S(k-1) + P(k-1)$. If we draw a random number $r$ uniformly from $[0,1)$, then we wish to find the $k$ such that $S(k)\leq r &amp;lt; S(k+1)$. Such a $k$ will be a sampling of our events according to the imposed probability distribution.&lt;/p>
&lt;p>We can implement this simply by computing successive partial sums, until our condition is satisfied. In fact, we can be a little bit more resourceful. We can find when $r-S(k+1)&amp;lt;0$, which amounts to successive updates $r\leftarrow r-P(k)$.&lt;/p>
&lt;p>With a quantum system, we have $P(\ket i) = \vert\psi_i\vert^2$, and the sampled $k$ is the bit-string $\ket k$ we find.&lt;/p>
&lt;p>Let&amp;rsquo;s do an example. Suppose we have a quantum state&lt;/p>
&lt;p>$$
\sqrt{0.2}\ket{00} - \sqrt{0.07}\ket{01} + \sqrt{0.6}\ket{10} + \sqrt{0.13}\ket{11}.
$$&lt;/p>
&lt;p>Then our discrete probability distribution is:&lt;/p>
&lt;p>$$
P(\ket{00}) = 0.2\qquad P(\ket{01}) = 0.07\qquad P(\ket{10}) = 0.6\qquad P(\ket{11}) = 0.13
$$&lt;/p>
&lt;p>Next, suppose we draw a random number $r = 0.2436$. We first check if $r &amp;lt; 0.2$. It&amp;rsquo;s not, so $\ket{00}$ is not our sample. Subtract it from $r$ to get $r = 0.0436$. Next check if $r &amp;lt; 0.07$. Yes, so our sample is $\ket{01}$. Pictorially, this looks like the following:&lt;/p>
&lt;div style="text-align: center;">
&lt;img
src="images/sample.svg"
alt="A process of selecting a random sample."
decoding="async"
/>
&lt;/div>
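&lt;p>The same subtraction procedure can be sketched in a few lines of Python (an illustrative aside only; the interpreter itself stays in Lisp):&lt;/p>

```python
# Sample from a discrete distribution by successively subtracting
# each probability from r until r goes negative.
def sample_index(probabilities, r):
    for k, p in enumerate(probabilities):
        r -= p
        if r < 0:
            return k

# The worked example above: r = 0.2436 selects event 1, i.e. |01>.
print(sample_index([0.2, 0.07, 0.6, 0.13], 0.2436))  # 1
```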
&lt;p>The implementation is straightforward:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun sample (state)
(let ((r (random 1.0d0)))
(dotimes (i (length state))
(decf r (expt (abs (aref state i)) 2))
(when (minusp r) (return i)))))
&lt;/code>&lt;/pre>&lt;p>Collapsing to $\ket k$ is simply zeroing out the array and setting $\psi_k$ to $1$.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun collapse (state basis-element)
(fill state 0.0d0)
(setf (aref state basis-element) 1.0d0))
&lt;/code>&lt;/pre>&lt;h2 id="gates">Gates&lt;/h2>
&lt;h3 id="gates-as-matrices">Gates as matrices&lt;/h3>
&lt;p>Gates are the meat of most quantum algorithms. They represent the
&amp;ldquo;hard work&amp;rdquo; a quantum computer does. As previously described, a gate
$g$ is a transformation that is linear, invertible, and
length-preserving.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Linear&lt;/strong>: $g(a\ket\psi+b\ket\phi)=ag(\ket\psi)+bg(\ket\phi)$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Invertible&lt;/strong>: There is always an operation $h$ that can cancel out the effect of $g$: $h(g(\ket\psi))=g(h(\ket\psi))=\ket\psi$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Length-Preserving&lt;/strong>: $\Vert g(\ket\psi)\Vert = \Vert\ket\psi\Vert$.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>These ideas are captured by an overarching idea called a &lt;strong>linear isometry&lt;/strong>, which comes from the Greek word &lt;em>isometria&lt;/em>, with &lt;em>isos&lt;/em> meaning &amp;ldquo;equal&amp;rdquo; and &lt;em>metria&lt;/em> meaning &amp;ldquo;measuring&amp;rdquo;. As with all linear transformations, we can write them out as a matrix with respect to a particular basis. Matrices representing linear isometries are called &lt;strong>unitary matrices&lt;/strong>&lt;sup id="fnref:7">&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref">7&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>The simplest gate is the identity, a gate which does nothing.&lt;/p>
&lt;p>$$
\mathsf{I} := \begin{pmatrix}
1 &amp;amp; 0\\
0 &amp;amp; 1
\end{pmatrix}
$$&lt;/p>
&lt;p>In Common Lisp, this would be defined as&lt;/p>
&lt;pre tabindex="0">&lt;code>(defparameter +I+ #2A((1 0)
(0 1)))
&lt;/code>&lt;/pre>&lt;p>which we will make use of later. Just a notch higher in complexity
would be the quantum analog of a Boolean &amp;ldquo;NOT&amp;rdquo;. This is called the
$\mathsf{X}$ gate:&lt;/p>
&lt;p>$$
\mathsf{X} := \begin{pmatrix}
0 &amp;amp; 1\\
1 &amp;amp; 0
\end{pmatrix}.
$$&lt;/p>
&lt;p>This has the effect of mapping $\mathsf{X}\ket 0=\ket 1$, which means
directly that $\mathsf{X}\ket 1=\ket 0$ and therefore it is its own
inverse: $\mathsf{X}\mathsf{X} = \mathsf{I}$ so $\mathsf{X}=\mathsf{X}^{-1}$.&lt;/p>
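&lt;p>This is easy to verify numerically. Here is a throwaway check in Python (the helper &lt;code>matmul&lt;/code> is ours, written just for this aside):&lt;/p>

```python
# Textbook matrix-matrix product on list-of-list square matrices.
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][j] * b[j][k] for j in range(n))
             for k in range(n)]
            for i in range(n)]

X = [[0, 1], [1, 0]]
I = [[1, 0], [0, 1]]

# X applied twice is the identity, so X is its own inverse.
assert matmul(X, X) == I
```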
&lt;p>We suggest re-reviewing how one interprets a matrix as an explicit
mapping of each element of the basis, as it helps make sense of
gates. In this tutorial, gate matrices are always specified in terms
of the bit-string basis&lt;/p>
&lt;p>$$
\{\ket{\ldots000}, \ket{\ldots001}, \ket{\ldots010}, \ket{\ldots011}, \ldots\}.
$$&lt;/p>
&lt;p>We again refer the reader to &lt;a href="https://arxiv.org/abs/1711.02086">this
paper&lt;/a> for an in-depth discussion
about this basis.&lt;/p>
&lt;p>In the rest of this section, the whole goal is to be able to apply
gates to our quantum state. There are two cases of pedagogical and
operational interest: the one-qubit gate and the many-qubit gate. We
will write two functions to accomplish each of these, in order to
implement a general function called &lt;code>apply-gate&lt;/code> for applying any kind
of gate on any collection of qubits for any quantum state.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun apply-gate (state U qubits)
(assert (= (length qubits) (dimension-qubits (array-dimension U 0))))
(if (= 1 (length qubits))
(%apply-1Q-gate state U (first qubits))
(%apply-nQ-gate state U qubits)))
&lt;/code>&lt;/pre>&lt;h3 id="gates-on-multi-qubit-machines">Gates on multi-qubit machines&lt;/h3>
&lt;p>If we are working with the machine $M_n$, then our space is $2^n$-dimensional, and as such, our matrices would be written out as $2^n\times 2^n$ arrays of numbers. If we can write out such a matrix, then applying it is as simple as a matrix&amp;ndash;vector multiplication. For instance, for a $4$-qubit machine, an $\mathsf{X}$ on qubit $0$ would be written&lt;/p>
&lt;p>$$
\begin{pmatrix}
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0
\end{pmatrix},
$$&lt;/p>
&lt;p>which could be readily applied to a $16$-element quantum state vector. It is easy to verify that this will swap the components of $\ket{\ldots 0}$ with the corresponding components of $\ket{\ldots 1}$.&lt;/p>
&lt;p>But as should be plainly obvious from the obnoxious amount of paper wasted by writing out this matrix, it would be better if we could simply generate this matrix with just three pieces of information: the gate matrix $g=\left(\begin{smallmatrix}0 &amp;amp; 1\\1 &amp;amp; 0\end{smallmatrix}\right)$, the qubit index $i=0$, and the size of the machine $n=4$. This is a process we will call &lt;strong>lifting&lt;/strong>.&lt;/p>
&lt;p>Lifting requires a fundamental tool for constructing operators on spaces that were formed out of tensor products. If we have two finite-dimensional vector spaces $U$ and $V$, and operators $f$ and $g$ on the spaces respectively, then it seems reasonable to consider how $f$ and $g$ transform $U\otimes V$. In some sense, applying $f$ and $g$ &amp;ldquo;in parallel&amp;rdquo; on $U\otimes V$ corresponds to a new linear operator $h$. If $f$ and $g$ are matrices, then $h$ is defined by a &lt;em>block matrix&lt;/em>&lt;/p>
&lt;p>$$
\begin{equation}
h_{i,j} = f_{i,j} g.
\label{eq:kron}
\end{equation}
$$&lt;/p>
&lt;p>More specifically, let $0 \leq i,j &amp;lt; \dim U$. The matrix $h$ will be
an array of $\dim U \times \dim U$ copies of $g$, where the entries of
the $(i,j)$th block are multiplied by the single
scalar $f_{i,j}$. This will lead to a matrix with $(\dim U)(\dim V)$
rows and columns, which is exactly the dimension of $U\otimes
V$. Incidentally, we write $h$ as $f\otimes g$, and this combination
of operators is called the &lt;strong>Kronecker product&lt;/strong>&lt;sup id="fnref:8">&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref">8&lt;/a>&lt;/sup>. As code:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun kronecker-multiply (A B)
(destructuring-bind (m n) (array-dimensions A)
(destructuring-bind (p q) (array-dimensions B)
(let ((result (make-array (list (* m p) (* n q)))))
(dotimes (i m result)
(dotimes (j n)
(let ((Aij (aref A i j))
(y (* i p))
(x (* j q)))
(dotimes (u p)
(dotimes (v q)
(setf (aref result (+ y u) (+ x v))
(* Aij (aref B u v))))))))))))
&lt;/code>&lt;/pre>&lt;p>&lt;em>As a matter of terminology, remember that tensor products combine
vector spaces, and Kronecker products combine operator matrices.&lt;/em>&lt;/p>
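&lt;p>As a sanity check on the block-matrix definition (sketched in Python with a hand-rolled &lt;code>kron&lt;/code>, purely for illustration): the $(i,j)$th block of $f\otimes g$ should be a copy of $g$ scaled by $f_{i,j}$, and the result should have $(\dim U)(\dim V)$ rows and columns.&lt;/p>

```python
# Kronecker product of two list-of-list matrices.
def kron(a, b):
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

f = [[1, 2],
     [3, 4]]
g = [[0, 5],
     [6, 7]]
h = kron(f, g)

# Entry (2*i + u, 2*j + v) of h is f[i][j] * g[u][v]: block (i, j)
# is g scaled by the scalar f[i][j].
assert all(h[2*i + u][2*j + v] == f[i][j] * g[u][v]
           for i in range(2) for j in range(2)
           for u in range(2) for v in range(2))

# 2 * 2 = 4 rows and columns, the dimension of the product space.
assert len(h) == 4 and all(len(row) == 4 for row in h)
```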
&lt;h3 id="single-qubit-gates-and-gates-on-adjacent-qubits">Single-qubit gates and gates on adjacent qubits&lt;/h3>
&lt;p>From here, we can very easily lift one-qubit gates to machines with
any number of qubits. A gate $g$ on qubit $i$ in an $n$-qubit machine
is just $g$ applied to qubit $i$ and the identity $\mathsf{I}$ on all
other qubits. Writing this out as a Kronecker product, we have&lt;/p>
&lt;p>$$
\begin{equation}
\operatorname{lift}(g, i, n) :=
\underbrace{\mathsf{I} \otimes \mathsf{I} \otimes \cdots}_{n-i-1\text{ factors}}
\otimes g \otimes
\underbrace{\cdots \otimes \mathsf{I}}_{i\text{ factors}},
\label{eq:liftone}
\end{equation}
$$&lt;/p>
&lt;p>where there are a total of $n$ factors, and $g$ is positioned $i$ factors from the right.&lt;/p>
&lt;p>This concept generalizes to higher-dimensional operators which act on &lt;em>index-adjacent qubits&lt;/em>. In other words, if $g$ is a $k$-qubit operator &lt;em>specifically&lt;/em> acting on qubits&lt;/p>
&lt;p>$$
(i+k-1, i+k-2, \ldots, i+2, i+1, i),
$$&lt;/p>
&lt;p>then the lifting operator from \eqref{eq:liftone} is much the same:&lt;/p>
&lt;p>$$
\begin{equation}
\operatorname{lift}(g, i, n) := \underbrace{\mathsf{I} \otimes \mathsf{I} \otimes \cdots}_{n-i-k\text{ factors}}
\otimes g \otimes
\underbrace{\cdots \otimes \mathsf{I}}_{i\text{ factors}}.
\label{eq:liftmany}
\end{equation}
$$&lt;/p>
&lt;p>It must be emphasized one last time: &lt;em>This only works for multi-qubit operators that act on qubits that are index-adjacent.&lt;/em> We will get to how to work with non-adjacent qubits shortly, but first we will turn this into code.&lt;/p>
&lt;p>For simplicity, we create a way to iterate a Kronecker product
multiple times, that is, compute&lt;/p>
&lt;p>$$
\underbrace{g\otimes \cdots \otimes g}_{n\text{ factors}},
$$&lt;/p>
&lt;p>which is usually simply written $g^{\otimes n}$. We must use care when
handling the case when we are &amp;ldquo;Kronecker exponentiating&amp;rdquo; by a
non-positive number, so that $f\otimes g^{\otimes 0} = f$.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun kronecker-expt (U n)
(cond
((&amp;lt; n 1) #2A((1)))
((= n 1) U)
(t (kronecker-multiply (kronecker-expt U (1- n)) U))))
&lt;/code>&lt;/pre>&lt;p>With &lt;code>kronecker-expt&lt;/code>, we can write &lt;code>lift&lt;/code> following \eqref{eq:liftmany}:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun lift (U i n)
(let ((left (kronecker-expt +I+ (- n i (dimension-qubits
(array-dimension U 0)))))
(right (kronecker-expt +I+ i)))
(kronecker-multiply left (kronecker-multiply U right))))
&lt;/code>&lt;/pre>&lt;h3 id="multi-qubit-gates-on-non-adjacent-qubits">Multi-qubit gates on non-adjacent qubits&lt;/h3>
&lt;p>In this section, we assume we are working on a multi-qubit machine
$M_n$ with $n\ge 2$.&lt;/p>
&lt;h4 id="the-general-idea">The general idea&lt;/h4>
&lt;p>So far, we&amp;rsquo;ve managed to get away with lifting operators that act on
either a single qubit, or a collection of index-adjacent qubits. This
has been more-or-less trivial, because we can tack on a series of
identity operators by way of Kronecker products to simulate &amp;ldquo;doing
nothing&amp;rdquo; to the other qubits. However, if we want to apply a
multi-qubit gate to a collection of qubits that aren&amp;rsquo;t index-adjacent,
we have to be a little more clever.&lt;/p>
&lt;p>The way we accomplish this is by swapping qubits around so that we can
move in and out of index-adjacency. In fact, for a given gate acting
on a given collection of qubits, we aim to compute an operator $\Pi$
which moves these qubits into index-adjacency, so that we can compute&lt;/p>
&lt;p>$$
\begin{equation}
\Pi^{-1} \operatorname{lift}(g, 0, n) \Pi.
\label{eq:upq}
\end{equation}
$$&lt;/p>
&lt;p>This recipe requires many ingredients, each of which we describe in
detail.&lt;/p>
&lt;h4 id="swapping-two-qubits">Swapping two qubits&lt;/h4>
&lt;p>To start, we need some way to swap the state of two qubits. We can do
this with the $\mathsf{SWAP}$ operator:&lt;/p>
&lt;p>$$
\mathsf{SWAP} := \begin{pmatrix}
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 1 &amp;amp; 0\\
0 &amp;amp; 1 &amp;amp; 0 &amp;amp; 0\\
0 &amp;amp; 0 &amp;amp; 0 &amp;amp; 1
\end{pmatrix}.
$$&lt;/p>
&lt;p>In Common Lisp, we define this in the same way we defined &lt;code>+I+&lt;/code>.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defparameter +SWAP+ #2A((1 0 0 0)
(0 0 1 0)
(0 1 0 0)
(0 0 0 1)))
&lt;/code>&lt;/pre>&lt;p>The $\mathsf{SWAP}$ operator takes two qubits and swaps their
state. What does this mean in a system of correlations, where qubit
state isn&amp;rsquo;t strictly compartmentalized (i.e., factorized)? Swapping is
equivalent to swapping the component of $\ket{01}$ with the component
of $\ket{10}$, which are the only two distinguishable
correlations&lt;sup id="fnref:9">&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref">9&lt;/a>&lt;/sup>. Still, in a multi-qubit system, we can&amp;rsquo;t
immediately swap two arbitrary qubits with the tools we&amp;rsquo;ve
developed. What we can do is swap index-adjacent qubits. In
particular, we can define the transpositions&lt;/p>
&lt;p>$$
\tau_i := \operatorname{lift}(\mathsf{SWAP}, i, n),\qquad \text{with }0\leq i &amp;lt; n - 1.
$$&lt;/p>
&lt;p>The transposition $\tau_i$ swaps qubit $i$ with qubit $i+1$. This is
our first ingredient.&lt;/p>
&lt;h4 id="re-arranging-qubits-to-be-index-adjacent">Re-arranging qubits to be index-adjacent&lt;/h4>
&lt;p>The second ingredient is a way to re-arrange our qubits so that they
are index-adjacent. Suppose we have a three-qubit operator $g$ which
acts on qubits $(2, 4, 3)$ in a machine of $n=5$ qubits. The space in
which the quantum state of $M_5$ lives is&lt;/p>
&lt;p>$$
B_4 \otimes B_3 \otimes B_2 \otimes B_1 \otimes B_0,
$$&lt;/p>
&lt;p>but we need to re-arrange our state vector as if we&amp;rsquo;ve moved $B_2\to
B_0$, $B_4\to B_1$, and $B_3\to B_2$ so that our sub-state sits
index-adjacent. In combinatorics, this permutation is written in
two-line notation&lt;/p>
&lt;p>$$
\begin{pmatrix}
0 &amp;amp; 1 &amp;amp; 2 &amp;amp; 3 &amp;amp; 4\\
3 &amp;amp; 4 &amp;amp; 0 &amp;amp; 2 &amp;amp; 1
\end{pmatrix}.
$$&lt;/p>
&lt;p>Here, we&amp;rsquo;ve made a few arbitrary decisions. First, we&amp;rsquo;ve decided to
re-map a $k$-qubit operator to the $B_{k-1}\otimes\cdots\otimes
B_1\otimes B_0$ subspace. Any other index-adjacent subspace would
work, but this simplifies the code. Second, we see that $0\mapsto 3$
and $1\mapsto 4$, but it doesn&amp;rsquo;t matter so much where they map to, as
long as $2$, $4$, and $3$ are mapped correctly.&lt;/p>
&lt;p>There&amp;rsquo;s no sense in writing the first line in two-line notation, so we
just write the permutation compactly as $34021$. As a quantum
operator, we write this as $\Pi_{34021}$.&lt;/p>
&lt;p>The question is: How can we write $\Pi_{34021}$ as familiar operators?
It is a well-known fact in combinatorics that any permutation can be
decomposed into a composition of swaps, and every swap can be
decomposed into a series of adjacent transpositions. We leave this as
an exercise&lt;sup id="fnref:10">&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref">10&lt;/a>&lt;/sup>, but we will show the code to our implementation.&lt;/p>
&lt;p>We start with a function which takes a permutation written as a list,
like &lt;code>(3 4 0 2 1)&lt;/code>, and converts it to a list of (possibly
non-adjacent) transpositions to be applied left-to-right, represented
as cons cells &lt;code>((0 . 3) (1 . 4) (2 . 3))&lt;/code>.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun permutation-to-transpositions (permutation)
(let ((swaps nil))
(dotimes (dest (length permutation) (nreverse swaps))
(let ((src (elt permutation dest)))
(loop :while (&amp;lt; src dest) :do
(setf src (elt permutation src)))
(cond
((&amp;lt; src dest) (push (cons src dest) swaps))
((&amp;gt; src dest) (push (cons dest src) swaps)))))))
&lt;/code>&lt;/pre>&lt;p>Next, we convert these transpositions, represented as cons cells, into
adjacent transposition indexes. This is straightforward. If we are swapping
$(a,b)$ with $a&amp;lt;b$, then we transpose $(a, a+1)$, then $(a+1, a+2)$,
and so on until $(b-1, b)$, followed by a reversal of each except
$(b-1, b)$. We can simply write this chain of adjacent transpositions
as $(a, a+1, \ldots, b-1, \ldots, a+1, a)$. In this example, we&amp;rsquo;d have
the transposition indexes &lt;code>(0 1 2 1 0 1 2 3 2 1 2)&lt;/code>.&lt;/p>
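&lt;p>We can sanity-check that chain (a quick illustrative script in Python, separate from the interpreter): applying the adjacent swaps &lt;code>(0 1 2 1 0 1 2 3 2 1 2)&lt;/code> to the identity arrangement should reproduce the permutation &lt;code>(3 4 0 2 1)&lt;/code>.&lt;/p>

```python
# Apply a chain of adjacent transpositions, left to right, to the
# identity arrangement 0..n-1.  Index i swaps positions i and i+1.
def apply_adjacent_transpositions(indexes, n):
    xs = list(range(n))
    for i in indexes:
        xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

chain = [0, 1, 2, 1, 0, 1, 2, 3, 2, 1, 2]
print(apply_adjacent_transpositions(chain, 5))  # [3, 4, 0, 2, 1]
```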
&lt;pre tabindex="0">&lt;code>(defun transpositions-to-adjacent-transpositions (transpositions)
(flet ((expand-cons (c)
(if (= 1 (- (cdr c) (car c)))
(list (car c))
(let ((trans (loop :for i :from (car c) :below (cdr c)
:collect i)))
(append trans (reverse (butlast trans)))))))
(mapcan #'expand-cons transpositions)))
&lt;/code>&lt;/pre>&lt;p>These are indexes $i_1, i_2, \ldots$ such that $\Pi = \cdots
\tau_{i_2}\tau_{i_1}$.&lt;/p>
&lt;p>The last ingredient we need is inverting $\Pi$. If we have $\Pi$
represented as a sequence of $\tau$, then we simply reverse the list
of $\tau$.&lt;/p>
&lt;h4 id="using-transpositions-to-implement-multi-qubit-gates">Using transpositions to implement multi-qubit gates&lt;/h4>
&lt;p>With all of these ingredients, we write what is perhaps the most important
function of our interpreter.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun %apply-nQ-gate (state U qubits)
(let ((n (dimension-qubits (length state))))
(labels ((swap (i)
(lift +swap+ i n))
(transpositions-to-operator (trans)
(reduce #'compose-operators trans :key #'swap)))
(let* ((U01 (lift U 0 n))
(from-space (append (reverse qubits)
(loop :for i :below n
:when (not (member i qubits))
:collect i)))
(trans (transpositions-to-adjacent-transpositions
(permutation-to-transpositions
from-space)))
(to-&amp;gt;from (transpositions-to-operator trans))
(from-&amp;gt;to (transpositions-to-operator (reverse trans)))
(Upq (compose-operators to-&amp;gt;from
(compose-operators U01
from-&amp;gt;to))))
(apply-operator Upq state)))))
&lt;/code>&lt;/pre>&lt;p>A few quick notes for comprehension:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The value of &lt;code>(swap i)&lt;/code> is $\tau_i$ fully lifted.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The one-line zinger that defines &lt;code>transpositions-to-operator&lt;/code> takes
a list of transposition indexes and converts it into a unitary
operator. It does so by doing what&amp;rsquo;s known in functional programming
as a &lt;em>map-reduce&lt;/em>, by first mapping $i\mapsto\tau_i$ and reducing by
operator composition.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The variable &lt;code>from-space&lt;/code> contains the permutation $p$ that encodes
the space in which we&amp;rsquo;d like to act. This permutation is calculated
based on the &lt;code>qubits&lt;/code> argument.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The variables &lt;code>from-&amp;gt;to&lt;/code> and &lt;code>to-&amp;gt;from&lt;/code> represent $\Pi_p$ and
$\Pi^{-1}_p$ respectively.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The variable &lt;code>Upq&lt;/code> is our fully lifted operator, exactly by way of
\eqref{eq:upq}.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The function &lt;code>%apply-nQ-gate&lt;/code> is what allows our interpreter to be so
general. Making the interpreter more efficient is ultimately an
exercise in making this function more efficient.&lt;/p>
&lt;p>The only thing left to do is integrate all of the topics discussed
hitherto into an interpreter!&lt;/p>
&lt;h2 id="an-interpreter">An interpreter&lt;/h2>
&lt;h3 id="the-driver-loop">The driver loop&lt;/h3>
&lt;p>The bulk of the interpreter has been written. We&amp;rsquo;ve described the
semantics of the two instructions of interest: &lt;code>MEASURE&lt;/code> and
&lt;code>GATE&lt;/code>. Now we create the interpreter itself, which is just a driver
loop to read and execute these instructions, causing state transitions
of our abstract machine. If we see a &lt;code>GATE&lt;/code>, we call &lt;code>apply-gate&lt;/code>. If
we see a &lt;code>MEASURE&lt;/code>, we call &lt;code>observe&lt;/code>.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun run-quantum-program (qprog machine)
(loop :for (instruction . payload) :in qprog
:do (ecase instruction
((GATE)
(destructuring-bind (gate &amp;amp;rest qubits) payload
(apply-gate (machine-quantum-state machine) gate qubits)))
((MEASURE)
(observe machine)))
:finally (return machine)))
&lt;/code>&lt;/pre>&lt;h3 id="efficiency">Efficiency&lt;/h3>
&lt;p>Performance-focused individuals will have noticed that this
interpreter is pretty costly in many ways. The biggest cost is also
unavoidable: The fact that our state grows exponentially with the
number of qubits. Real, physical quantum computers avoid this cost,
which makes them alluring machines to both study and construct.&lt;/p>
&lt;p>However, even with this unavoidable cost, this interpreter has been
implemented for ease of understanding and not machine
efficiency. Writing a faster interpreter amounts to avoiding the
construction of the lifted operator matrices. This can be done with
very careful index wrangling and sensitivity to data types and
allocation. This is how the high-performance &lt;a href="https://github.com/quil-lang/qvm">Quantum Virtual
Machine&lt;/a> is implemented.&lt;/p>
&lt;h2 id="examples">Examples&lt;/h2>
&lt;p>What good is writing an interpreter if we don&amp;rsquo;t write any programs
worth interpreting? Here are a few examples of programs.&lt;/p>
&lt;h3 id="bell-state">Bell state&lt;/h3>
&lt;p>The &lt;strong>Bell state&lt;/strong> is one we explored earlier. It is a
two-qubit state $$\frac{1}{\sqrt{2}}(\ket {00} + \ket {11}).$$ Here&amp;rsquo;s
a program to generate one, using two new gates, the &lt;strong>controlled-not
gate&lt;/strong> $\mathsf{CNOT}$ and the &lt;strong>Hadamard gate&lt;/strong> $\mathsf{H}$.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defparameter +H+ (make-array '(2 2) :initial-contents (let ((s (/ (sqrt 2))))
(list (list s s)
(list s (- s))))))
(defparameter +CNOT+ #2A((1 0 0 0)
                         (0 1 0 0)
                         (0 0 0 1)
                         (0 0 1 0)))
(defun bell (p q)
`((GATE ,+H+ ,p)
(GATE ,+CNOT+ ,p ,q)))
&lt;/code>&lt;/pre>&lt;h3 id="greenberger--horne--zeilinger-state">Greenberger&amp;ndash;Horne&amp;ndash;Zeilinger state&lt;/h3>
&lt;p>The &lt;strong>Greenberger&amp;ndash;Horne&amp;ndash;Zeilinger state&lt;/strong>, or &lt;strong>GHZ state&lt;/strong>, is a
generalization of the Bell state on more than two qubits, namely
$$\frac{1}{\sqrt{2}}(\ket{0\ldots 000} + \ket{1\ldots 111}).$$ This is
accomplished by executing a chain of controlled-not gates:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun ghz (n)
(cons `(GATE ,+H+ 0)
(loop :for q :below (1- n)
:collect `(GATE ,+CNOT+ ,q ,(1+ q)))))
&lt;/code>&lt;/pre>&lt;h3 id="the-quantum-fourier-transform">The quantum Fourier transform&lt;/h3>
&lt;p>The ordinary discrete Fourier transform of a complex vector is a
unitary operator, and as such, it can be encoded as a quantum
program. We will write a program which computes the Fourier transform
of the probability amplitudes of an input quantum state (a time-domain
signal), producing a new quantum state whose amplitudes represent
components in the frequency domain. This is the central subroutine to
Shor&amp;rsquo;s algorithm, a quantum algorithm that factors integers
faster than any known classical method.&lt;/p>
&lt;p>First, we will need a gate called the &lt;strong>controlled-phase gate&lt;/strong> $\mathsf{CPHASE}(\theta)$:&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun cphase (angle)
(make-array '(4 4) :initial-contents `((1 0 0 0)
(0 1 0 0)
(0 0 1 0)
(0 0 0 ,(cis angle)))))
&lt;/code>&lt;/pre>&lt;p>Now, we can generate the quantum Fourier transform recursively.&lt;/p>
&lt;pre tabindex="0">&lt;code>(defun qft (qubits)
(labels ((bit-reversal (qubits)
(let ((n (length qubits)))
(if (&amp;lt; n 2)
nil
(loop :repeat (floor n 2)
:for qs :in qubits
:for qe :in (reverse qubits)
:collect `(GATE ,+swap+ ,qs ,qe)))))
(%qft (qubits)
(destructuring-bind (q . qs) qubits
(if (null qs)
(list `(GATE ,+H+ ,q))
(let ((cR (loop :with n := (1+ (length qs))
:for i :from 1
:for qi :in qs
:for angle := (/ pi (expt 2 (- n i)))
:collect `(GATE ,(cphase angle) ,q ,qi))))
(append
(qft qs)
cR
(list `(GATE ,+H+ ,q))))))))
(append (%qft qubits) (bit-reversal qubits))))
&lt;/code>&lt;/pre>&lt;p>The program for a three-qubit quantum Fourier transform &lt;code>(qft '(0 1 2))&lt;/code> looks like this:&lt;/p>
&lt;pre tabindex="0">&lt;code>(
(GATE #2A((0.7071067811865475d0 0.7071067811865475d0) (0.7071067811865475d0 -0.7071067811865475d0)) 2)
(GATE #2A((1 0 0 0) (0 1 0 0) (0 0 1 0) (0 0 0 #C(0.0d0 1.0d0))) 1 2)
(GATE #2A((0.7071067811865475d0 0.7071067811865475d0) (0.7071067811865475d0 -0.7071067811865475d0)) 1)
(GATE #2A((1 0 0 0) (0 0 1 0) (0 1 0 0) (0 0 0 1)) 1 2)
(GATE #2A((1 0 0 0) (0 1 0 0) (0 0 1 0) (0 0 0 #C(0.7071067811865476d0 0.7071067811865475d0))) 0 1)
(GATE #2A((1 0 0 0) (0 1 0 0) (0 0 1 0) (0 0 0 #C(0.0d0 1.0d0))) 0 2)
(GATE #2A((0.7071067811865475d0 0.7071067811865475d0) (0.7071067811865475d0 -0.7071067811865475d0)) 0)
(GATE #2A((1 0 0 0) (0 0 1 0) (0 1 0 0) (0 0 0 1)) 0 2)
)
&lt;/code>&lt;/pre>&lt;p>(Recall that &lt;code>#C(0 1)&lt;/code> represents the complex number $i$.)&lt;/p>
&lt;p>We can see the quantum Fourier transform in action by computing the
Fourier transform of $\ket{000}$. Here is a transcript of this
calculation:&lt;/p>
&lt;pre tabindex="0">&lt;code>CL-USER&amp;gt; (run-quantum-program
(qft '(0 1 2))
(make-machine :quantum-state (make-quantum-state 3)
:measurement-register 0))
#S(MACHINE
:QUANTUM-STATE #(#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0))
:MEASUREMENT-REGISTER 0)
&lt;/code>&lt;/pre>&lt;p>Indeed, one can verify that the classical Fourier transform of the
vector $[1,0,0,0,0,0,0,0]$ is a vector with eight components equal to
about $0.35355$.&lt;/p>
&lt;pre tabindex="0">&lt;code>$ python
Python 2.7.16 (default, May 23 2023, 14:13:27)
[GCC 8.3.0] on linux2
Type &amp;quot;help&amp;quot;, &amp;quot;copyright&amp;quot;, &amp;quot;credits&amp;quot; or &amp;quot;license&amp;quot; for more information.
&amp;gt;&amp;gt;&amp;gt; import numpy as np
&amp;gt;&amp;gt;&amp;gt; np.fft.fft([1,0,0,0,0,0,0,0], norm=&amp;quot;ortho&amp;quot;)
array([0.35355339+0.j, 0.35355339+0.j, 0.35355339+0.j, 0.35355339+0.j,
0.35355339+0.j, 0.35355339+0.j, 0.35355339+0.j, 0.35355339+0.j])
&lt;/code>&lt;/pre>&lt;h3 id="example-transcript">Example transcript&lt;/h3>
&lt;p>Here is an example transcript of downloading and using this software,
using &lt;a href="https://www.sbcl.org/">Steel Bank Common Lisp&lt;/a>.&lt;/p>
&lt;pre tabindex="0">&lt;code>$ git clone https://github.com/stylewarning/quantum-interpreter.git
Cloning into 'quantum-interpreter'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (10/10), done.
Unpacking objects: 100% (10/10), done.
remote: Total 10 (delta 2), reused 5 (delta 0), pack-reused 0
$ cd quantum-interpreter/
$ sbcl --noinform
* (load &amp;quot;qsim.lisp&amp;quot;)
T
* (load &amp;quot;examples.lisp&amp;quot;)
T
* (run-quantum-program (bell 0 1)
(make-machine :quantum-state (make-quantum-state 2)
:measurement-register 0))
#S(MACHINE
:QUANTUM-STATE #(0.7071067690849304d0 0.0d0 0.0d0 0.7071067690849304d0)
:MEASUREMENT-REGISTER 0)
* (run-quantum-program (qft '(0 1 2))
(make-machine :quantum-state (make-quantum-state 3)
:measurement-register 0))
#S(MACHINE
:QUANTUM-STATE #(#C(0.3535533724408484d0 0.0d0) #C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0) #C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0) #C(0.3535533724408484d0 0.0d0)
#C(0.3535533724408484d0 0.0d0) #C(0.3535533724408484d0 0.0d0))
:MEASUREMENT-REGISTER 0)
* (defun flip-coin ()
(machine-measurement-register
(run-quantum-program
`((GATE ,+H+ 0) (MEASURE))
(make-machine :quantum-state (make-quantum-state 1)
:measurement-register 0))))
FLIP-COIN
* (loop :repeat 10 :collect (flip-coin))
(1 1 0 1 1 0 0 1 0 1)
* (quit)
&lt;/code>&lt;/pre>&lt;h2 id="source-code">Source code&lt;/h2>
&lt;p>The source code in this tutorial is published under the BSD 3-clause
license. The complete listing and most up-to-date source code can be
found on
&lt;a href="https://github.com/stylewarning/quantum-interpreter">GitHub&lt;/a>.&lt;/p>
&lt;h2 id="ports-in-other-languages">Ports in other languages&lt;/h2>
&lt;p>Others have written this quantum interpreter in other languages. Here&amp;rsquo;s a list
of ports that people have shared with me:&lt;/p>
&lt;ul>
&lt;li>Aistis Raulinaitis&amp;rsquo;s implementation in &lt;a href="https://github.com/sheganinans/QVM-ocaml-mini">OCaml&lt;/a>&lt;/li>
&lt;li>Graham Enos&amp;rsquo;s implementation in &lt;a href="https://github.com/genos/Workbench/tree/main/quantum-interpreter">Rust&lt;/a> with a &lt;a href="https://grahamenos.com/quantum-interpreter.html">write-up&lt;/a>&lt;/li>
&lt;li>Marco Rubin&amp;rsquo;s implementation in &lt;a href="https://gitlab.com/Rubo/qsim">Python&lt;/a>, no dependencies&lt;/li>
&lt;li>Francesco Morri&amp;rsquo;s implementation in &lt;a href="https://github.com/FrancescoMorri/quantum-interpreter">Python&lt;/a>&lt;/li>
&lt;/ul>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>A controlled one-qubit gate is a kind of two-qubit gate.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>It&amp;rsquo;s actually 124 SLOC, and it has &lt;em>not&lt;/em> been &amp;ldquo;code golfed&amp;rdquo;. If we wanted to make an even tinier quantum interpreter, we could&amp;mdash;but brevity for its own sake is not the point.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>With only a little bit of extra work, mostly bookkeeping, we could make $n$ finite but unbounded during the execution of a program by instead having a collection of so-called &lt;strong>quantum registers&lt;/strong>. These would be realized by a collection of $v$&amp;rsquo;s, which are opportunistically combined when entanglement occurs.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>For that matter, why complex numbers, and not just real-valued probabilities? The reason is that a complex number of unit norm can be written as $e^{i\theta}$, where $\theta$ is called the &lt;strong>phase&lt;/strong>. Phases are a wave-like property, and allow the complex probability amplitudes to &lt;em>interfere&lt;/em>. Interference is a known and understood phenomenon of quantum mechanical systems, and in fact is critical to the function of a quantum computer.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>Spaces with all of these properties, including a way to calculate distances, are called &lt;strong>Hilbert spaces&lt;/strong>.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:6" role="doc-endnote">
&lt;p>The fact of the matter is that we can actually get &lt;em>more&lt;/em> general by having classical probability distributions of these states, which leads one to so-called &amp;ldquo;density operators&amp;rdquo;. This is extremely useful when studying imperfect quantum computers which have noisy operations.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:7" role="doc-endnote">
&lt;p>While we won&amp;rsquo;t use this fact in our interpreter (though it would be useful for error checking), it is very easy to check whether a matrix $g$ is unitary. First, we compute another matrix $h$ which is the conjugate-transpose of $g$. The &lt;strong>conjugate-transpose&lt;/strong> of a matrix is just the transpose of a matrix with each complex entry conjugated. Once we have this matrix, we check that $hg$ is an identity matrix. The matrix $g$ is unitary if and only if $hg=gh=\mathsf{I}$.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:8" role="doc-endnote">
&lt;p>Unfortunately, the definition \eqref{eq:kron} seems somewhat arbitrary and out of nowhere. Fortunately, there is a much more &amp;ldquo;first principles&amp;rdquo; approach to understanding the tensor product and the Kronecker product, starting with how we map a &lt;em>pair&lt;/em> of vectors $v\in V$ and $w\in W$ to a vector $v\otimes w\in V\otimes W$. Such an approach is much more satisfying to a mathematician, and even essential to understanding the &amp;ldquo;true nature&amp;rdquo; of the tensor product, but perhaps less so to a curious implementer.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:9" role="doc-endnote">
&lt;p>There is no sense in moving $\ket{00}$ or $\ket{11}$ to accomplish a swap-like operation, since we identify each qubit&amp;rsquo;s respective $\ket 0$ identically, and each $\ket 1$ identically.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:10" role="doc-endnote">
&lt;p>If you&amp;rsquo;re not particularly keen to figure out the math yourself, you might consult Lemma 14.1 of &lt;a href="https://www.sfu.ca/~mdevos/notes/geom-sym/14_transpositions.pdf">these lecture notes&lt;/a>. You&amp;rsquo;re also welcome to just take my word for it!&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>Can a Rubik's Cube be brute-forced?</title><link>http://www.stylewarning.com/posts/brute-force-rubiks-cube/</link><pubDate>Fri, 07 Jul 2023 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/brute-force-rubiks-cube/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;div>
&lt;hr>
&lt;h2>Contents&lt;/h2>
&lt;nav id="TableOfContents">
&lt;ol>
&lt;li>&lt;a href="#introduction">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="#computer-puzzling-without-brute-force">Computer puzzling without brute-force&lt;/a>&lt;/li>
&lt;li>&lt;a href="#taking-a-step-back-puzzles-as-permutations">Taking a step back: puzzles as permutations&lt;/a>&lt;/li>
&lt;li>&lt;a href="#brute-force-still-ignorant-but-kinda-smart">Brute-force, still ignorant, but kinda smart?&lt;/a>
&lt;ol>
&lt;li>&lt;a href="#observation-1-decomposition-as-intersection">Observation #1: decomposition as intersection&lt;/a>&lt;/li>
&lt;li>&lt;a href="#observation-2-sorting-really-helps">Observation #2: sorting really helps!&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-a-move">What is a move?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#what-is-a-word">What is a word?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#observation-3-sorting-as-solving">Observation #3: sorting as solving&lt;/a>&lt;/li>
&lt;li>&lt;a href="#more-splitting">More splitting?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#iterating-through-products-with-schroeppel--shamir">Iterating through products with Schroeppel&amp;ndash;Shamir&lt;/a>&lt;/li>
&lt;li>&lt;a href="#permutation-tries">Permutation tries&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-4-list-algorithm-and-solving-the-rubiks-cube">The 4-List Algorithm and solving the Rubik&amp;rsquo;s Cube&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;li>&lt;a href="#example-and-source-code">Example and source code&lt;/a>&lt;/li>
&lt;li>&lt;a href="#tips-for-optimizing-the-4-list-algorithm">Tips for optimizing the 4-List Algorithm&lt;/a>&lt;/li>
&lt;li>&lt;a href="#sample-benchmarks">Sample benchmarks&lt;/a>&lt;/li>
&lt;li>&lt;a href="#conclusion">Conclusion&lt;/a>&lt;/li>
&lt;li>&lt;a href="#references">References&lt;/a>&lt;/li>
&lt;/ol>
&lt;/nav>
&lt;hr>
&lt;/div>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When I was about 13, while still a middle-schooler, I became
fascinated with the Rubik&amp;rsquo;s Cube&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>. I never got terribly good
at solving it, maybe eventually getting into the 30 to 40 seconds
range. While I didn&amp;rsquo;t have a penchant for memorizing move sequences, I
was drawn into how we &lt;em>find&lt;/em> these move sequences.&lt;/p>
&lt;p>The story about my interest and exploration in the Rubik&amp;rsquo;s Cube is for
another post. Long story short, I got interested in &amp;ldquo;computer
puzzling&amp;rdquo;&amp;mdash;using computers to manipulate combinatorial puzzles, like
the Rubik&amp;rsquo;s Cube, either to solve them quickly, to discover patterns,
or to find novel move sequences for use in speedcubing&amp;mdash;and ever
since, I&amp;rsquo;ve been working on different programs for solving Rubik-like
puzzles.&lt;/p>
&lt;p>Purely in principle, it shouldn&amp;rsquo;t be hard to solve a Rubik&amp;rsquo;s Cube with
a computer, right? Our program would have three parts:&lt;/p>
&lt;ol>
&lt;li>A model of the Rubik&amp;rsquo;s Cube, that is, some data structure that
represents a cube state.&lt;/li>
&lt;li>Some functions which can simulate turns of each side.&lt;/li>
&lt;li>A solving procedure which takes a scrambled cube, tries every
possible turn sequence, and stops when solved.&lt;/li>
&lt;/ol>
&lt;p>Truth be known, and details aside, this is a provably correct method
for solving a Rubik&amp;rsquo;s Cube. If you leave your computer on long enough,
it will return a solution.&lt;/p>
&lt;p>The problem is that it takes a long time. Probably longer than your
lifetime.&lt;/p>
&lt;h2 id="computer-puzzling-without-brute-force">Computer puzzling without brute-force&lt;/h2>
&lt;p>&amp;ldquo;Brute-force&amp;rdquo; generally means to try every possibility of something
without much of any strategy. Our method above is a brute-force
algorithm. Brute-force algorithms generally aren&amp;rsquo;t practical, because
if you have $N$ of something to explore, a brute-force algorithm will
take $O(N)$ time. For a Rubik&amp;rsquo;s Cube, $N$ is 43 quintillion&amp;mdash;a very
large number.&lt;/p>
&lt;p>It has been known, practically since the Rubik&amp;rsquo;s Cube&amp;rsquo;s inception,
that something else is needed to solve a Rubik&amp;rsquo;s Cube. Rubik&amp;rsquo;s Cube
solutions, obviously, take into account the specific structure and
properties of the cube so as to implicitly or explicitly avoid
mindless search. These methods have turned out to be:&lt;/p>
&lt;ol>
&lt;li>Solving methods for humans: memorize some sequences which let you
move only a few pieces around in isolation, and apply these
sequences mechanically until all pieces are in place. The more
sequences you memorize, the faster you&amp;rsquo;ll be.&lt;/li>
&lt;li>Heuristic tree search: do a tree search (with e.g.,
iterative-deepening depth-first search&lt;sup id="fnref:2">&lt;a href="#fn:2" class="footnote-ref" role="doc-noteref">2&lt;/a>&lt;/sup>), but aggressively
prune off branches by way of clever heuristics&lt;sup id="fnref:3">&lt;a href="#fn:3" class="footnote-ref" role="doc-noteref">3&lt;/a>&lt;/sup>.&lt;/li>
&lt;li>Phase-based solvers: a deeply mathematical way which involves
characterizing the Rubik&amp;rsquo;s Cube as a sequence of nested
(mathematical) subgroups so that each successive coset space is small
enough that it can be solved by computer.&lt;/li>
&lt;/ol>
&lt;p>Computer puzzling mostly deals with the latter two approaches, usually
in some combination. Both approaches lead to extraordinarily
high-performing solvers. For example:&lt;/p>
&lt;ul>
&lt;li>Korf&amp;rsquo;s algorithm (approach #2) finds optimal solutions&amp;mdash;solutions
of shortest length&amp;mdash;but can take hours to find one.&lt;/li>
&lt;li>Thistlethwaite&amp;rsquo;s algorithm (approach #3) solves a cube in four
phases almost instantaneously. The solutions are guaranteed to be no
longer than triple the optimal length.&lt;/li>
&lt;/ul>
&lt;p>The story may as well end here. We have slow but optimal ways of
solving the Rubik&amp;rsquo;s Cube, and fast but sub-optimal ways. Pick your
poison (sub-optimal or slow), depending on what you&amp;rsquo;re trying to
achieve.&lt;/p>
&lt;h2 id="taking-a-step-back-puzzles-as-permutations">Taking a step back: puzzles as permutations&lt;/h2>
&lt;p>It seems that any Rubik&amp;rsquo;s Cube solver &lt;em>has&lt;/em> to know &lt;em>something&lt;/em> about
the structure of the cube. It might be worth asking how little
structure we can get away with, so as to make whatever solving
algorithm we write generic over a broad class of puzzles.&lt;/p>
&lt;p>For a brute-force algorithm with tree search, we would need something
like the following:&lt;/p>
&lt;pre tabindex="0">&lt;code>interface GenericPuzzle:
type State
type Move
function isSolved(State) -&amp;gt; Boolean
function allMoves() -&amp;gt; List(Move)
function performMove(State, Move) -&amp;gt; State
&lt;/code>&lt;/pre>&lt;p>With this, we could write the following solver based off of
iterative-deepening depth-first search, which is totally generic on
the above interface.&lt;/p>
&lt;pre tabindex="0">&lt;code>function solve(State) -&amp;gt; List(Move)
function solve(p):
if isSolved(p):
return []
for maxDepth from 1 to infinity:
solved?, solution = dfs(0, maxDepth, p)
if solved?:
return solution
function dfs(Integer, Integer, State, List(Move)) -&amp;gt; (Boolean, List(Move))
function dfs(depth, maxDepth, p, s):
if isSolved(p):
return (True, s)
if depth == maxDepth:
return (False, [])
for m in allMoves():
p' = performMove(p, m)
(solved?, solution) = dfs(depth+1, maxDepth, p', append(s, [m]))
if solved?:
return (solved?, solution)
return (False, [])
&lt;/code>&lt;/pre>&lt;p>As discussed before, while this strategy is effective for problems
with small search spaces, it&amp;rsquo;s no help when the space is
large. Unfortunately, the &lt;code>GenericPuzzle&lt;/code> interface doesn&amp;rsquo;t give us
much room for improvement. Can we still remain generic, while giving
us at least a little more room for exploring other algorithms?&lt;/p>
&lt;p>The answer is yes, if we restrict ourselves to &lt;em>permutation
puzzles&lt;/em>. Roughly speaking, a permutation puzzle is one where pieces
shift around according to a fixed and always available set of shifting
moves. The Rubik&amp;rsquo;s Cube is a phenomenal and non-trivial example: We
can label each mobile&lt;sup id="fnref:4">&lt;a href="#fn:4" class="footnote-ref" role="doc-noteref">4&lt;/a>&lt;/sup> sticker with a number 1 to 48, and
these stickers can always be shifted around with a twist of any of the
six sides. Since we can twist any of the six sides at any time, the
puzzle is a permutation puzzle. (Not all similar puzzles are
permutation puzzles. There are some puzzles which are &amp;ldquo;bandaged&amp;rdquo;, that
is, pieces of the puzzle are fused together, restricting some
available moves depending on the configuration.)&lt;/p>
&lt;p>In this view, we look at a solved configuration as a list of
numbers. For example, the solved Rubik&amp;rsquo;s Cube as a permutation would
be&lt;/p>
&lt;p>$$
(1, 2, \ldots, 47, 48).
$$&lt;/p>
&lt;p>When we turn a side, these numbers get permuted. For instance,
assuming a particular labeling of stickers with numbers, turning the
top face of a Rubik&amp;rsquo;s Cube might permute the first sticker in the list
to the third, the second sticker to the fifth, the third sticker to
the eighth, etc. We can use the same notation:&lt;/p>
&lt;p>$$
(3, 5, 8, 2, 7, 1, \ldots)
$$&lt;/p>
&lt;p>This notation has two interpretations:&lt;/p>
&lt;ol>
&lt;li>The literal position of numbered stickers on a physical cube (with
an agreed upon labeling).&lt;/li>
&lt;li>An instruction for how to relabel the stickers of a given cube.&lt;/li>
&lt;/ol>
&lt;p>If we look at the notation under the second interpretation, a
permutation actually represents a &lt;em>function&lt;/em> that&amp;rsquo;s applied to
&lt;em>individual stickers&lt;/em>. For instance, if&lt;/p>
&lt;p>$$
F := (3, 5, 8, 2, 7, 1, \ldots)
$$&lt;/p>
&lt;p>then $F(1) = 3$, $F(2) = 5$, etc. All of the clockwise face
turns&amp;mdash;Front, Right, Up, Back, Left, Down&amp;mdash;of a Rubik&amp;rsquo;s Cube can be
described like so:&lt;/p>
&lt;p>$$
\begin{align*}
F &amp;amp;:= (1, 2, 3, 4, 5, 25, \ldots)\\
R &amp;amp;:= (1, 2, 38, 4, 36, 6, \ldots)\\
U &amp;amp;:= (3, 5, 8, 2, 7, 1, \ldots)\\
B &amp;amp;:= (14, 12, 9, 4, 5, 6, \ldots)\\
L &amp;amp;:= (17, 2, 3, 20, 5, 22, \ldots)\\
D &amp;amp;:= (1, 2, 3, 4, 5, 6, \ldots, 48, 42, 47, 41, 44, 46)
\end{align*}
$$&lt;/p>
&lt;p>We wrote some of the last elements of $D$ because a &amp;ldquo;down&amp;rdquo; move doesn&amp;rsquo;t
change the first six stickers in this labeling scheme.&lt;/p>
&lt;p>This gives us a whole new interpretation of what it means to &amp;ldquo;solve&amp;rdquo; a
cube. Given a scrambled cube, we first write down the permutation that
describes how the stickers moved from a solved state to the scrambled
state. Let&amp;rsquo;s call it $s$. This is easy, because we can just read the
labeled stickers off of a cube one-by-one, in order. For example, $s$
might be:&lt;/p>
&lt;p>$$
s := (27, 42, 30, 15, 39, 6, \ldots).
$$&lt;/p>
&lt;p>&lt;em>This is a description of a function!&lt;/em> The value of $s(1)$ describes
how the first sticker of a cube will be shifted to its scrambled
position, in this case $27$. Next, solving a cube is finding a
sequence of $k$ moves $m_1, m_2, \ldots, m_k$ such that, for all $1\leq
i\leq 48$,&lt;/p>
&lt;p>$$
i = m_k(m_{k-1}(\cdots(m_2(m_1(s(i)))))).
$$&lt;/p>
&lt;p>Stated another way in function composition notation, the function&lt;/p>
&lt;p>$$
m_k \circ m_{k-1} \circ \cdots \circ m_2 \circ m_1\circ s
$$&lt;/p>
&lt;p>must be the identity function&amp;mdash;a permutation that doesn&amp;rsquo;t move
anything.&lt;/p>
&lt;p>In the permutation puzzle way of thinking, we can still implement our
&lt;code>GenericPuzzle&lt;/code> interface:&lt;/p>
&lt;ul>
&lt;li>&lt;code>State&lt;/code> would be a permutation;&lt;/li>
&lt;li>&lt;code>Move&lt;/code> would also be a permutation;&lt;/li>
&lt;li>&lt;code>isSolved&lt;/code> would check if a permutation is $(1, 2, 3, \ldots)$;&lt;/li>
&lt;li>&lt;code>allMoves&lt;/code> would be a hard-coded list of the possible moves, like
$F$, $R$, $U$, $B$, $L$, and $D$ for the Rubik&amp;rsquo;s cube; and&lt;/li>
&lt;li>&lt;code>performMove&lt;/code> would take the input move permutation, and apply it as
a function to each element of the state permutation.&lt;/li>
&lt;/ul>
&lt;p>This might even be &lt;em>more&lt;/em> efficient than another choice of
representation, since permutations can be represented very efficiently
on a computer as packed arrays of bytes!&lt;/p>
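&lt;p>To make this concrete, here is a minimal Python sketch of the permutation-based &lt;code>GenericPuzzle&lt;/code> operations. The permutations are 0-indexed tuples, and the two moves are hypothetical toy moves on a 4-element puzzle, not actual cube turns:&lt;/p>

```python
# Minimal sketch of the GenericPuzzle interface for permutation
# puzzles. Permutations are 0-indexed tuples: perm[i] is where
# element i is sent. The two moves below define a toy 4-element
# puzzle, not a real Rubik's Cube.

def identity(n):
    return tuple(range(n))

def is_solved(state):
    return state == identity(len(state))

def apply_perm(move, state):
    # Apply `move` as a function to each element of `state`.
    return tuple(move[i] for i in state)

ALL_MOVES = {
    "A": (1, 0, 2, 3),   # swap the first two elements
    "B": (0, 2, 3, 1),   # 3-cycle of the last three elements
}

s = apply_perm(ALL_MOVES["A"], identity(4))
s = apply_perm(ALL_MOVES["A"], s)  # A is its own inverse
print(is_solved(s))  # True
```

&lt;p>Because tuples are compact and hashable, this representation also makes states cheap to store and compare.&lt;/p>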
&lt;p>But we didn&amp;rsquo;t do all this mathematical groundwork just to goof around;
there&amp;rsquo;s something amazing lurking in these permutations.&lt;/p>
&lt;h2 id="brute-force-still-ignorant-but-kinda-smart">Brute-force, still ignorant, but kinda smart?&lt;/h2>
&lt;p>In the late 1980s, Adi Shamir&lt;sup id="fnref:5">&lt;a href="#fn:5" class="footnote-ref" role="doc-noteref">5&lt;/a>&lt;/sup> and his students made a
brilliant series of observations that came together to make for a
beautiful result. Unfortunately, to my knowledge, only two writings
exist on the topic.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Shamir and his colleagues wrote a paper about it [1], sort of in
the style of a brief conference proceeding, but it&amp;rsquo;s very light on
details and skips implementation considerations. It&amp;rsquo;s the kind of
paper where you follow it, but you have to fill in a great number
of blanks to make anything from it work.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Shamir gave a talk sometime in the &amp;rsquo;80s about his result, and
somebody (none other than Alan Bawden) wrote a brief email [2] to a
mailing list about his recollection of it.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>An amazing result, buried in history, without any good exposition that
I could find.&lt;/p>
&lt;p>What&amp;rsquo;s the result? The essence of the result is this. Reminiscent of a
&amp;ldquo;meet in the middle&amp;rdquo; algorithm, if we want to brute-force a problem
that ordinarily requires visiting $N$ states to find an answer, we can
instead cleverly split the work into two searches that require visits
to around $\sqrt{N}$ states. For a Rubik&amp;rsquo;s Cube, that cuts work
associated with 43 quintillion states, down to work associated with 6
billion states. The best part is, this is &lt;em>still brute-force&lt;/em>;
virtually no knowledge of the structure of the problem is required to
make it work.&lt;/p>
&lt;p>Let&amp;rsquo;s walk through the requisite steps and build up to the
result. I&amp;rsquo;ll attempt to write in a general framework (since it&amp;rsquo;s a
general algorithm), but make frequent appeals to the Rubik&amp;rsquo;s Cube
specifically.&lt;/p>
&lt;h3 id="observation-1-decomposition-as-intersection">Observation #1: decomposition as intersection&lt;/h3>
&lt;p>Suppose the following:&lt;/p>
&lt;ul>
&lt;li>We have a mysterious permutation $s$, say, a scrambled puzzle;&lt;/li>
&lt;li>We have two sets of permutations $X$ and $Y$; and&lt;/li>
&lt;li>We assume there&amp;rsquo;s an $\hat x\in X$ and $\hat y\in Y$ such that $s =
\hat y\circ \hat x$.&lt;/li>
&lt;/ul>
&lt;p>The goal is to find precisely what $\hat x$ and $\hat y$ are. The simplest
way to do this is to check every combination of elements in $X$ and
$Y$.&lt;/p>
&lt;pre tabindex="0">&lt;code>for x in X:
for y in Y:
when s = compose(y, x):
return (x, y)
&lt;/code>&lt;/pre>&lt;p>This will take time proportional to the product of the set sizes:
$O(\vert X\vert\cdot\vert Y\vert)$. Shamir noticed the following: If
$s=\hat y\circ\hat x$, then $\hat y^{-1}\circ s = \hat x$. With this, we
preprocess our $Y$ set to be instead&lt;/p>
&lt;p>$$
Y' := \{y^{-1}\circ s : y\in Y\}.
$$&lt;/p>
&lt;p>By doing this, there must be an element in common between $X$ and
$Y'$, since $\hat x\in X$ and $\hat y^{-1}\circ s\in Y'$ and those are
equal. So we&amp;rsquo;ve reduced the problem to determining what the
intersection between $X$ and $Y'$ is.&lt;/p>
&lt;p>Once we find our $z$ which is in common between $X$ and $Y'$, then our
recovered permutation will be $\hat x = z$ and $\hat y = (z\circ
s^{-1})^{-1}$.&lt;/p>
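&lt;p>Observation #1 translates almost directly into code. Here is a hedged sketch on tiny 3-element permutations (the sets $X$, $Y$, and the scramble $s$ are arbitrary examples), with permutations as 0-indexed tuples and &lt;code>compose(f, g)&lt;/code> meaning $f\circ g$:&lt;/p>

```python
# Sketch of Observation #1: decomposing s = y∘x over sets X and Y
# reduces to intersecting X with Y' = {invert(y)∘s : y in Y}.
from itertools import permutations

def compose(f, g):
    return tuple(f[g[i]] for i in range(len(g)))

def invert(f):
    inv = [0] * len(f)
    for i, fi in enumerate(f):
        inv[fi] = i
    return tuple(inv)

X = set(permutations(range(3)))      # all 3-element permutations
Y = {(1, 0, 2), (0, 2, 1)}
s = compose((1, 0, 2), (2, 0, 1))    # a scramble with a known split

Yprime = {compose(invert(y), s) for y in Y}
for z in X & Yprime:                 # each z is a candidate x-hat
    x_hat = z
    y_hat = invert(compose(z, invert(s)))
    assert compose(y_hat, x_hat) == s
```

&lt;p>Every element of the intersection yields a valid decomposition, which is why the assertion holds inside the loop.&lt;/p>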
&lt;p>We&amp;rsquo;ve just established that the problem of decomposing an element like
$s$ is identical to the problem of calculating a set
intersection. Still, if we want to do the intersection, our intuition
tells us we still need a quadratic algorithm, which brings us to the
second observation.&lt;/p>
&lt;h3 id="observation-2-sorting-really-helps">Observation #2: sorting really helps!&lt;/h3>
&lt;p>Permutations have a natural ordering, called &lt;em>lexicographic
ordering&lt;/em>. If you have two permutations, and you read their elements
left-to-right, you can compare them like ordinary numbers. Just
as $123 &amp;lt; 213$, we can say that&lt;/p>
&lt;p>$$
(1,2,3) &amp;lt; (2,1,3).
$$&lt;/p>
&lt;p>A nice property of this is that the identity permutation $(1, 2, 3,
\ldots)$ is the smallest permutation of a given size.&lt;/p>
&lt;p>How does this help us? Well, suppose we sort our sets $X$ and $Y'$
into lists $L_X$ and $L_{Y'}$, so the permutations are in order. If
$L_X$ and $L_{Y'}$ have an element in common, we can find it in linear
time: $O(\min\{\vert X\vert, \vert Y'\vert\})$. How? Something like
the following:&lt;/p>
&lt;pre tabindex="0">&lt;code>function findCommon(Lx, Ly):
x = pop(Lx)
y = pop(Ly)
loop:
if x == y:
return x
if empty(Lx) or empty(Ly):
error(&amp;quot;No common elements found.&amp;quot;)
if x &amp;lt; y:
x = pop(Lx)
else if x &amp;gt; y:
y = pop(Ly)
&lt;/code>&lt;/pre>&lt;p>This works because we are essentially looking at all of the elements
of $L_X$ and $L_{Y'}$ together in sorted order. It&amp;rsquo;s like a merge
sort, without the merge part.&lt;/p>
&lt;p>As written, &lt;code>findCommon&lt;/code> computes just one element of the intersection.
Instead of returning, the loop could continue to enumerate all elements.
This is useful to know for the purpose of solving permutation puzzles:
Do we want just some solution, or do we want all solutions? That answer,
of course, depends on the application.&lt;/p>
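&lt;p>As a sketch of this linear-time walk: Python tuples already compare lexicographically, so sorted lists of tuple permutations can be intersected with a single merge-style pass (the two lists below are arbitrary examples):&lt;/p>

```python
# Sketch of Observation #2: a merge-style walk over two sorted
# lists of permutations finds a common element in linear time.

def find_common(lx, ly):
    i, j = 0, 0
    while i < len(lx) and j < len(ly):
        if lx[i] == ly[j]:
            return lx[i]
        if lx[i] < ly[j]:      # lexicographic tuple comparison
            i += 1
        else:
            j += 1
    raise ValueError("No common elements found.")

lx = sorted([(0, 1, 2), (1, 2, 0), (2, 0, 1)])
ly = sorted([(0, 2, 1), (1, 2, 0), (2, 1, 0)])
print(find_common(lx, ly))  # (1, 2, 0)
```

&lt;p>Replacing the &lt;code>return&lt;/code> with a &lt;code>yield&lt;/code> would enumerate the whole intersection instead of just one element.&lt;/p>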
&lt;p>Before continuing, we should take a little scenic tour through the more
formal meaning of &amp;ldquo;moves&amp;rdquo; and &amp;ldquo;move sequences&amp;rdquo;, since ultimately any
permutation puzzle solving algorithm must produce them as output.&lt;/p>
&lt;h3 id="what-is-a-move">What is a move?&lt;/h3>
&lt;p>A quick bit about notation. If we have a permutation $f$, then its
inverse is written $f^{-1}$, and it&amp;rsquo;s $k$-fold repetition $f\circ
f\circ\cdots\circ f$ is written $f^k$. If we have a collection of
permutations $S := \{f_1, f_2, \ldots\}$, then we write the
following shorthands:&lt;/p>
&lt;p>$$
\begin{align*}
S^{-1} &amp;amp;:= \{f^{-1} : f \in S\}\\
S^{\times k} &amp;amp;:= \{f^k : f \in S\}.
\end{align*}
$$&lt;/p>
&lt;p>If $g$ is some permutation, we also write these shorthands:&lt;/p>
&lt;p>$$
\begin{align*}
g\circ S &amp;amp;:= \{g\circ f : f \in S\}\\
S\circ g &amp;amp;:= \{f\circ g : f \in S\}.
\end{align*}
$$&lt;/p>
&lt;p>Similarly, if $T := \{g_1, g_2, \ldots\}$, then we can write&lt;/p>
&lt;p>$$
\begin{align*}
S\circ T &amp;amp;:= \{f\circ g : f\in S, g\in T\}\\
&amp;amp;= \{f_1\circ g_1, f_2\circ g_1, \ldots, f_1\circ g_2, \ldots\}.
\end{align*}
$$&lt;/p>
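&lt;p>These shorthands translate directly into set comprehensions. Here is an illustrative sketch (0-indexed tuple permutations; the sample set $S$ containing a single 3-cycle is arbitrary):&lt;/p>

```python
# Sketch of the move-set shorthands: S⁻¹, S^×k, g∘S, and S∘T,
# written as set comprehensions over tuple permutations.

def compose(f, g):
    return tuple(f[g[i]] for i in range(len(g)))

def invert(f):
    inv = [0] * len(f)
    for i, fi in enumerate(f):
        inv[fi] = i
    return tuple(inv)

def power(f, k):
    result = tuple(range(len(f)))  # identity
    for _ in range(k):
        result = compose(f, result)
    return result

def set_inverse(S):                 # S⁻¹
    return {invert(f) for f in S}

def set_power(S, k):                # S^×k
    return {power(f, k) for f in S}

def left_compose(g, S):             # g∘S
    return {compose(g, f) for f in S}

def set_product(S, T):              # S∘T
    return {compose(f, g) for f in S for g in T}

S = {(1, 2, 0)}                          # a single 3-cycle
assert set_power(S, 3) == {(0, 1, 2)}    # a 3-cycle cubed is identity
assert set_inverse(S) == set_power(S, 2)
```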
&lt;p>With that out of the way, let&amp;rsquo;s talk about the concept of a single
&amp;ldquo;move&amp;rdquo;. What counts as a &amp;ldquo;move&amp;rdquo; in a permutation puzzle?&lt;/p>
&lt;p>Really, we can choose any set of moves we please, so long as every
state of the puzzle is reachable through some combination of the
moves. For example, let&lt;/p>
&lt;p>$$
C := \{F, R, U, B, L, D\},
$$&lt;/p>
&lt;p>the basic and well understood ninety-degree clockwise moves of the
Rubik&amp;rsquo;s Cube. Indeed, $C$ itself is a fine definition of available
moves. All of the following are also valid definitions of moves:&lt;/p>
&lt;p>$$
C\cup C^{-1},\quad C\cup C^{\times 2},\quad C^{-1},\quad C\cup C^{\times 2}\cup C^{-1},
$$&lt;/p>
&lt;p>and so on. Perhaps surprisingly, we can take any element of $C$ and
remove it, and it would still be a valid set of moves for the Rubik&amp;rsquo;s
Cube&lt;sup id="fnref:6">&lt;a href="#fn:6" class="footnote-ref" role="doc-noteref">6&lt;/a>&lt;/sup>!&lt;/p>
&lt;p>Which set of moves we select usually has little relevance
mathematically (they are all expressible as one another), but has
great relevance when we are synthesizing efficient move sequences, or when
we want to talk about &amp;ldquo;optimality&amp;rdquo;. For instance, consider a
counterclockwise move: $F^{-1}$. It&amp;rsquo;s natural to consider this a
single move, but if we consider our set to be $C$, then we&amp;rsquo;d have to
count it as three moves, since $F^{-1} = F\circ F\circ F = F^3$. What
about $F^2$? Is that one move or two? Speedcubers generally consider
$F^2$ to be one motion, so counting that as one move is natural, but
many computer puzzlers like the simplicity of $C\cup C^{-1}$, i.e.,
only ninety-degree turns&lt;sup id="fnref:7">&lt;a href="#fn:7" class="footnote-ref" role="doc-noteref">7&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>For the rest of this note, we&amp;rsquo;ll be in the former camp, where half-turns count as one, and we&amp;rsquo;ll denote this set of moves as:&lt;/p>
&lt;p>$$
\bar C := C \cup C^{-1} \cup C^{\times 2}.
$$&lt;/p>
&lt;h3 id="what-is-a-word">What is a word?&lt;/h3>
&lt;p>After we agree on what we consider a move, we can be more specific as
to what we mean about move sequences. A &lt;em>move sequence&lt;/em> is a possibly
empty list of moves. A move sequence can be &lt;em>composed&lt;/em> to form the
permutation it represents. This composition operator is called
$\kappa$, and is easily defined. Let $M$ be a move set, and let $s =
[s_1, s_2, \ldots, s_n]$ be a sequence of $n$ moves with each
$s_{\bullet}$ a move from $M$. The &lt;em>length&lt;/em> of $s$ is naturally $n$,
and its composition is defined as:&lt;/p>
&lt;p>$$
\begin{align*}
\kappa([\,]) &amp;amp;:= (1, 2, 3, \ldots)\\
\kappa([s_1, s_2, \ldots, s_{n-1}, s_n]) &amp;amp;:= \kappa([s_1, s_2, \ldots, s_{n-1}])\circ s_n.
\end{align*}
$$&lt;/p>
&lt;p>If $M$ is a move set, then the set of all move sequences (including
the empty sequence) is denoted $M^{*}$, a notation kindly borrowed
from formal language theory.&lt;/p>
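&lt;p>The recursive definition of $\kappa$ unrolls into a left fold over the sequence. Here is a small sketch, using two hypothetical toy moves on a 3-element puzzle rather than real cube turns:&lt;/p>

```python
# Sketch of the composition operator κ: the empty sequence maps to
# the identity, and κ([s1,...,sn]) = κ([s1,...,s(n-1)]) ∘ sn,
# which is exactly a left fold with compose.
from functools import reduce

def compose(f, g):
    return tuple(f[g[i]] for i in range(len(g)))

def kappa(moves, n):
    return reduce(compose, moves, tuple(range(n)))

A = (1, 0, 2)   # hypothetical move: swap the first two elements
B = (0, 2, 1)   # hypothetical move: swap the last two elements

assert kappa([], 3) == (0, 1, 2)          # empty sequence is identity
assert kappa([A, B], 3) == compose(A, B)  # κ([A, B]) = A ∘ B
assert kappa([A, A], 3) == (0, 1, 2)      # A is an involution
```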
&lt;p>If we identify the elements of $M$ with symbols, then a move sequence
is called a &lt;em>word&lt;/em>. We&amp;rsquo;ll always type symbols in $\texttt{typewriter}$
font. The moves $\{F, R, U, B, L, D\}$ have the symbols
$\{\texttt{F}, \texttt{R}, \texttt{U}, \texttt{B}, \texttt{L},
\texttt{D}\}$, an inverse $F^{-1}$ has the symbol $\texttt{F'}$, and
a square $F^2$ has the symbol $\texttt{F2}$. And we type words as
symbols joined together in &lt;em>reverse&lt;/em> order&lt;sup id="fnref:8">&lt;a href="#fn:8" class="footnote-ref" role="doc-noteref">8&lt;/a>&lt;/sup>, so $[R^{-1},
U^2, L]$ can be represented by the word $\texttt{L U2 R'}$.&lt;/p>
&lt;p>The distinction is subtle but important. In a computer program, a move
sequence is a list of permutations, while a word is a list of
symbols. A Rubik&amp;rsquo;s Cube solving program should take as input a
permutation, and output a word which when composed as a move sequence,
brings that permutation to identity.&lt;/p>
&lt;p>When doing math, we often mix up all of these concepts since they have
little bearing on the correctness of an argument. Whether it&amp;rsquo;s the
permutation $F\circ R^{-1}$ or the move sequence $[F, R^{-1}]$ or the
word $\texttt{R' F}$ or otherwise, they all represent roughly
the same thing, but computers need to be explicit about which
representation is being manipulated.&lt;/p>
&lt;p>So, in summary:&lt;/p>
&lt;ul>
&lt;li>A &lt;strong>move set&lt;/strong> is a set of permutations that &amp;ldquo;count&amp;rdquo; as one move.&lt;/li>
&lt;li>A &lt;strong>move sequence&lt;/strong> is a list of moves from a move set.&lt;/li>
&lt;li>The &lt;strong>composition&lt;/strong> of a move sequence is the permutation that move sequence represents.&lt;/li>
&lt;li>A &lt;strong>symbol&lt;/strong> is a designator for a move in a move set.&lt;/li>
&lt;li>A &lt;strong>word&lt;/strong> is a sequence of symbols.&lt;/li>
&lt;/ul>
&lt;p>Back to this brute-force thing&amp;hellip;&lt;/p>
&lt;h3 id="observation-3-sorting-as-solving">Observation #3: sorting as solving&lt;/h3>
&lt;p>As silly as the example is, let&amp;rsquo;s suppose we know, for a fact, that a
Rubik&amp;rsquo;s Cube was mixed up using only six moves from $\bar C$. Since
$\bar C$ has 18 elements, without any optimization, we might have to
try $18^6$ move sequences to find a solution.&lt;/p>
&lt;p>Instead of brute-forcing in that way, we can do another trick. Let &lt;code>s&lt;/code>
be our scrambled permutation.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Write out every combination of 3 moves into a table. The key would
be the permutation, and the value would be the word associated with
that permutation. Call this table &lt;code>A&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sort &lt;code>A&lt;/code> in ascending lexicographic order on the permutation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make a copy of &lt;code>A&lt;/code>, call it &lt;code>B&lt;/code>. For all &lt;code>(perm, word)&lt;/code> in &lt;code>B&lt;/code>,
reassign &lt;code>perm := compose(invert(perm), s)&lt;/code>. We do this because of
Observation #1.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Sort &lt;code>B&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Call &lt;code>x := findCommon(A, B)&lt;/code>. We do this via Observation #2.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Reconstruct a word equal to &lt;code>s&lt;/code> by &lt;code>A[x].word ++ reverse(B[x].word)&lt;/code>. We do this to recover a final result via Observation
#1.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Since we have a word that brings us &lt;em>from solved to &lt;code>s&lt;/code>&lt;/em>, we can
invert the word to bring us &lt;em>from &lt;code>s&lt;/code> to solved&lt;/em>.&lt;/p>
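&lt;p>The six steps translate almost directly into code. Here is an illustrative Python sketch that meets in the middle with a toy &amp;ldquo;move set&amp;rdquo; of integers under addition (so composition is a sum and inversion is negation); the table-building, sorting, and merging follow the steps above:&lt;/p>

```python
from itertools import product

# Toy stand-in: "moves" are integers under addition, so composition is +
# and inversion is negation. The bookkeeping mirrors the six steps above.
MOVES = {"+1": 1, "+2": 2, "-3": -3}

def three_move_table():
    # Steps 1 and 2: tabulate (composition, word) for every 3-move word,
    # then sort on the composition.
    rows = []
    for word in product(MOVES, repeat=3):
        rows.append((sum(MOVES[m] for m in word), list(word)))
    rows.sort()
    return rows

def solve(s):
    A = three_move_table()
    # Steps 3 and 4: B's keys become invert(perm) "composed" with s.
    B = sorted((s - comp, word) for comp, word in A)
    # Step 5: findCommon via a linear merge of the two sorted tables.
    i = j = 0
    while i < len(A) and j < len(B):
        if A[i][0] == B[j][0]:
            # Step 6: stitch the two half-words into a 6-move solution.
            return A[i][1] + B[j][1]
        if A[i][0] < B[j][0]:
            i += 1
        else:
            j += 1
    return None
```

&lt;p>(Since addition commutes, the toy version can skip the word reversal that the permutation version needs.)&lt;/p>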
&lt;p>By this method, we avoided visiting all $18^6$ move sequences by
instead pre-calculating two groups of $18^3$ sequences and exploring
them for an intersection. We have cut the amount of work down to its
square root.&lt;/p>
&lt;p>If we generalize to length $n+m$ (for some splitting of $n$ and $m$),
then we can replace the work of visiting $18^{n+m}$ states with
$18^m + 18^n$ states, which is much better.&lt;/p>
&lt;p>So we&amp;rsquo;re done? We now know that the Rubik&amp;rsquo;s Cube requires no more than
20 moves, so if we make two tables enumerating 10 moves, we should be
good?&lt;/p>
&lt;p>Well, err, $18^{10} = 3,570,467,226,624$. Unless we have trillions of
resources to spare, be it time or space, it&amp;rsquo;s still not going to work.&lt;/p>
&lt;h3 id="more-splitting">More splitting?&lt;/h3>
&lt;p>An enterprising computer science student, at this point, might smell
recursion. If we split once, can we split again? If we know a Rubik&amp;rsquo;s
Cube can be solved in 20 moves, can we split it into two 10 move
problems, and each of those into two 5 move problems?&lt;/p>
&lt;p>The problem with this is that at the top layer of recursion, it&amp;rsquo;s
clear what we are solving. At lower layers, it&amp;rsquo;s no longer clear. What
&lt;em>actually&lt;/em> is the recursive structure at play? And if we could do this
trick, couldn&amp;rsquo;t we decimate any brute-force problem of exponential
complexity (e.g., in number of moves) into one of linear?&lt;/p>
&lt;p>That isn&amp;rsquo;t going to work, but we can be inspired by it. Let $L$ be the
set of 5-or-fewer-move combinations from $\bar C$, that is,&lt;/p>
&lt;p>$$
L := \bigcup_{i=0}^5 \bar C^i.
$$&lt;/p>
&lt;p>The size of $L$ is going to be $621,649$ if we don&amp;rsquo;t store redundant
permutations. This is definitely possible to compute. Then our goal is
to find a decomposition of $s$ in terms of an element in $L\circ
L\circ L\circ L$. Using the same trick from Observation #1, suppose
there is a decomposition $$s = l_4\circ l_3\circ l_2\circ l_1.$$ Then
$$l_3^{-1}\circ l_4^{-1} \circ s = l_2\circ l_1.$$ So we create four
tables:&lt;/p>
&lt;ul>
&lt;li>$L_1 = L$,&lt;/li>
&lt;li>$L_2 = L_1$,&lt;/li>
&lt;li>$L_4 = L_1^{-1}$, and&lt;/li>
&lt;li>$L_3 = L_4\circ s$.&lt;/li>
&lt;/ul>
&lt;p>No, the $4$ before $3$ is not a typo! We define the tables in this
order to save computation and avoid redundant work. Now our goal is to
find an element in common between the two sets&lt;/p>
&lt;p>$$
\begin{align*}
X &amp;amp;= L_2 \circ L_1\\
Y &amp;amp;= L_4 \circ L_3.
\end{align*}
$$&lt;/p>
&lt;p>Somehow, we must do this without actually calculating all elements of
$L_i\circ L_j$. And, to add insult to injury, for &lt;code>findCommon&lt;/code> to
work, we need to be able to go through these sets in sorted order.&lt;/p>
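&lt;p>As a reminder of the target: &lt;code>findCommon&lt;/code> itself only needs its two inputs to arrive one element at a time in sorted order. A Python sketch (an illustrative stand-in, not the real implementation) is a single merge pass:&lt;/p>

```python
def find_common(xs, ys):
    """Return the first key produced by both ascending generators,
    or None if they never agree. Runs in a single linear merge pass."""
    x = next(xs, None)
    y = next(ys, None)
    while x is not None and y is not None:
        if x == y:
            return x
        if x < y:
            x = next(xs, None)
        else:
            y = next(ys, None)
    return None
```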
&lt;h3 id="iterating-through-products-with-schroeppel--shamir">Iterating through products with Schroeppel&amp;ndash;Shamir&lt;/h3>
&lt;p>Suppose we have two lists of positive numbers $A$ and $B$. How can we
print the elements of $\{a+b : a\in A, b\in B\}$ in numerical order
without explicitly constructing and sorting this set? Shamir and his
collaborator Schroeppel did so with the following algorithm.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Sort $A$ in ascending order. Pop off the first (and therefore
smallest) element $a_1$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Create a priority queue $Q$ and initialize it with $(a_1,b)$ with
priority $a_1 + b$ for all $b\in B$.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Repeat the following until $Q$ is empty:&lt;/p>
&lt;ol>
&lt;li>Pop $(a,b)$ off $Q$. This will form the next smallest sum, so print $a+b$.&lt;/li>
&lt;li>Find $a'$ which immediately succeeds $a$ in our sorted list $A$.&lt;/li>
&lt;li>Push $(a',b)$ with priority $a'+b$ onto $Q$.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
&lt;p>This algorithm will terminate, having printed each sum in order using
at most $O(\vert A\vert + \vert B\vert)$ space and time nearly linear
in the number of sums printed. (The sorting and priority-queue
maintenance contribute some logarithmic factors.)&lt;/p>
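&lt;p>As an illustrative Python sketch, with a binary heap standing in for the priority queue, the algorithm looks like this:&lt;/p>

```python
import heapq

def sorted_sums(A, B):
    """Yield every a+b for (a, b) in A x B in ascending order, holding
    only O(|A| + |B|) state: the sorted copy of A plus one heap entry
    per element of B."""
    A = sorted(A)
    # Step 2: seed the queue by pairing the smallest a with every b.
    heap = [(A[0] + b, 0, b) for b in B]
    heapq.heapify(heap)
    while heap:
        total, i, b = heapq.heappop(heap)   # step 3.1: next smallest sum
        yield total
        if i + 1 < len(A):                  # step 3.2: successor of a in A
            heapq.heappush(heap, (A[i + 1] + b, i + 1, b))  # step 3.3
```

&lt;p>For example, &lt;code>list(sorted_sums([3, 1], [10, 20]))&lt;/code> produces &lt;code>[11, 13, 21, 23]&lt;/code>.&lt;/p>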
&lt;p>With a little work, one can see why this works. In a sense it&amp;rsquo;s a
two-dimensional sorting problem that depends on one crucial fact: if
$x \le y$ then $x+z \le y+z$. (This is to say that addition is
&lt;em>monotonic&lt;/em>.) Given how the priority queue is constructed, it will
&lt;em>always&lt;/em> contain the smallest not-yet-printed sum.&lt;/p>
&lt;p>Could we do this with permutations? If we have two lists of
permutations $A$ and $B$, and $a_1$ is the &amp;ldquo;smallest&amp;rdquo; (i.e.,
lexicographically least) permutation of $A$, and $b_1$ is the
&amp;ldquo;smallest&amp;rdquo; permutation of $B$, then it is &lt;strong>patently not true&lt;/strong> that
$a_1\circ b_1$ is the smallest element of $A\circ B$. In symbols,&lt;/p>
&lt;p>$$
(\min A) \circ (\min B) \neq \min (A\circ B).
$$&lt;/p>
&lt;p>Similarly, if two permutations satisfy $a &amp;lt; b$, then it is &lt;strong>patently
not true&lt;/strong> that&lt;/p>
&lt;p>$$
a\circ z &amp;lt; b\circ z
$$&lt;/p>
&lt;p>for a permutation $z$.&lt;/p>
&lt;p>The monotonicity of addition is what allows us to do steps 3.2 and 3.3
so easily. If we did the same with permutations, we would no longer
have the guarantee that the minimum composition exists within the
queue.&lt;/p>
&lt;p>This was the next hurdle Shamir cleared. In time that doesn&amp;rsquo;t grow
with the sizes of $A$ or $B$, Shamir found a way to solve the
following problem: Given a permutation $a\in A$ and $b\in B$, find the
element $b'\in B$ such that $a\circ b'$ immediately succeeds $a\circ
b$. In other words, we can generate, one-by-one, the sequence of
$b$&amp;rsquo;s needed for steps 3.2 and 3.3. With this algorithm (which
we&amp;rsquo;ll describe in the next section), our Shamir&amp;ndash;Schroeppel
algorithm for permutations becomes the following:&lt;/p>
&lt;p>&lt;strong>Algorithm (Walk Products)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>Initialize an empty priority queue $Q$ whose elements are pairs of
permutations with priority determined by another permutation in
lexicographic ordering.&lt;/li>
&lt;li>For each permutation $a\in A$:
&lt;ol>
&lt;li>With Shamir&amp;rsquo;s trick, find the $b\in B$ such that $a\circ b = \min (a\circ B)$.&lt;/li>
&lt;li>Push $(a, b)$ onto $Q$ with priority $a\circ b$.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>(Invariant: At this point, we will certainly have $\min (A\circ B)$ in the queue.)&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Repeat the following until $Q$ is empty:
&lt;ol>
&lt;li>Pop $(a,b)$ off $Q$. This will form the next smallest $a\circ b$, so print it&lt;sup id="fnref:9">&lt;a href="#fn:9" class="footnote-ref" role="doc-noteref">9&lt;/a>&lt;/sup>.&lt;/li>
&lt;li>With Shamir&amp;rsquo;s trick, find $b'$ such that $a\circ b'$ immediately succeeds $a\circ b$.&lt;/li>
&lt;li>Push $(a,b')$ with priority $a\circ b'$ onto $Q$.&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ol>
&lt;p>This algorithm will produce the elements of $A\circ B$, one-by-one in
lexicographic order.&lt;/p>
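&lt;p>Here is an illustrative Python sketch of &lt;strong>Walk Products&lt;/strong>. It cheats in one place: where the real algorithm uses Shamir&amp;rsquo;s trick to stream each $a\circ B$ lazily out of a trie, this sketch pre-sorts each stream, which keeps the queue discipline visible without the trie machinery.&lt;/p>

```python
import heapq

def compose(a, b):
    """(a o b)(i) = a(b(i)); permutations are tuples of 1-based images."""
    return tuple(a[x - 1] for x in b)

def walk_products(A, B):
    """Yield a o b for every (a, b) in A x B, in lexicographic order of
    the product, keeping one pending (a, b) pair per element of A."""
    streams, heap = {}, []
    for i, a in enumerate(A):
        # Stand-in for Shamir's trick: the b's ordered by a o b. The
        # permuted trie walk produces this same sequence lazily.
        it = iter(sorted(B, key=lambda b, a=a: compose(a, b)))
        streams[i] = (a, it)
        heapq.heappush(heap, (compose(a, next(it)), i))
    while heap:
        prod, i = heapq.heappop(heap)       # next smallest a o b
        yield prod
        a, it = streams[i]
        b = next(it, None)                  # successor b' so a o b' is next
        if b is not None:
            heapq.heappush(heap, (compose(a, b), i))
```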
&lt;p>What is Shamir&amp;rsquo;s trick? We need a data structure and a clever observation.&lt;/p>
&lt;h3 id="permutation-tries">Permutation tries&lt;/h3>
&lt;p>In order to handle sets of ordered permutations better, Shamir created
a data structure. I call it a permutation trie. A &lt;em>permutation trie&lt;/em>
of size-$k$ permutations is a $k$-deep, $k$-ary tree, such that a
path from root-to-leaf follows the elements of a permutation. The leaf
contains data which we want to associate with the permutation.&lt;/p>
&lt;p>For example, consider permutations of size $5$. Suppose we wanted to
associate the symbol $\texttt{p6}$ with the permutation
$(2,4,1,3,5)$. Then we would have a $5$-layer tree with a root node
$R$, such that $R[2][4][1][3][5] = \texttt{p6}$.&lt;/p>
&lt;p>More generally, let&amp;rsquo;s associate the following symbols with the
following permutations in a permutation trie $R$:&lt;/p>
&lt;p>$$
\begin{align*}
\texttt{p1} &amp;amp;\leftarrow (1,2,3,4,5) &amp;amp; \texttt{p2} &amp;amp;\leftarrow (1,2,3,5,4) &amp;amp; \texttt{p3} &amp;amp;\leftarrow (1,2,4,3,5)\\
\texttt{p4} &amp;amp;\leftarrow (1,2,5,3,4) &amp;amp; \texttt{p5} &amp;amp;\leftarrow (1,3,4,5,2) &amp;amp; \texttt{p6} &amp;amp;\leftarrow (2,4,1,3,5)\\
\texttt{p7} &amp;amp;\leftarrow (4,1,3,2,5) &amp;amp; \texttt{p8} &amp;amp;\leftarrow (4,1,3,5,2) &amp;amp; \texttt{p9} &amp;amp;\leftarrow (5,1,2,3,4)\\
\end{align*}
$$&lt;/p>
&lt;p>The trie would be a data structure that looks like this:&lt;/p>
&lt;div style="text-align: center;">
&lt;img
src="images/perm-trie.svg"
alt="An example permutation trie."
decoding="async"
/>
&lt;/div>
&lt;p>Even though we don&amp;rsquo;t show them, conceptually, each node in the trie
has a full length-$5$ array, with some elements empty (i.e., there are
no children).&lt;/p>
&lt;p>What&amp;rsquo;s good about this data structure? First and foremost, pre-order
traversal will visit the permutations in lexicographic order. We can
use this data structure to store two things at the leaves (i.e.,
$\texttt{p}n$):&lt;/p>
&lt;ol>
&lt;li>The actual permutation data structure representing that path, and&lt;/li>
&lt;li>The word we used to construct that permutation.&lt;/li>
&lt;/ol>
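&lt;p>A sparse Python sketch of the structure (dicts standing in for the per-node arrays; the real thing is CL-PERMUTATION&amp;rsquo;s &lt;code>perm-tree&lt;/code>):&lt;/p>

```python
def trie_insert(trie, perm, value):
    """Store `value` at the root-to-leaf path spelled by `perm`. Dicts
    stand in for each node's sparse length-k array."""
    node = trie
    for x in perm[:-1]:
        node = node.setdefault(x, {})
    node[perm[-1]] = value

def trie_walk(trie, path=()):
    """Pre-order traversal in index order: yields (perm, value) pairs in
    lexicographic order of the stored permutations."""
    for x in sorted(trie):
        child = trie[x]
        if isinstance(child, dict):
            yield from trie_walk(child, path + (x,))
        else:
            yield path + (x,), child
```

&lt;p>With the nine assignments above, &lt;code>trie_walk&lt;/code> yields $\texttt{p1}$ through $\texttt{p9}$ in exactly the listed order.&lt;/p>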
&lt;p>This is the data structure, and now we get to Shamir&amp;rsquo;s
insight. Suppose we have a permutation $s$ and a permutation trie $R$
(which represents a set of permutations), and we want to traverse
$s\circ R$ in lexicographic order. The naive way is to construct a new
trie, but we wish to avoid that. To explain the idea, we&amp;rsquo;ll choose a
concrete example.&lt;/p>
&lt;p>Let&amp;rsquo;s use $R$ from above. Let $s := (3,1,4,2,5)$. (Note that $s\not\in
R$, but that&amp;rsquo;s not important.) We wish to find an $r'\in R$ such that
$s\circ r' = \min (s\circ R)$. Well, the smallest permutation would be
one such that $r'(1) = 2$, because then $s(r'(1)) = s(2) = 1$. Looking
at our trie $R$, we can see the only candidate is that associated with
$\texttt{p6}$: $(2,4,1,3,5)$, which is the minimum.&lt;/p>
&lt;p>What about the next smallest $s\circ r''$? For ease, let&amp;rsquo;s call this
product $m$. We would want a permutation such that $r''(1) = 4$,
because $m(1) = s(r''(1)) = s(4) = 2$. This time, there are two
candidates:&lt;/p>
&lt;p>$$
(4,1,3,2,5)\qquad (4,1,3,5,2)
$$&lt;/p>
&lt;p>So at least we know $m = (2, \ldots)$. To disambiguate, we need to
look at $r''(2)$. These are the same, likewise $r''(3)$, so we have no
degree of freedom at $2$ or $3$ to minimize the product. Thus $m = (2,
3, 4, \ldots)$. We have a choice at $r''(4)$, however. The best choice
is $r''(4) = 2$, because $m(4) = s(r''(4)) = s(2) = 1$, the smallest
possible choice. This disambiguates our choice of $r''$ to be
$(4,1,3,2,5)$ so that $m = (2,3,4,1,5)$.&lt;/p>
&lt;p>We could repeat the procedure to find the next smallest product
$s\circ r'''$. What exactly is the procedure here? Well, we walked
down the tree $R$, but instead of walking down it straight, we did so
in a permuted order based on $s$&amp;mdash;specifically
$s^{-1}$. Consider our normal algorithm for walking the tree&lt;sup id="fnref:10">&lt;a href="#fn:10" class="footnote-ref" role="doc-noteref">10&lt;/a>&lt;/sup> in
lexicographic order:&lt;/p>
&lt;pre tabindex="0">&lt;code>function walkLex(R):
if notTree(R):
print R
else:
for i from 1 to length(R):
if R[i] exists:
walkLex(R[i])
&lt;/code>&lt;/pre>&lt;p>We can instead walk in &lt;em>permuted&lt;/em> order, producing a sequence
$[r', r'', r''', \ldots]$ such that&lt;/p>
&lt;p>$$
s\circ r' &amp;lt; s \circ r'' &amp;lt; s \circ r''' &amp;lt; \cdots.
$$&lt;/p>
&lt;p>To do so, we modify our walking algorithm like this:&lt;/p>
&lt;pre tabindex="0">&lt;code>function walkProductLex(R, s):
walk'(R, inverse(s))
function walk'(R, s):
if notTree(R):
print R
else:
for i from 1 to length(R):
j = s(i)
if R[j] exists:
walk'(R[j], s)
&lt;/code>&lt;/pre>&lt;p>Note that $s$ is inverted once, before the recursion, so that each node&amp;rsquo;s children can be visited in the correct permuted order quickly.&lt;/p>
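&lt;p>Here is the permuted walk as an illustrative Python sketch over a dict-based trie, with $s$ given as a tuple of 1-based images:&lt;/p>

```python
def trie_insert(trie, perm):
    """Nodes are dicts; the leaf of a full permutation is an empty dict."""
    node = trie
    for x in perm:
        node = node.setdefault(x, {})

def walk_product_lex(trie, s):
    """Yield the stored permutations r ordered lexicographically by s o r."""
    k = len(s)
    s_inv = [0] * k
    for i, x in enumerate(s):          # invert s once, up front
        s_inv[x - 1] = i + 1
    def walk(node, path):
        if len(path) == k:
            yield tuple(path)
            return
        for i in range(1, k + 1):
            j = s_inv[i - 1]           # child s^{-1}(i): makes s(r(d)) ascend
            if j in node:
                yield from walk(node[j], path + [j])
    yield from walk(trie, [])
```

&lt;p>Run on the example $R$ and $s = (3,1,4,2,5)$ from above, the first three permutations produced are $(2,4,1,3,5)$, $(4,1,3,2,5)$, and $(4,1,3,5,2)$, matching the worked example.&lt;/p>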
&lt;p>With this, we have the remarkable ability to iterate through products
in lexicographic order, without having to enumerate them all and sort
them. This was the last and critical ingredient.&lt;/p>
&lt;h3 id="the-4-list-algorithm-and-solving-the-rubiks-cube">The 4-List Algorithm and solving the Rubik&amp;rsquo;s Cube&lt;/h3>
&lt;p>Now we want to put this all together to create the &lt;em>4-List
Algorithm&lt;/em>. Let&amp;rsquo;s restate the problem in clear terms.&lt;/p>
&lt;p>&lt;strong>Problem (4-List)&lt;/strong>: Let $s$ be a permutation. Let $L_1$, $L_2$,
$L_3$, and $L_4$ be sets of permutations such that we know $s\in
L_4\circ L_3\circ L_2\circ L_1$. Find $l_1\in L_1$, $l_2\in L_2$,
$l_3\in L_3$, and $l_4\in L_4$ such that $s = l_4\circ l_3\circ
l_2\circ l_1$.&lt;/p>
&lt;p>Piecing together the elements above, we arrive at the 4-List Algorithm.&lt;/p>
&lt;p>&lt;strong>Algorithm (4-List)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>Construct $L'_3 := L_3^{-1}$ and $L'_4 := L_4^{-1}\circ s$.&lt;/li>
&lt;li>Create two generators&lt;sup id="fnref:11">&lt;a href="#fn:11" class="footnote-ref" role="doc-noteref">11&lt;/a>&lt;/sup>: $X_1$ that walks $L_2\circ L_1$ in
lexicographic order, and $X_2$ that walks $L'_3\circ L'_4$ in
lexicographic order. Do this by using the &lt;strong>Walk Products&lt;/strong>
algorithm, which itself is implemented by constructing permutation
tries and using &lt;code>walkProductLex&lt;/code>.&lt;/li>
&lt;li>Call &lt;code>findCommon&lt;/code> on $X_2$ and $X_1$. This is guaranteed to find a
solution $(l_3^{-1},l_4^{-1}\circ s,l_2,l_1)$. Process the solution
to return $(l_4, l_3, l_2, l_1)$.&lt;/li>
&lt;/ol>
&lt;p>The main difficulty of this algorithm, aside from implementing each
subroutine correctly, is plumbing the right data around.&lt;/p>
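&lt;p>To see that plumbing without the permutation machinery, here is a toy 4-list solver in Python over integers under addition, where inversion is negation. It is only an analogue: the permutation version replaces the sum-walking generator with &lt;strong>Walk Products&lt;/strong> over permutation tries.&lt;/p>

```python
import heapq

def sorted_sums(A, B):
    """Yield (a+b, a, b) over A x B in ascending order of a+b."""
    A = sorted(A)
    heap = [(A[0] + b, 0, b) for b in B]
    heapq.heapify(heap)
    while heap:
        t, i, b = heapq.heappop(heap)
        yield t, A[i], b
        if i + 1 < len(A):
            heapq.heappush(heap, (A[i + 1] + b, i + 1, b))

def four_list(s, L1, L2, L3, L4):
    """Find (l1, l2, l3, l4) with s = l4+l3+l2+l1, streaming both sides
    in sorted order and never materializing the pairwise sums."""
    X = sorted_sums(L2, L1)                       # walks L2 "o" L1
    Y = sorted_sums([-l for l in L3],             # walks the inverted,
                    [s - l for l in L4])          # s-processed other half
    x, y = next(X, None), next(Y, None)
    while x is not None and y is not None:
        if x[0] == y[0]:                          # findCommon hit
            _, l2, l1 = x
            _, nl3, sl4 = y
            return (l1, l2, -nl3, s - sl4)        # undo the processing
        if x[0] < y[0]:
            x = next(X, None)
        else:
            y = next(Y, None)
    return None
```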
&lt;p>Now, we can use this to solve a scrambled Rubik&amp;rsquo;s Cube $s$.&lt;/p>
&lt;p>&lt;strong>Algorithm (Solve Cube)&lt;/strong>:&lt;/p>
&lt;ol>
&lt;li>Let $L := \bigcup_{i=0}^5\bar C^i$, keeping a record of the words used to construct
each element of $L$. (We recommend immediately making a permutation
trie, where the leaves store the words.)&lt;/li>
&lt;li>Apply the &lt;strong>4-List Algorithm&lt;/strong> to the problem $s \in L\circ L\circ
L\circ L$ to emit $(l_4, l_3, l_2, l_1)$.&lt;/li>
&lt;li>Return words $(w_4, w_3, w_2, w_1)$ associated with the
permutations $(l_4, l_3, l_2, l_1)$.&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>(Invariant: The length of the solutions will be at most $20$.)&lt;/li>
&lt;/ul>
&lt;p>The algorithm may terminate upon finding just the first solution, or
it may be run continuously to find all solutions. If we do the latter,
we are guaranteed to find all optimal solutions.&lt;/p>
&lt;p>Amazingly, this algorithm really works, and answers our blog post
question in the affirmative: &lt;em>yes, the Rubik&amp;rsquo;s Cube can be
brute-forced&lt;/em>.&lt;/p>
&lt;h2 id="example-and-source-code">Example and source code&lt;/h2>
&lt;p>This algorithm is implemented in Common Lisp, in my computational
group theory package
&lt;a href="https://github.com/stylewarning/cl-permutation">CL-PERMUTATION&lt;/a>. CL-PERMUTATION
already has built in support for Rubik&amp;rsquo;s Cubes as permutation
groups. Starting a new Common Lisp session, we have the following:&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (ql:quickload '(:cl-permutation :cl-permutation-examples))
&amp;gt; (in-package :cl-permutation)
&amp;gt; (group-order (perm-examples:make-rubik-3x3))
43252003274489856000
&amp;gt; (format t &amp;quot;~R&amp;quot; *)
forty-three quintillion two hundred fifty-two quadrillion three trillion
two hundred seventy-four billion four hundred eighty-nine million
eight hundred fifty-six thousand
NIL
&lt;/code>&lt;/pre>&lt;p>The built-in Rubik&amp;rsquo;s Cube model only uses $\{F, R, U, B, L, D\}$, so
we make new generators corresponding to $\bar C$.&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (defvar *c (loop :with cube := (perm-examples:make-rubik-3x3)
:for g :in (perm-group.generators cube)
:collect (perm-expt g 1)
:collect (perm-expt g 2)
:collect (perm-expt g 3)))
*C
&amp;gt; (length *c)
18
&lt;/code>&lt;/pre>&lt;p>Now we construct $\bar C^5 \cup \bar C^4 \cup \cdots \cup \bar C^0$.&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (defvar *c5 (generate-words-of-bounded-length *c 5))
*C5
&amp;gt; (perm-tree-num-elements *c5)
621649
&lt;/code>&lt;/pre>&lt;p>Note that this constructs a &lt;code>perm-tree&lt;/code> object, which automatically
stores the words associated with each permutation generated.&lt;/p>
&lt;p>Now let&amp;rsquo;s generate a random element of the cube group.&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (defvar *s (random-group-element (perm-examples:make-rubik-3x3)))
*S
&amp;gt; *s
#&amp;lt;PERM 43 44 41 20 47 11 28 9 24 13 17 42 36 40 37 25 6 21 1 29 7 19 10 3 35 39 22 18 34 33 31 48 16 15 30 2 23 32 26 46 8 4 27 12 45 14 5 38&amp;gt;
&lt;/code>&lt;/pre>&lt;p>Lastly, we run the 4-list algorithm and wait.&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (decompose-by-4-list *s *c5 *c5 *c5 *c5 :verbose t)
10,000,000: 52 sec @ 192,553 perms/sec; .0013% complete, eta 1114 hours 58 minutes
20,000,000: 48 sec @ 206,858 perms/sec; .0026% complete, eta 1037 hours 51 minutes
Evaluation took:
145.094 seconds of real time
145.097120 seconds of total run time (144.961382 user, 0.135738 system)
[ Run times consist of 2.405 seconds GC time, and 142.693 seconds non-GC time. ]
100.00% CPU
421,375,385,955 processor cycles
11,681,934,352 bytes consed
((8 11 14 2 4)
(1 16 9 15 1)
(7 6 18 8 15)
(9 13 16 15 8))
&lt;/code>&lt;/pre>&lt;p>We are pretty lucky this one ended in a mere 2 minutes 25 seconds! It
usually isn&amp;rsquo;t so prompt with an answer.&lt;/p>
&lt;p>The results are printed as four words: our $l_4$, $l_3$, $l_2$, and
$l_1$. Each integer $n$ represents the 1-indexed $n$th permutation of
$\bar C$ (ordered by how it was constructed). We can create a more
traditional notation:&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (defvar *solution (reduce #'append *))
*SOLUTION
&amp;gt; (defun notation (ws)
(dolist (w (reverse ws))
(multiple-value-bind (move order)
(floor (1- w) 3)
(format t &amp;quot;~C~[~;2~;'~] &amp;quot;
(aref &amp;quot;FRUBLD&amp;quot; move)
order))))
NOTATION
&amp;gt; (notation *solution)
U2 L' D L U' L' U2 D' R' U F L' U' D F R F2 L2 B2 U2
&lt;/code>&lt;/pre>&lt;p>How do we know if this is correct? We need to check that the
composition of this word equals our random element, which we do by
composing the word (using something CL-PERMUTATION calls a &amp;ldquo;free-group
homomorphism&amp;rdquo;), inverting the permutation, and composing it with our
scramble to see that it brings us to an identity permutation.&lt;/p>
&lt;pre tabindex="0">&lt;code>&amp;gt; (defvar *hom (free-group-&amp;gt;perm-group-homomorphism
(make-free-group 18)
(generate-perm-group *c)))
*HOM
&amp;gt; (perm-compose (perm-inverse (funcall *hom *solution)) *s)
#&amp;lt;PERM 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48&amp;gt;
&lt;/code>&lt;/pre>&lt;p>Indeed, we found a reconstruction of our cube.&lt;/p>
&lt;h2 id="tips-for-optimizing-the-4-list-algorithm">Tips for optimizing the 4-List Algorithm&lt;/h2>
&lt;p>One of the most troubling aspects of implementing this algorithm is
making it fast enough. My initial implementation worked at a whopping
200 permutations per second. That&amp;rsquo;s incredibly slow, and meant that it
would take well over a century (in the worst case) for my program to
finish. Now, it works at about 190,000 permutations per second, with
an estimated worst-case search time of 2 months. (I haven&amp;rsquo;t
encountered a scrambled cube position which has taken more than 10
hours.)&lt;/p>
&lt;p>Here are some ways I sped things up.&lt;/p>
&lt;ol>
&lt;li>Be economical with memory. When doing exploratory programming, it&amp;rsquo;s
desirable to tag and store everything, but each of those stores
and accesses takes time.&lt;/li>
&lt;li>&lt;em>Don&amp;rsquo;t&lt;/em> use actual arrays in the permutation trie. When I did that,
I ran out of memory. I instead opted for a sparse representation
using an &amp;ldquo;a-list&amp;rdquo; (that is, a linked list of &lt;code>(index, value)&lt;/code>
pairs).&lt;/li>
&lt;li>Make the permutation handling fast, like composition, equality
testing, and lexicographic ordering. I was originally using generic
arithmetic and 64-bits to represent each permutation element, and
it degraded speed.&lt;/li>
&lt;li>Use a good priority queue implementation. You&amp;rsquo;ll be pushing and
popping hundreds of millions of elements.&lt;/li>
&lt;li>Do some analysis and compress the permutation trie
representation. Most nodes of the trie will only contain one
value. If that&amp;rsquo;s the case, just store the permutation (and
whatever value is associated with it) at the shallowest depth
instead. This will save a lot of time by avoiding a lot of needless
(permuted) recursion.&lt;/li>
&lt;/ol>
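&lt;p>The last tip can be sketched briefly. In this illustrative Python fragment (dict nodes, all stored permutations assumed distinct), a subtree that would hold a single permutation is collapsed into a leaf at the shallowest depth that distinguishes it:&lt;/p>

```python
def compressed_insert(node, perm, value, depth=0):
    """Path-compressed insert: a subtree that would hold exactly one
    permutation is stored as a (perm, value) leaf at the shallowest
    distinguishing depth. Assumes all inserted perms are distinct."""
    key = perm[depth]
    if key not in node:
        node[key] = (perm, value)               # singleton: stop early
    elif isinstance(node[key], dict):
        compressed_insert(node[key], perm, value, depth + 1)
    else:                                       # split an existing leaf
        other_perm, other_value = node[key]
        node[key] = {}
        compressed_insert(node[key], other_perm, other_value, depth + 1)
        compressed_insert(node[key], perm, value, depth + 1)
```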
&lt;p>If you have other tips for speeding up the algorithm, please email me!&lt;/p>
&lt;h2 id="sample-benchmarks">Sample benchmarks&lt;/h2>
&lt;p>In the following, we only consider the problem of solving the Rubik&amp;rsquo;s
Cube using the 4-list algorithm, assuming a solution length of 20
moves.&lt;/p>
&lt;p>My computer is a ThinkPad 25th Anniversary Edition. It has an Intel
Core i7-7500U processor at 2.70 GHz, but boosting to 3.50 GHz. It has
32 GiB of RAM, but comfortably runs the solver with around 3&amp;ndash;4 GiB.&lt;/p>
&lt;p>The algorithm as implemented is able to check around 190,000 elements
per second.&lt;/p>
&lt;p>Generating the move lists and pre-processing is a relatively fixed
cost. The lists can be generated once, but the preprocessing (i.e.,
composing the scramble with one of the lists) needs to happen for each
solve. In my implementation, the initialization cost is consistently 9
seconds.&lt;/p>
&lt;p>After initialization, the search is conducted. The run time varies
wildly, anywhere from seconds to hours.&lt;/p>
&lt;ul>
&lt;li>64 s, 188 billion CPU cycles, 4 GiB of allocation&lt;/li>
&lt;li>165 s, 480 billion CPU cycles, 12 GiB of allocation&lt;/li>
&lt;li>2210 s, 6 trillion CPU cycles, 162 GiB of allocation&lt;/li>
&lt;li>4613 s, 13 trillion CPU cycles, 356 GiB of allocation&lt;/li>
&lt;li>24010 s, 70 trillion CPU cycles, 2 TiB of allocation&lt;/li>
&lt;/ul>
&lt;p>These are randomly sampled Rubik&amp;rsquo;s Cube scrambles, sorted by time.&lt;/p>
&lt;p>In principle, with the current level of optimization, the algorithm
can take as much as 2 months to finish. I&amp;rsquo;m confident that my
implementation&amp;rsquo;s run time can be brought down by a factor of 2; I&amp;rsquo;m
less confident it can easily be brought down by a factor of
50&amp;mdash;but it wouldn&amp;rsquo;t surprise me either way.&lt;/p>
&lt;p>One interesting thing about this algorithm is that it seems to return
very, very quickly if the solution is 10 or fewer moves. Why? I
haven&amp;rsquo;t done a careful analysis, but I believe it is essentially
because the solution will be in $L_2\circ L_1$. The permutations $l_3$
and $l_4$ will be identity, which reduces to the problem of just
finding $s\in L_2\circ L_1$.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>&amp;ldquo;Meet in the middle&amp;rdquo; algorithms are old and well understood. When we
can&amp;rsquo;t brute-force an entire space, we can try splitting it in two and
combining the halves. That&amp;rsquo;s of course the spirit of the 4-List
Algorithm, but the devil is always in the details, and I hope this
blog post showed how many disparate facts needed to come together to
realize the algorithm.&lt;/p>
&lt;p>I think the algorithm communicated by Shamir and his colleagues is
remarkable but largely forgotten. While better algorithms exist for the
specific task of solving the Rubik&amp;rsquo;s Cube, the generality of the
4-List Algorithm ought not to be understated.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;ol>
&lt;li>A. Fiat, S. Moses, A. Shamir, I. Shimshoni and G. Tardos, &amp;ldquo;Planning and learning in permutation groups,&amp;rdquo; 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, NC, USA, 1989, pp. 274&amp;ndash;279, doi: 10.1109/SFCS.1989.63490. (&lt;a href="https://ieeexplore.ieee.org/document/63490">Link&lt;/a>)&lt;/li>
&lt;li>A. Bawden. &amp;ldquo;Shamir&amp;rsquo;s talk really was about how to solve the cube!&amp;rdquo;. Alan Bawden. From the &lt;em>Cube Lovers&lt;/em> mailing list. 27 May 1987. (&lt;a href="http://www.math.rwth-aachen.de/~Martin.Schoenert/Cube-Lovers/Alan_Bawden__Shamir%27s_talk_really_was_about_how_to_solve_the_cube!.html">Link&lt;/a>)&lt;/li>
&lt;/ol>
&lt;section class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1" role="doc-endnote">
&lt;p>&lt;em>The&lt;/em> Rubik&amp;rsquo;s Cube? Why not just &amp;ldquo;Rubik&amp;rsquo;s Cube&amp;rdquo;?!&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:2" role="doc-endnote">
&lt;p>&lt;em>Iterative-deepening depth-first search&lt;/em> (IDDFS) is an interesting hybrid between breadth-first and depth-first search. Breadth-first search (BFS) can find an optimal path to a target, but requires lots of memory to keep track of nodes that have been seen. Depth-first search (DFS) uses almost no memory, but can&amp;rsquo;t guarantee finding the shortest path. IDDFS is an algorithm which tries DFS up to a maximum depth of 1, then of 2, then of 3, etc. until a path to the target is found. While we re-visit nodes in each successive increase in the maximum depth, the savings in memory and the guarantee of finding the shortest path usually make it worth it.&amp;#160;&lt;a href="#fnref:2" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:3" role="doc-endnote">
&lt;p>A heuristic might be something like this. First, suppose we&amp;rsquo;ve built a table which maps every &lt;em>corner&lt;/em> configuration (ignoring edges) to the number of moves needed to solve it. This problem can be brute-forced, as there are &amp;ldquo;only&amp;rdquo; $8!\cdot 3^7=88,179,840$ corner configurations. Suppose we are doing IDDFS to solve a whole Rubik&amp;rsquo;s Cube, and the algorithm is currently at a depth limit of 10. During our DFS (with a limited depth), we arrive at a position at depth 7, and want to decide if we shall continue with it. We can consult our corner configuration table: If we would require more than 3 moves to solve just the corners, then there&amp;rsquo;s no hope in continuing, since we&amp;rsquo;ll exceed our depth limit of 10. So we drop the line of search on this configuration entirely by returning from the depth-7 recursive call empty-handed.&amp;#160;&lt;a href="#fnref:3" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:4" role="doc-endnote">
&lt;p>The centers are typically seen as immobile, and hence aren&amp;rsquo;t numbered.&amp;#160;&lt;a href="#fnref:4" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:5" role="doc-endnote">
&lt;p>Shamir, isn&amp;rsquo;t that name familiar? Yes, he&amp;rsquo;s the &amp;lsquo;S&amp;rsquo; from &amp;ldquo;RSA&amp;rdquo;, the encryption algorithm for which he and colleagues &amp;lsquo;R&amp;rsquo; Rivest and &amp;lsquo;A&amp;rsquo; Adleman won a Turing award.&amp;#160;&lt;a href="#fnref:5" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:6" role="doc-endnote">
&lt;p>Formally, any subset of five elements of $C$ generates the Rubik&amp;rsquo;s Cube group.&amp;#160;&lt;a href="#fnref:6" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:7" role="doc-endnote">
&lt;p>Rubik&amp;rsquo;s Cube enthusiasts have names for these concepts. If we measure the length of a move sequence by the number of quarter turns, we say we are measuring in the &lt;em>quarter-turn metric&lt;/em> or &lt;em>QTM&lt;/em>. If instead we are measuring the length of a move sequence by the number of face turns of any degree, we say we are measuring in the &lt;em>half-turn metric&lt;/em> or &lt;em>HTM&lt;/em>.&amp;#160;&lt;a href="#fnref:7" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:8" role="doc-endnote">
&lt;p>Speedsolvers like to write words in last-to-first order, so they can read off the moves as they&amp;rsquo;re applied.&amp;#160;&lt;a href="#fnref:8" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:9" role="doc-endnote">
&lt;p>A note on the phrase &amp;ldquo;print it&amp;rdquo;. We use the term &amp;ldquo;print it&amp;rdquo; to signify that the permutation has been constructed and it may be consumed. We might not literally &lt;em>print it&lt;/em>, and instead &lt;em>emit it&lt;/em> for use. What this means precisely depends on the programming language you&amp;rsquo;re using. In our final algorithm, we&amp;rsquo;ll actually need to explicitly construct generators, so keep that in mind.&amp;#160;&lt;a href="#fnref:9" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:10" role="doc-endnote">
&lt;p>Again, as in the other footnote, we can see &amp;ldquo;walking&amp;rdquo; or &amp;ldquo;printing&amp;rdquo; or &amp;hellip; as again a manifestation of a process of generating something one-by-one.&amp;#160;&lt;a href="#fnref:10" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;li id="fn:11" role="doc-endnote">
&lt;p>Construct generators?! This is the third footnote dedicated to walking/printing/generating, because it&amp;rsquo;s important and sometimes difficult. Making a generator may be utterly trivial in your language (Scheme with &lt;code>call/cc&lt;/code> or Python with &lt;code>yield&lt;/code>), cumbersome (Common Lisp with &lt;code>cl-cont&lt;/code>), or downright annoying. One trick we used when implementing the algorithm in Common Lisp is to keep track of where we are in the permutation trie by a permutation itself. We can always go to the next one if we can find the current one.&amp;#160;&lt;a href="#fnref:11" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/section></description></item><item><title>A software engineer's circuitous journey to calculate eigenvalues</title><link>http://www.stylewarning.com/posts/eigenvalues/</link><pubDate>Wed, 10 Aug 2022 00:00:00 +0000</pubDate><guid>http://www.stylewarning.com/posts/eigenvalues/</guid><description>&lt;p>&lt;em>Or, how to calculate the eigenvalues and eigenvectors of a complex matrix using a routine that only works on real matrices.&lt;/em>&lt;/p>
&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;div>
&lt;hr>
&lt;h2>Contents&lt;/h2>
&lt;nav id="TableOfContents">
&lt;ol>
&lt;li>&lt;a href="#why">Why?&lt;/a>&lt;/li>
&lt;li>&lt;a href="#complex-numbers-as-matrices">Complex numbers as matrices&lt;/a>&lt;/li>
&lt;li>&lt;a href="#complex-matrices-as-real-matrices">Complex matrices as real matrices&lt;/a>&lt;/li>
&lt;li>&lt;a href="#some-experimentation">Some experimentation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#proving-the-conjecture">Proving the conjecture&lt;/a>&lt;/li>
&lt;li>&lt;a href="#revisiting-the-computation">Revisiting the computation&lt;/a>&lt;/li>
&lt;li>&lt;a href="#the-final-pseudocode">The final pseudocode&lt;/a>&lt;/li>
&lt;li>&lt;a href="#much-ado-about-nothing">Much ado about nothing?&lt;/a>&lt;/li>
&lt;/ol>
&lt;/nav>
&lt;hr>
&lt;/div>
&lt;p>If we have a complex matrix, how do we calculate its eigenvalues and
eigenvectors using a procedure that can only work on real matrices?
This post recounts my journey toward solving that problem, and where
the problem came from in the first place.&lt;/p>
&lt;p>&lt;strong>TL;DR:&lt;/strong> If you came here just looking for an algorithm, scroll to
the bottom for a pseudocode listing.&lt;/p>
&lt;h2 id="why">Why?&lt;/h2>
&lt;p>&lt;a href="https://github.com/quil-lang/magicl">MAGICL&lt;/a> is a Common Lisp library
for doing matrix arithmetic. To make a long story short, there&amp;rsquo;s a
desire to reduce MAGICL&amp;rsquo;s dependence on foreign libraries (e.g.,
LAPACK), and instead use pure Common Lisp routines. Except,
implementing numerical linear algebra is difficult, and the MAGICL
maintainers usually have more important things to work on. So, instead
of writing routines from scratch via textbooks, we sometimes resort to
mechanically translating an old distribution of LAPACK, written in
FORTRAN 77, into Common Lisp. Due to the age of the routines, I
personally think it&amp;rsquo;s prudent to minimize their usage.&lt;/p>
&lt;p>One routine that&amp;rsquo;s generally reliable&amp;mdash;as both FORTRAN 77 code as
well as its mechanically translated Common Lisp counterpart&amp;mdash;is
&lt;code>DGEEV&lt;/code>, a LAPACK routine to compute eigenvalues and eigenvectors of a
general real matrix of double-precision floats. (We&amp;rsquo;ll call a set of
eigenvectors and the corresponding eigenvalues the &lt;strong>eigensystem&lt;/strong>.)
This routine is made nice in MAGICL and exposed as the Lisp function
&lt;code>MAGICL:EIG&lt;/code>.&lt;/p>
&lt;p>The &lt;code>MAGICL:EIG&lt;/code> function, however, is required to be able to work
with both real and complex matrices, yet &lt;code>DGEEV&lt;/code> only works with
reals. So, we are left with two reasonable options:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Get mechanical translation of complex-matrix BLAS and LAPACK
functions working so that we can call the complex-matrix counterpart
of &lt;code>DGEEV&lt;/code> called &lt;code>ZGEEV&lt;/code>, or&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Figure out how to only use &lt;code>DGEEV&lt;/code> to somehow compute the
eigensystem of a complex matrix.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Because I like puzzles, and because I&amp;rsquo;m loath to add to the existing
virtually unmaintainable pile of 50,000 lines of mechanically
translated code, I opted for the latter option.&lt;/p>
&lt;h2 id="complex-numbers-as-matrices">Complex numbers as matrices&lt;/h2>
&lt;p>Complex numbers look like pairs of real numbers that have funny rules
for addition and multiplication. More precisely, complex numbers form
a two-dimensional real vector space, and multiplication is in fact an
$\mathbb{R}$-linear map. As such, if we specify a basis, we can write
a matrix. Consider the $\mathbb{R}$-basis $\hat e_0:=1$ and $\hat
e_1:=i$ as well as the multiplication map $z\mapsto (a+bi)z$. Applying
this map to our basis gives&lt;/p>
&lt;p>$$
\begin{align*}
(a+bi)\hat e_0 &amp;amp;= (a+bi)\cdot 1 &amp;amp; (a+bi)\hat e_1 &amp;amp;= (a+bi)\cdot i \\
&amp;amp;= a+bi &amp;amp; &amp;amp;= -b+ai \\
&amp;amp;= a\hat e_0+b\hat e_1 &amp;amp; &amp;amp;= -b\hat e_0 + a\hat e_1
\end{align*}
$$&lt;/p>
&lt;p>We thus immediately conclude&lt;/p>
&lt;p>$$
(a+bi)
\begin{pmatrix}
\hat e_0 \\
\hat e_1
\end{pmatrix}
=
\begin{pmatrix}
a &amp;amp; b \\
-b &amp;amp; a
\end{pmatrix}
\begin{pmatrix}
\hat e_0 \\
\hat e_1
\end{pmatrix}.
$$&lt;/p>
&lt;p>This is to say that the multiplication map $z\mapsto (a+bi)z$ can be
represented by the matrix&lt;/p>
&lt;p>$$
\begin{pmatrix}
a &amp;amp; b \\
-b &amp;amp; a
\end{pmatrix}.
$$&lt;/p>
&lt;p>(This representation is not unique. We could also exchange the $b$ and
$-b$ for a different representation, as many texts do. This is because
complex conjugation&amp;mdash;which this exchange of off-diagonal elements
represents&amp;mdash;is an isomorphism on $\mathbb{C}$.)&lt;/p>
&lt;p>One can verify that this matrix can be both added and multiplied, and
it works as expected.&lt;/p>
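&lt;p>A quick way to convince yourself (a small Python sketch of my own, not part of MAGICL): represent $a+bi$ as the $2\times 2$ matrix above and check that matrix addition and multiplication agree with complex addition and multiplication.&lt;/p>

```python
def to_mat(z):
    """Represent a complex number a+bi as the real matrix [[a, b], [-b, a]]."""
    a, b = z.real, z.imag
    return [[a, b], [-b, a]]

def mat_mul(m, n):
    # Plain 2x2 matrix product.
    return [[sum(m[r][k] * n[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def mat_add(m, n):
    return [[m[r][c] + n[r][c] for c in range(2)] for r in range(2)]

z, w = 1 + 2j, 3 - 4j
# The representation respects both operations.
assert mat_mul(to_mat(z), to_mat(w)) == to_mat(z * w)
assert mat_add(to_mat(z), to_mat(w)) == to_mat(z + w)
```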
&lt;h2 id="complex-matrices-as-real-matrices">Complex matrices as real matrices&lt;/h2>
&lt;p>We have given a way to identify elements of $\mathbb{C}$ as elements
of $\mathbb{R}^{2\times 2}$, the set of $2\times 2$ real
matrices. This readily gives us a representation for
$\mathbb{C}^{n\times n}$ matrices: If we have a matrix
$U\in\mathbb{C}^{n\times n}$, produce a matrix
$V\in\mathbb{R}^{2n\times 2n}$ by replacing each $U_{r,c}$ with our
real-matrix representation:&lt;/p>
&lt;p>$$
\begin{pmatrix}
V_{2r,2c} &amp;amp; V_{2r,2c+1} \\
V_{2r+1,2c} &amp;amp; V_{2r+1,2c+1}
\end{pmatrix} :=
\begin{pmatrix}
\Re U_{r,c} &amp;amp; \Im U_{r,c} \\
-\Im U_{r,c} &amp;amp; \Re U_{r,c}
\end{pmatrix}.
$$&lt;/p>
&lt;p>For example, this is a transformation from an element of
$\mathbb{C}^{2\times 2}$ to an element of $\mathbb{R}^{4\times 4}$.&lt;/p>
&lt;p>$$
\begin{pmatrix}
1+2i &amp;amp; 3-4i \\
5-6i &amp;amp; -7+8i
\end{pmatrix} \mapsto
\begin{pmatrix}
1 &amp;amp; 2 &amp;amp; 3 &amp;amp; -4 \\
-2 &amp;amp; 1 &amp;amp; 4 &amp;amp; 3 \\
5 &amp;amp; -6 &amp;amp; -7 &amp;amp; 8 \\
6 &amp;amp; 5 &amp;amp; -8 &amp;amp; -7
\end{pmatrix}.
$$&lt;/p>
&lt;p>Due to how matrix arithmetic works with block matrices (which these
real matrices essentially are), we at least get ordinary addition and
multiplication in this representation.&lt;/p>
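&lt;p>As a sketch (hypothetical Python with hand-rolled helpers, not MAGICL&amp;rsquo;s code), here is the embedding, along with a check that it reproduces the worked $4\times 4$ example above and that it commutes with matrix multiplication:&lt;/p>

```python
def embed(U):
    """Embed a complex n x n matrix into R^(2n x 2n) by replacing each
    entry a+bi with the 2x2 block [[a, b], [-b, a]]."""
    n = len(U)
    V = [[0.0] * (2 * n) for _ in range(2 * n)]
    for r in range(n):
        for c in range(n):
            a, b = U[r][c].real, U[r][c].imag
            V[2*r][2*c],   V[2*r][2*c+1]   = a, b
            V[2*r+1][2*c], V[2*r+1][2*c+1] = -b, a
    return V

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

U = [[1+2j, 3-4j],
     [5-6j, -7+8j]]

# Matches the worked example in the text.
assert embed(U) == [[ 1,  2,  3, -4],
                    [-2,  1,  4,  3],
                    [ 5, -6, -7,  8],
                    [ 6,  5, -8, -7]]

# The embedding respects multiplication (and, entrywise, addition).
assert matmul(embed(U), embed(U)) == embed(matmul(U, U))
```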
&lt;p>Does this mean we can now just apply any linear algebra routine to
these real matrices and get the same answer as we would for complex
matrices? Unfortunately not, and eigensystems are a perfect
example. How many eigenvalues (at most) does an operator on a
$d$-dimensional vector space have (over a field of characteristic
zero)? Well, $d$, since the characteristic polynomial has degree $d$
and hence at most $d$ roots. So just by counting, the $2n\times 2n$
real matrix won&amp;rsquo;t have the same number of eigenvalues as the
$n\times n$ complex matrix, and thus the results of linear algebra
routines&amp;mdash;at least one that computes eigensystems&amp;mdash;can&amp;rsquo;t be
the same.&lt;/p>
&lt;h2 id="some-experimentation">Some experimentation&lt;/h2>
&lt;p>I was experimenting with computing eigensystems of complex matrices
and their corresponding real variants, and noticed a pattern. If the
complex matrix had a real eigenvalue, it would also show up as an
eigenvalue (of double multiplicity) of the real matrix. If the complex
matrix had a complex eigenvalue, then it would show up as an
eigenvalue of the real matrix, along with its complex conjugate.&lt;/p>
&lt;p>This led me to the following conjecture:&lt;/p>
&lt;p>&lt;strong>Conjecture&lt;/strong>: Let $U\in\mathbb{C}^{n\times n}$, and let
$V\in\mathbb{R}^{2n\times 2n}$ be the real matrix corresponding to $U$
according to the aforementioned transformation. If $a+bi$ is an
eigenvalue of $U$, then $a\pm bi$ are two eigenvalues of $V$.&lt;/p>
&lt;p>With this conjecture and my chin up, I could implement a routine to
compute eigenvalues of $U$ using just a real eigenvalue algorithm. The
way I did it was to write a procedure that finds the true conjugates
amongst the complete set, roughly as follows.&lt;/p>
&lt;p>First, let $E$ be our multiset of eigenvalues of $V$ (the &lt;em>real&lt;/em>
matrix), but delete duplicate real values (they&amp;rsquo;ll show up in pairs),
and delete complex eigenvalues that have a negative imaginary part
(there will always be a corresponding conjugate).&lt;/p>
&lt;p>Second, recall that $\operatorname{Tr} U$ is the sum of the
eigenvalues of $U$. This will be a complex number whose real part is
simply recovered by summing the real parts of the eigenvalues:&lt;/p>
&lt;p>$$
\Re (\operatorname{Tr} U) = \sum_{e\in E} \Re e.
$$&lt;/p>
&lt;p>This fact isn&amp;rsquo;t computationally useless: if the conjecture is true,
there is no ambiguity in the real parts, so it can be verified
immediately in code as a sanity check.&lt;/p>
&lt;p>The imaginary part is a little trickier, since there is ambiguity
stemming from uncertainty around which conjugate is actually an
eigenvalue of $U$. As such, there must be a sequence of $\vert E\vert$
signs $s_{\bullet}\in\{-1,+1\}$ such that&lt;/p>
&lt;p>$$
\Im (\operatorname{Tr} U) = \sum_{k=0}^{\vert E\vert-1} s_k \Im e_k.
$$&lt;/p>
&lt;p>(The ordering of $e_{\bullet}$ doesn&amp;rsquo;t matter; any will do.)&lt;/p>
&lt;p>Though asymptotically inefficient, the values for $s_k$ can be solved
for by brute force: keep trying until you find the set that works. As
it turns out, there&amp;rsquo;s not a &lt;em>lot&lt;/em> better you can do, since the
subset-sum problem can be reduced to this sequence-of-signs
problem.&lt;/p>
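&lt;p>In code, the brute-force search is only a few lines (a hypothetical Python sketch; the names are mine). It tries every sign assignment until the imaginary parts sum to $\Im(\operatorname{Tr} U)$:&lt;/p>

```python
from itertools import product

def find_signs(imag_parts, target, tol=1e-9):
    """Brute-force the signs s_k in {-1, +1} so that
    sum(s_k * imag_parts[k]) hits `target`. Exponential time, as expected
    for a problem that subset-sum reduces to."""
    for signs in product((+1, -1), repeat=len(imag_parts)):
        if abs(sum(s * im for s, im in zip(signs, imag_parts)) - target) <= tol:
            return signs
    return None  # no assignment found: conjecture violated or numerics off

# If Im(Tr U) = -3 and the candidate imaginary parts are 2 and 5,
# the only consistent choice of signs is +2 - 5.
assert find_signs([2.0, 5.0], -3.0) == (1, -1)
```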
&lt;p>While perhaps a neat conjecture, this wasn&amp;rsquo;t very satisfying to me. I
simultaneously hadn&amp;rsquo;t fully solved what I set out to solve (finding
the eigen&lt;em>system&lt;/em>, not just the eigen&lt;em>values&lt;/em>), and I had a nagging
conjecture that exposed my lack of knowledge about the problem.&lt;/p>
&lt;h2 id="proving-the-conjecture">Proving the conjecture&lt;/h2>
&lt;p>Ultimately, I had to go back to basics, ask my friends and family for
help, and try to break through. I figured that while the conjecture
wasn&amp;rsquo;t ultimately very computationally useful on its own, if I could
prove it, maybe it would give me enough insight mathematically and
computationally to come up with something better.&lt;/p>
&lt;p>After some trial and error, I settled on trying to use the following
fact.&lt;/p>
&lt;p>&lt;strong>Fact&lt;/strong>: Let $A'$ and $A''$ be square matrices, and let $A :=
A'\oplus A''$. Then the eigenvalues of $A$ will be the union of the
eigenvalues of each $A'$ and $A''$.&lt;/p>
&lt;p>This is readily seen by computing the characteristic polynomial. Let
$\mathbb{C}[\lambda]$ be our polynomial ring. Then the characteristic
polynomial of $A$ equals the product of the polynomials of $A'$ and
$A''$:&lt;/p>
&lt;p>$$
\det (A - \lambda I_{\dim A}) =
\det (A' - \lambda I_{\dim A'}) \det (A'' - \lambda I_{\dim A''})
$$&lt;/p>
&lt;p>This ended up being a crucial insight, as we&amp;rsquo;ll see.&lt;/p>
&lt;p>Another fact I needed was the following. As a matter of notation, let
$\bar A$ denote the complex conjugate of each entry of $A$.&lt;/p>
&lt;p>&lt;strong>Fact&lt;/strong>: Let $A$ be a complex matrix. If $a+bi$ is an eigenvalue of
$A$, then its conjugate $a-bi$ is an eigenvalue of $\bar A$.&lt;/p>
&lt;p>This is seen by, again, looking at the characteristic polynomial and
using the properties of complex conjugation:&lt;/p>
&lt;p>$$
\begin{align*}
\det (\bar A - \lambda I_{\dim A})
&amp;amp;= \det (\bar A - \overline{\bar\lambda I_{\dim A}}) \\
&amp;amp;= \det (\overline{A - \bar\lambda I_{\dim A}}) \\
&amp;amp;= \overline{\det (A - \bar\lambda I_{\dim A})}.
\end{align*}
$$&lt;/p>
&lt;p>These facts were enough for me to refine the conjecture:&lt;/p>
&lt;p>&lt;strong>Conjecture (redux)&lt;/strong>: Let $U\in\mathbb{C}^{n\times n}$, and let
$V\in\mathbb{R}^{2n\times 2n}$ be the real matrix corresponding to $U$
according to the aforementioned transformation. Then $V$ is similar to
$U\oplus \bar U$ when $V$ is trivially interpreted as a real matrix in
$\mathbb{C}^{2n\times 2n}$.&lt;/p>
&lt;p>This new conjecture is equivalent to the old one by way of those two
facts.&lt;/p>
&lt;p>Now, things started to look good. If I could find a similarity
transform in $\mathbb{C}^{2n\times 2n}$ that block-diagonalizes $V$,
and show that such a diagonalization is exactly $U\oplus \bar U$, then
I&amp;rsquo;d be golden.&lt;/p>
&lt;p>Since we are &amp;ldquo;allowed&amp;rdquo; to work over $\mathbb{C}$, the first step was
to actually &lt;em>undo&lt;/em> the complex-to-real transformation. However, since
we are building a similarity transform, we need it to be invertible.&lt;/p>
&lt;p>Again, after trial and error, I found that&lt;/p>
&lt;p>$$
\left[
\frac{1}{\sqrt{2}}
\begin{pmatrix}
1 &amp;amp; -i \\
1 &amp;amp; i \\
\end{pmatrix}
\right]
\begin{pmatrix}
a &amp;amp; b \\
-b &amp;amp; a
\end{pmatrix}
\left[
\frac{1}{\sqrt{2}}
\begin{pmatrix}
1 &amp;amp; -i \\
1 &amp;amp; i \\
\end{pmatrix}
\right]^{-1} =
\begin{pmatrix}
a+bi &amp;amp; 0 \\
0 &amp;amp; a-bi
\end{pmatrix}.
$$&lt;/p>
&lt;p>Not until I constructed this matrix, let&amp;rsquo;s call it
&lt;p>$$
K :=
\frac{1}{\sqrt{2}}
\begin{pmatrix}
1 &amp;amp; -i \\
1 &amp;amp; i \\
\end{pmatrix},
$$&lt;/p>
&lt;p>did I have a big &amp;ldquo;aha&amp;rdquo; moment. I was hitherto so focused on the
(wrong) idea that our complex-to-real transformation was unique or
canonical. I &amp;ldquo;knew&amp;rdquo; that we could choose either position of $b$ or
$-b$ to represent either a number or its conjugate, but I didn&amp;rsquo;t think
deeply enough about the repercussions of that fact. With $K$, it was
apparent that our real matrix actually, in some sense, holds &lt;em>both&lt;/em> a
complex number &lt;em>and&lt;/em> its conjugate.&lt;/p>
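&lt;p>Numerically, the conjugation identity above is easy to spot-check (a small Python sketch with my own helper names):&lt;/p>

```python
import math

S = 1 / math.sqrt(2)
K    = [[S,    -S*1j], [S,     S*1j]]   # K
Kinv = [[S,     S   ], [S*1j, -S*1j]]   # K^{-1}

def mul2(M, N):
    # Plain 2x2 matrix product over complex numbers.
    return [[sum(M[r][k] * N[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

a, b = 3.0, -5.0
M = [[a, b], [-b, a]]          # real representation of a+bi
R = mul2(mul2(K, M), Kinv)     # K M K^{-1}

assert abs(R[0][0] - (a + b*1j)) < 1e-12   # a+bi in the top-left
assert abs(R[1][1] - (a - b*1j)) < 1e-12   # its conjugate in the bottom-right
assert abs(R[0][1]) < 1e-12 and abs(R[1][0]) < 1e-12
```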
&lt;p>&lt;em>At this point, it was obvious what to do. This was the critical
insight.&lt;/em>&lt;/p>
&lt;p>Our matrix $K$ only works for $2\times 2$ matrices in
$\mathbb{R}^{2\times 2}\subset \mathbb{C}^{2\times 2}$. We can
extend it by using this little rule of linear algebra. If&lt;/p>
&lt;p>$$
X := \begin{pmatrix}
X_{0,0} &amp;amp; X_{0,1} &amp;amp; \cdots &amp;amp; X_{0,c-1} \\
X_{1,0} &amp;amp; X_{1,1} &amp;amp; &amp;amp; \\
\vdots &amp;amp; &amp;amp; \ddots &amp;amp; \vdots \\
X_{r-1,0} &amp;amp; &amp;amp; \cdots &amp;amp; X_{r-1,c-1}
\end{pmatrix}
$$&lt;/p>
&lt;p>is a &lt;em>block&lt;/em> matrix and $D := \Delta\oplus\cdots\oplus \Delta$
is a block diagonal matrix with $X_{\bullet}$ and $\Delta$ square and
having the same shape, then&lt;/p>
&lt;p>$$
(DX)_{r,c} = \Delta X_{r,c} \qquad\text{and}\qquad (XD)_{r,c} = X_{r,c}\Delta,
$$&lt;/p>
&lt;p>i.e., multiplication of these matrices results in $\Delta$ getting
&amp;ldquo;applied&amp;rdquo; to each block. As such,&lt;/p>
&lt;p>$$
K^{\oplus n} V (K^{\oplus n})^{-1}
$$&lt;/p>
&lt;p>will be a block matrix equivalent to substituting each disjoint
$2\times 2$ sub-matrix of $V$ with the matrix
$\operatorname{diag}(z,\bar z)$, where $z$ is calculated as described.&lt;/p>
&lt;p>We&amp;rsquo;re still not done. This transformed matrix will be a
checkerboard pattern of $U$-likes on even-even- and odd-odd-indexed
entries, and zeros on even-odd- and odd-even-indexed positions. The
last necessary bit then to finish our similarity transform is to
permute this matrix in such a way that all positive-signed conjugates
are in the top-left $n\times n$ sub-matrix, and all negative-signed
conjugates are in the bottom-right $n\times n$ sub-matrix. If we take
for granted that permutations are invertible, then we&amp;rsquo;re done proving
it. If we want to construct something, then we observe that all
positive-signed conjugates have even indexes, and all negative-signed
conjugates have odd indexes. Conjugation by the desired permutation
matrix $\Pi$ then acts entrywise as&lt;/p>
&lt;p>$$
(\Pi X\Pi^{-1})_{r,c} :=
\begin{cases}
X_{2r,2c} &amp;amp; \text{if }0\leq r,c &amp;lt; n\\
X_{2(r-n)+1, 2(c-n)+1} &amp;amp; \text{if }n\le r,c &amp;lt; 2n\\
X_{2r, 2(c-n)+1} &amp;amp; \text{if }0\leq r &amp;lt; n\land n \leq c &amp;lt; 2n\\
X_{2(r-n)+1, 2c} &amp;amp; \text{if }n\leq r &amp;lt; 2n\land 0 \leq c &amp;lt; n
\end{cases}
$$&lt;/p>
&lt;p>We can recover the matrix for $\Pi$ by applying the underlying row
permutation (even indexes first, then odd) to the rows of the identity
matrix.&lt;/p>
&lt;p>And with that, we have a similarity transform:&lt;/p>
&lt;p>$$
(\Pi K^{\oplus n}) V (\Pi K^{\oplus n})^{-1} = U\oplus \bar U.
$$&lt;/p>
&lt;p>Since eigenvalues are preserved under similarity, we&amp;rsquo;ve proved the
conjecture.&lt;/p>
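&lt;p>The whole similarity can also be spot-checked numerically. The following Python sketch (helper names are my own invention, not MAGICL&amp;rsquo;s) builds the permutation and the block-diagonal copies of $K$ for $n=2$, and verifies that conjugating $V$ yields $U\oplus\bar U$:&lt;/p>

```python
import math

def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(len(B)))
             for c in range(len(B[0]))] for r in range(len(A))]

def embed(U):
    """Complex n x n -> real 2n x 2n, each a+bi becoming [[a, b], [-b, a]]."""
    n = len(U)
    V = [[0.0] * (2 * n) for _ in range(2 * n)]
    for r in range(n):
        for c in range(n):
            a, b = U[r][c].real, U[r][c].imag
            V[2*r][2*c],   V[2*r][2*c+1]   = a, b
            V[2*r+1][2*c], V[2*r+1][2*c+1] = -b, a
    return V

n = 2
S = 1 / math.sqrt(2)
# Block-diagonal copies of K and K^{-1}.
KD  = [[0] * (2 * n) for _ in range(2 * n)]
KDi = [[0] * (2 * n) for _ in range(2 * n)]
for k in range(n):
    KD[2*k][2*k],   KD[2*k][2*k+1]   = S, -S*1j
    KD[2*k+1][2*k], KD[2*k+1][2*k+1] = S,  S*1j
    KDi[2*k][2*k],   KDi[2*k][2*k+1]   = S,     S
    KDi[2*k+1][2*k], KDi[2*k+1][2*k+1] = S*1j, -S*1j
# Permutation: even row indexes first, then odd row indexes.
sigma = [2*k for k in range(n)] + [2*k+1 for k in range(n)]
Pi  = [[1 if c == sigma[r] else 0 for c in range(2*n)] for r in range(2*n)]
PiT = [[Pi[c][r] for c in range(2*n)] for r in range(2*n)]

U = [[1+2j, 3-4j], [5-6j, -7+8j]]
T  = matmul(Pi, KD)       # Pi K^{(+)n}
Ti = matmul(KDi, PiT)     # its inverse (Pi is orthogonal)
W  = matmul(matmul(T, embed(U)), Ti)

# W should be U in the top-left block, conj(U) in the bottom-right, 0 elsewhere.
for r in range(n):
    for c in range(n):
        assert abs(W[r][c] - U[r][c]) < 1e-12
        assert abs(W[n+r][n+c] - U[r][c].conjugate()) < 1e-12
        assert abs(W[r][n+c]) < 1e-12 and abs(W[n+r][c]) < 1e-12
```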
&lt;h2 id="revisiting-the-computation">Revisiting the computation&lt;/h2>
&lt;p>We have proved the conjecture, but does that actually get us any
further in our quest to compute the eigensystem of a complex matrix
using an algorithm for real matrices?&lt;/p>
&lt;p>It does; we now know that the eigenvalues of $U$ are completely
contained in the set of eigenvalues of $V$. One thing we haven&amp;rsquo;t
addressed, however, is the eigenvectors.&lt;/p>
&lt;p>If we return to thinking about direct sums, then the eigenvectors of a
matrix $A := A'\oplus A''$ are going to be eigenvectors of $A'$ and
$A''$ &amp;ldquo;lifted&amp;rdquo; to the larger sum of spaces. In other words, if $x$ is
an eigenvector of $A'$, then $x\oplus \vec 0_{\dim A''}$ is an
eigenvector of $A$, where $\vec 0$ denotes a vector of zeros (i.e.,
$x$ is padded with zeros).&lt;/p>
&lt;p>As such, in our block-diagonal basis, the eigenvectors of $V$ are
related to the eigenvectors of $U$ in the following way. Suppose
$(\lambda, x)$ is an eigenvalue-eigenvector pair of $U$. Then $Ux =
\lambda x$. This directly implies that $\bar U \bar x =
\bar\lambda\bar x$. Since $V\sim U\oplus\bar U$, the vectors $x\oplus\vec 0_{n}$
and $\vec 0_n\oplus\bar x$ are eigenvectors of $U\oplus\bar U$, and
their images under the inverse similarity transform are eigenvectors
of $V$.&lt;/p>
&lt;p>All that&amp;rsquo;s left to determine is: Which eigenvector is the right one
without doing a costly similarity transform?&lt;/p>
&lt;p>To do this, we &amp;ldquo;disembed&amp;rdquo; the eigenvector from the vector space of
$V\sim U\oplus\bar U$ into the vector space of $U$ in such a way that
the $\bar U$ subspace collapses to zero. We can do this easily. Our
eigenvectors of $V$ without transformation are going to look like&lt;/p>
&lt;p>$$
\begin{pmatrix}
a+bi \\
-b+ai \\
c+di \\
-d+ci \\
\vdots
\end{pmatrix}
\qquad
\text{and}
\qquad
\begin{pmatrix}
a-bi \\
-b-ai \\
c-di \\
-d-ci \\
\vdots
\end{pmatrix},
$$&lt;/p>
&lt;p>where these correspond to eigenvalues $\lambda$ and $\bar\lambda$
respectively. One can see this by way of two facts:&lt;/p>
&lt;ul>
&lt;li>The second vector is the entry-wise conjugate of the first vector,
directly suggesting they&amp;rsquo;re each drawn from either of the
eigenvector sets of $U$ or $\bar U$, and&lt;/li>
&lt;li>each $2\times 1$ pair of entries in each vector corresponds to our
basis vectors $(\hat e_{2k}, \hat e_{2k+1}) = (1,i)$ of our ambient
vector space.&lt;/li>
&lt;/ul>
&lt;p>Also, notice the resemblance between pairs of entries in our first (&amp;ldquo;true&amp;rdquo;)
eigenvector, and our complex number representation:&lt;/p>
&lt;p>$$
\begin{pmatrix}
a &amp;amp; b\\
-b &amp;amp; a
\end{pmatrix}
\qquad
\text{and}
\qquad
\begin{pmatrix}
a+bi \\
-b+ai
\end{pmatrix}.
$$&lt;/p>
&lt;p>Taking either vector, we wish to annihilate the &amp;ldquo;wrong&amp;rdquo; one and send
the &amp;ldquo;right&amp;rdquo; one to the space of $U$. Call either eigenvector
$x\in\mathbb{C}^{2n}$ and the resulting vector
$y\in\mathbb{C}^n$. Consider the map&lt;/p>
&lt;p>$$
y_k = \frac{x_{2k} - ix_{2k+1}}{2}
$$&lt;/p>
&lt;p>for integers $0\le k &amp;lt; n$. (It is not actually necessary to divide by
$2$, since if $y$ is an eigenvector, then so is $2y$.) With this map,
the eigenvector $y$ of the conjugate ($\bar U$) space will vanish, or
it will map to, for example, $a+bi$ in the ordinary ($U$) space, as
desired. In the latter case, $y$ will be an eigenvector of $V$.&lt;/p>
&lt;p>If $y = 0$, then it is discarded along with its corresponding
eigenvalue, otherwise, the eigenvector and eigenvalue are kept, and we
are done.&lt;/p>
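&lt;p>Here is the disembedding map in (hypothetical) Python, applied to a conjugate pair of the shape shown above; the factor of $2$ is dropped, as noted:&lt;/p>

```python
def disembed(x):
    """Map x in C^{2n} to y in C^n via y_k = x_{2k} - i*x_{2k+1}
    (the division by 2 is omitted, since scaling preserves eigenvectors)."""
    n = len(x) // 2
    return [x[2*k] - 1j * x[2*k+1] for k in range(n)]

a, b = 1.0, 2.0
x_true = [a + b*1j, -b + a*1j]   # pattern (a+bi, -b+ai) from the U space
x_conj = [a - b*1j, -b - a*1j]   # its entry-wise conjugate, from the conjugate space

assert disembed(x_true) == [2 * (a + b*1j)]   # recovers (twice) a+bi
assert disembed(x_conj) == [0j]               # annihilated
```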
&lt;h2 id="the-final-pseudocode">The final pseudocode&lt;/h2>
&lt;p>In summary, our algorithm to compute the eigensystem of a complex
matrix is as follows.&lt;/p>
&lt;pre tabindex="0">&lt;code>INPUT:
n : an integer, the dimension of the problem
U : an n x n matrix of complex numbers
OUTPUT:
Lambda : a list of complex numbers, eigenvalues of U
Y : a list of complex n-vectors, eigenvectors of U
Step 1:
V : a 2n x 2n matrix of real numbers
Let V = a block matrix constructed by
expanding each element a+bi of U
into a matrix [a, b; -b, a]
Step 2:
Mu : a list of complex numbers
X : a list of complex vectors of dimension n
Let Mu, X = eigenvalues and eigenvectors of V
using a program to compute eigenvalues
of real numbers
Step 3:
Initialize Lambda = [] and Y = []
For mu, x in Mu, X:
y : a complex n-vector
For k from 0 to n-1:
Let y[k] = x[2*k] - i*x[2*k+1]
If y is a non-zero vector:
Push mu onto Mu
Push y onto Y
&lt;/code>&lt;/pre>&lt;h2 id="much-ado-about-nothing">Much ado about nothing?&lt;/h2>
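&lt;p>Before weighing that question: the algorithm above does check out end to end, at least on a toy input. Here is a Python sketch for the $1\times 1$ case, using a closed-form $2\times 2$ real eigensolver as a stand-in for &lt;code>DGEEV&lt;/code> (all names and tolerances here are my own, not MAGICL&amp;rsquo;s):&lt;/p>

```python
import cmath

def eig2x2_real(V):
    """Closed-form eigensystem of a real 2x2 matrix; a toy stand-in for
    LAPACK's DGEEV. Assumes the (0,1) entry is nonzero, which holds for
    the embedding of any non-real u."""
    (p, q), (r, s) = V
    tr, det = p + s, p * s - q * r
    disc = cmath.sqrt(tr * tr - 4 * det)
    mus = [(tr + disc) / 2, (tr - disc) / 2]
    # (V - mu*I) v = 0 is solved by v = (q, mu - p) when q != 0.
    return mus, [[q, mu - p] for mu in mus]

def eig_complex_1x1(u):
    a, b = u.real, u.imag
    V = [[a, b], [-b, a]]          # Step 1: embed u as a real 2x2 matrix
    Mu, X = eig2x2_real(V)         # Step 2: real eigensystem
    Lam, Y = [], []
    for mu, x in zip(Mu, X):       # Step 3: disembed; drop vanishing vectors
        y = [x[0] - 1j * x[1]]
        if abs(y[0]) > 1e-9:
            Lam.append(mu)
            Y.append(y)
    return Lam, Y

Lam, Y = eig_complex_1x1(3 + 4j)
assert len(Lam) == 1 and abs(Lam[0] - (3 + 4j)) < 1e-9
```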
&lt;p>While this was an interesting puzzle, and &lt;a href="https://github.com/quil-lang/magicl/blob/f5462dd60ec6b513616f4d2a64cf4c978a92cc19/src/high-level/matrix-functions/eig/eig.lisp#L133">leads to working
code&lt;/a>,
was it all worth it? Honestly, I&amp;rsquo;m not sure it&amp;rsquo;s the best engineering
decision. What happens when we need singular-value decomposition, or
some other goofy matrix algorithm? It&amp;rsquo;s difficult to imagine the
complex embedding trick will work well.&lt;/p>
&lt;p>On the other hand, it saves us from using more antiquated FORTRAN 77
code than we need to. :)&lt;/p>
&lt;hr>
&lt;p>&lt;em>Thanks to Juan Bello-Rivas, Erik Davis, Bryan Fong, Brendan
Pawlowski, and Eric Peterson for insightful discussions.&lt;/em>&lt;/p>
&lt;hr></description></item><item><title>Le blog est mort, vive le blog!</title><link>http://www.stylewarning.com/posts/first/</link><pubDate>Thu, 04 Aug 2022 22:16:21 -0700</pubDate><guid>http://www.stylewarning.com/posts/first/</guid><description>&lt;p>&lt;em>By Robert Smith&lt;/em>&lt;/p>
&lt;p>Between 2010 and 2014, I ran a &lt;a href="http://web.archive.org/web/20140711171817/http://symbo1ics.com/blog/?p=8">WordPress
blog&lt;/a>. One
day, the database was accidentally deleted, and I&amp;rsquo;ve since been too
lazy to set something new up. But, as of late, two things have been
happening:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>I&amp;rsquo;ve been spending more time writing longer comments on r/lisp,
Hacker News, etc. While I might feel proud of having written a good
quality comment, I know that it&amp;rsquo;ll disappear into Internet history
just a few days later, never to be read again, even by me.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>I&amp;rsquo;ve been writing articles in LaTeX that feel an awful lot like
long-form blog posts. Except I never really had an avenue to publish
them, and even if I did, very few people want to read informal PDFs.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Additionally, I&amp;rsquo;ve wanted to write about my piano journey. I&amp;rsquo;ve kept a
personal and private video diary, but I find that the video diary
entries are not very thoughtful and always quite rushed.&lt;/p>
&lt;p>So, while I know personal blogs are no longer in fashion, I hope this
marks a new beginning to my own personal writing journey!&lt;/p></description></item></channel></rss>