Idiomatic Lisp and the nbody benchmark
When talking to Lisp programmers, you often hear something like, “adapt Lisp to your problem, not your problem to Lisp.” The basic idea is this: if Lisp doesn’t let you easily write a solution to your problem because it lacks some fundamental constructs that make expressing solutions easy, then add them to Lisp first, then write your solution.
That sounds all good and well in the abstract, and maybe we could even come up with some toy examples—say, defining HTTP request routing logic in a nice DSL. But where’s a real example of this that’s not artificial or overengineered?
Recently, on Twitter, I butted into the middle of an exchange between @Ngnghm (a famous Lisp programmer) and @korulang (an account dedicated to a new language called Koru) about Lisp. I’m oversimplifying, but it went something like this:
- Lisp is slow.
- No it’s not!
- Yes it is!
- No it’s not!
- Then prove it!
Now, there’s plenty of evidence online that Common Lisp has reasonably good compilers that produce reasonably good machine code, and so the question became more nuanced: Can Lisp be realistically competitive with C without ending up being a mess of unidiomatic code?
Our interlocutor @korulang proposed a benchmark, the “nbody” benchmark from the Computer Language Benchmarks Game. This was of particular interest to them, because they used it as an object of study for their Koru language. To quote their blog post:
We wanted Koru kernels to land in the same ballpark as idiomatic C, Rust, and Zig.
The result was stronger than that.
Our fused n-body kernel, written in straightforward Koru kernel style, came in faster than the plain reference implementations. Every implementation here is “naive” — the obvious, idiomatic version a competent programmer would write in each language. No tricks, no hand-tuning, no -ffast-math: […]
and they proceeded to show Koru being 14% faster than C and 106% faster than Lisp.
Now, putting aside that some of the code and blog post were written with LLMs, there are many questions that are left unanswered here, since computer architecture and operating system matter a lot (where did these benchmarks run?). Moreover, the author buries the lede a little bit and proceeds to show how we might write “unidiomatic” C to match the performance of Koru.
I’m not concerned about nitpicking their approach or rigorously evaluating their claims, but I would like to dwell on this common refrain: “idiomatic”. What is that supposed to mean?
“Idiomatic code” in the context of programming means something like “representative of a fluent computer programmer” and “aligned with the peculiar characteristics of the language”. In some sense, idiomatic code in a particular language shouldn’t stand out amongst other code in that language, and idiomatic code should, in some sense, portray the identity of the language itself.
Idiomatic C is the C that uses terse names, simple loops, and unsafe arithmetic.
Idiomatic Haskell is the Haskell that uses short functions, higher-order abstractions, immutable data structures, and safe constructs.
What about idiomatic Lisp? Well, here’s the rub. A fluent programmer at Lisp doesn’t reach for one paradigmatic toolbox; they weave in and out of imperative, functional, object-oriented, etc. styles without much of a second thought. There’s a sort of “meta” characteristic to Lisp programming: you’re programming the language almost as much as you’re programming the program.
Yes, Lisp has loops, but “loopy code” isn’t intrinsically “Lispy code”. Yes, Lisp has objects, but “OOPy code” isn’t intrinsically “Lispy code”. In my opinion, what makes code “Lispy” is whether or not the programmer used Lisp’s metaprogramming and/or built-in multi-paradigm facilities to a reasonable degree to make the solution to their problem efficient and easy to understand in some global sense. For some problems, that may be “loopy” or “OOPy” or something else. It’s finding a Pareto-efficient syntactic and semantic combination offered by the language, or perhaps one of the programmer’s own creation.
So we get back to the @korulang benchmark challenge. Looking at their repository:
nbody.clooks like idiomatic C;nbody.hslooks like wildly unidiomatic Haskell, but the problem is, the idiomatic version would probably be slower;nbody.lisplooks reasonable, though it could easily be improved, but loopy; and- The Koru solution
kernel_fused.kzlooks idiomatic, as far as I can tell for not knowing anything about Koru.
I hesitate to say nbody.lisp is idiomatic. It’s reasonable, it’s straightforward to any imperative-minded programmer, but it’s not Lispy. That doesn’t make it good or bad, but it does lead to the grand question:
Can we use Common Lisp to express a solution to the nbody benchmark in a way that reads more naturally than a direct-from-C port?
I would say that, at face value, Koru’s solution is along the lines of what is more natural relative to the problem itself. Here are the essential bits.
~std.kernel:shape(Body) {
x: f64, y: f64, z: f64,
vx: f64, vy: f64, vz: f64,
mass: f64,
}
~std.kernel:init(Body) {
{ x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0, mass: SOLAR_MASS },
{ x: 4.84143144246472090e+00, y: -1.16032004402742839e+00, z: -1.03622044471123109e-01, vx: 1.66007664274403694e-03 * DAYS_PER_YEAR, vy: 7.69901118419740425e-03 * DAYS_PER_YEAR, vz: -6.90460016972063023e-05 * DAYS_PER_YEAR, mass: 9.54791938424326609e-04 * SOLAR_MASS },
{ x: 8.34336671824457987e+00, y: 4.12479856412430479e+00, z: -4.03523417114321381e-01, vx: -2.76742510726862411e-03 * DAYS_PER_YEAR, vy: 4.99852801234917238e-03 * DAYS_PER_YEAR, vz: 2.30417297573763929e-05 * DAYS_PER_YEAR, mass: 2.85885980666130812e-04 * SOLAR_MASS },
{ x: 1.28943695621391310e+01, y: -1.51111514016986312e+01, z: -2.23307578892655734e-01, vx: 2.96460137564761618e-03 * DAYS_PER_YEAR, vy: 2.37847173959480950e-03 * DAYS_PER_YEAR, vz: -2.96589568540237556e-05 * DAYS_PER_YEAR, mass: 4.36624404335156298e-05 * SOLAR_MASS },
{ x: 1.53796971148509165e+01, y: -2.59193146099879641e+01, z: 1.79258772950371181e-01, vx: 2.68067772490389322e-03 * DAYS_PER_YEAR, vy: 1.62824170038242295e-03 * DAYS_PER_YEAR, vz: -9.51592254519715870e-05 * DAYS_PER_YEAR, mass: 5.15138902046611451e-05 * SOLAR_MASS },
}
| kernel k |>
std.kernel:step(0..iterations)
|> std.kernel:pairwise {
const dx = k.x - k.other.x;
const dy = k.y - k.other.y;
const dz = k.z - k.other.z;
const dsq = dx*dx + dy*dy + dz*dz;
const mag = DT / (dsq * @sqrt(dsq));
k.vx -= dx * k.other.mass * mag;
k.vy -= dy * k.other.mass * mag;
k.vz -= dz * k.other.mass * mag;
k.other.vx += dx * k.mass * mag;
k.other.vy += dy * k.mass * mag;
k.other.vz += dz * k.mass * mag;
}
|> std.kernel:self {
k.x += DT * k.vx;
k.y += DT * k.vy;
k.z += DT * k.vz;
}
| computed c |>
capture({ energy: @as(f64, 0) })
| as acc |>
for(0..5)
| each i |>
captured { energy: acc.energy + 0.5*c[i].mass*(c[i].vx*c[i].vx+c[i].vy*c[i].vy+c[i].vz*c[i].vz) }
|> for(i+1..5)
| each j |>
captured { energy: acc.energy - c[i].mass*c[j].mass / @sqrt((c[i].x-c[j].x)*(c[i].x-c[j].x)+(c[i].y-c[j].y)*(c[i].y-c[j].y)+(c[i].z-c[j].z)*(c[i].z-c[j].z)) }
| captured final |>
std.io:print.blk {
{{ final.energy:d:.9 }}
}
Can we achieve something similar in Lisp?
First, let’s make a baseline. I’m running Ubuntu Noble with a “AMD RYZEN AI MAX+ PRO 395” with a clock speed that varies between 0.6–5 GHz. I am also using SBCL 2.6.3 and gcc 13.3. Using nbody.lisp as a starting point, I modified it for a few easy wins. I’ll call this version nbody-lisp-conventional. A quick benchmark reveals that the loopy Lisp code is only about 20% slower than the C code compiled with gcc -O3 -ffast-math -march=native.
$ ./nbody-lisp-conventional 50000000
-0.169286396
timing: 2000 ms
$ ./nbody-c 50000000
-0.169286396
timing: 1662 ms
As a Lisp programmer, it’s not surprising that it’s a little slower. The number of person-years that have gone into C compilers to optimize idiomatic C code makes the development effort behind SBCL, the most popular open-source Lisp compiler, look like a rounding error.
Now that we have a baseline, our goal is to come up with a nicer Lisp program that also improves the timing.
Our approach will be simple. We will create a library.lisp that contains new language constructs of a similar ilk to Koru, and we will use them to implement the nbody benchmark in impl.lisp. Some rules:
- No compile-time precomputation or caching. I can’t just compute the answer at compile time, or cache a sub-computation that makes the full one trivial.
- No fundamental algorithm changes. I can’t use a different integrator, for example.
- Using assembly is allowed, but it must only make use of the facilities offered by the Lisp compiler (i.e., no external tools), and the implementation of nbody itself must be understandable without knowing assembly. In other words, it should be sufficiently hidden, and in principle easily substitutable with portable code.
- Library code must be in principle useful for other similar tasks. It should not be hyper-specialized to this specific problem instance, but instead be useful for this general class of problems.
The third rule is more rigorous than it looks. It means we can’t just have a solve-nbody problem which dispatches to assembly.
To accomplish the above, we define a kernel DSL. The DSL allows us to express how elements of a composite transform, maintaining just enough invariants to allow them to be handled efficiently. These kernels are then compiled into efficient code, more efficient than ordinary loopy Lisp allows for.
Our attention will be focused on a proof-of-concept library of functionality for writing particle simulators. The operators we define are:
define-kernel-shape: Define the data to be transformed by each kernel. This would be the data to characterize the static and dynamic properties of a particle in motion, as well as the number of particles under consideration.define-kernel-step: Define a kernel as a sequence of existing ones.define-self-kernel: Define a read-write kernel that operates on each element independently, without access to other elements (i.e., a map operation).define-pairwise-kernel: Define a read-write kernel that operates on all pairs of elements, reduced by symmetry (i.e.,(i,j)and(j,i)are considered only once).define-reduction-kernel: Define a read-only kernel that does reduction of a sequence into a single value (i.e., a reduce operation).
This collection of five operators forms a miniature, re-usable language. These broadly recapitulate those of Koru, and allow us to write something that looks like this:
(defconstant +solar-mass+ (* 4d0 pi pi))
(defconstant +days-per-year+ 365.24d0)
(defconstant +dt+ 0.01d0)
(define-kernel-shape body 5
x y z vx vy vz mass)
(defparameter *system*
(make-body-system
(list :x 0d0 :y 0d0 :z 0d0
:vx 0d0 :vy 0d0 :vz 0d0
:mass +solar-mass+)
...))
(define-pairwise-kernel advance-forces (s body dt)
(let* ((dx (- i.x j.x))
(dy (- i.y j.y))
(dz (- i.z j.z))
(dsq (+ (+ (* dx dx) (* dy dy)) (* dz dz)))
(mag (/ dt (* dsq (sqrt dsq)))))
(let ((dm-j (* mag j.mass))
(dm-i (* mag i.mass)))
(decf i.vx (* dx dm-j))
(decf i.vy (* dy dm-j))
(decf i.vz (* dz dm-j))
(incf j.vx (* dx dm-i))
(incf j.vy (* dy dm-i))
(incf j.vz (* dz dm-i)))))
(define-self-kernel advance-positions (s body dt)
(incf self.x (* dt self.vx))
(incf self.y (* dt self.vy))
(incf self.z (* dt self.vz)))
(define-reduction-kernel (energy e 0d0) (s body)
(:self
(+ e (* (* 0.5d0 self.mass)
(+ (+ (* self.vx self.vx) (* self.vy self.vy))
(* self.vz self.vz)))))
(:pair
(let* ((dx (- i.x j.x))
(dy (- i.y j.y))
(dz (- i.z j.z)))
(- e (/ (* i.mass j.mass)
(sqrt (+ (+ (* dx dx) (* dy dy))
(* dz dz))))))))
(define-kernel-step run-simulation (system body n :params ((dt double-float)))
(advance-forces dt)
(advance-positions dt))
Well, in fact, this isn’t an ideal approximation, it’s almost exactly how it turned out. Given this is a proof of concept, we sometimes have to write some Lisp things a little funny. For example, you’ll notice we write:
(+ (+ (* dx dx) (* dy dy)) (* dz dz))
instead of the far more readable
(+ (* dx dx) (* dy dy) (* dz dz))
Both are completely valid and both can be used. So why the former? It is a result of a limitation of a little feature I built in: auto-vectorization. The vectorizer walks the mathematical expressions and replaces them with fast SIMD variants instead. Here’s a little fragment showing this rewrite rule:
...
(case (car expr)
;; (+ a (* b c)) -> fmadd(a,b,c)
((+)
(let ((args (cdr expr)))
(cond
((and (= (length args) 2) (mul-p (second args)))
`(%%fmadd-pd ,(xf (first args))
,(xf (second (second args)))
,(xf (third (second args)))))
...
The implementation of these kernel macros in library.lisp weighs in at just under 700 lines, and includes optional x64 SIMD auto-vectorization.
Well, for the nail biting moment, how does it compare? I made a Makefile that compares the idiomatic C against the loopy Lisp against our kernel DSL Lisp. It does a median-of-3. Running this on my computer gives:
$ make bench
=== C (gcc -O3 -ffast-math) ===
-0.169286396
runs: 1657 1664 1653 ms
median: 1657 ms
=== Lisp (SBCL, conventional loops) ===
-0.169286396
runs: 1991 2009 2005 ms
median: 2005 ms
=== Lisp (SBCL, kernel syntax) ===
-0.169286396
runs: 1651 1651 1652 ms
median: 1651 ms
So, in fact, we have matched the performance of C almost exactly. Furthermore, the generated code is still not as lean as it could be. Not to put too fine a point on it, but, <100 lines of Lisp, supported by
- 700 lines of library code and about 4 hours of my time; and
- 500k lines of its host compiler
sbcl
has performance parity and greater readability/reusability than <100 lines of C, supported by
- ~5,000k lines of just the C part of its host compiler
gcc.
None of this is to make an argument that Lisp is “better”, or that there isn’t merit to avoiding custom DSLs in certain circumstances, or that the world doesn’t have room for more custom home-grown compilers and parsers, but I think this is the clearest possible, quasi-realistic demonstration that idiomatic Lisp can be as fast as idiomatic C without tremendous work, whilst netting additional benefits unique to Lisp.
All code is available here.