Benchmarking AI on lambda calculus tasks

A tiny benchmark for checking whether models can actually handle lambda calculus, which is a nice change from the usual “looks smart in a demo, falls apart on a real task” routine.

This is the kind of benchmark I trust way more than “write a poem then explain it” stuff, because you can’t really vibe your way through beta-reduction. Curious how much of the score ends up being “can it stay consistent for 20+ steps” versus “did it memorize common encodings”.

Can you explain in detail what beta-reduction means?

Beta-reduction is basically “function application” in lambda calculus: when you have (λx. body) arg, you replace x inside body with arg (so (λx. x) y → y, and (λx. x x) z → z z). The one detail that trips people up is variable capture, so you sometimes rename bound variables first (α-conversion) to keep free variables free: for example, (λx. λy. x) y should reduce to λt. y for some fresh name t, not λy. y.
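
To make that concrete, here’s a minimal Haskell sketch of capture-avoiding substitution plus a single beta step at the root. The Term type, fresh, and betaRoot are names I made up for illustration; none of this is taken from the benchmark.

```haskell
import Data.List (union, delete)

data Term = Var String | Lam String Term | App Term Term
  deriving (Eq, Show)

freeVars :: Term -> [String]
freeVars (Var x)   = [x]
freeVars (Lam x b) = delete x (freeVars b)
freeVars (App f a) = freeVars f `union` freeVars a

-- Pick a name not in the avoid list (used for alpha-conversion).
fresh :: String -> [String] -> String
fresh x avoid =
  head [n | n <- x : [x ++ show i | i <- [(1 :: Int) ..]], n `notElem` avoid]

-- subst x s t == t[x := s], renaming bound variables to avoid capture.
subst :: String -> Term -> Term -> Term
subst x s (Var y)
  | y == x    = s
  | otherwise = Var y
subst x s (App f a) = App (subst x s f) (subst x s a)
subst x s (Lam y b)
  | y == x              = Lam y b   -- x is shadowed here, stop
  | y `elem` freeVars s =           -- binder would capture: rename it first
      let y' = fresh y (freeVars s `union` freeVars b)
      in Lam y' (subst x s (subst y (Var y') b))
  | otherwise           = Lam y (subst x s b)

-- One beta step at the root: (λx. b) a --> b[x := a]
betaRoot :: Term -> Maybe Term
betaRoot (App (Lam x b) a) = Just (subst x a b)
betaRoot _                 = Nothing

-- (λx. λy. x) y --> λy1. y (the fresh name comes out as y1), not λy. y
main :: IO ()
main = print (betaRoot (App (Lam "x" (Lam "y" (Var "x"))) (Var "y")))
```

The Lam case is the whole story: substitution stops at a shadowing binder, and renames the binder whenever it would capture a free variable of the argument.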

The evaluation strategy matters a lot for benchmarking, because “beta-reduction” isn’t a single deterministic process once you have choices about which redex to fire (normal order vs applicative order, and even how aggressively you reduce under lambdas).
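
To see why the strategy choice isn’t a nitpick, here’s a hedged sketch of single-step reducers for both orders (again with my own Term/subst definitions, repeated so the file runs on its own; nothing here is from the benchmark). Applying a constant function to Ω = (λx. x x)(λx. x x) reaches a normal form in one step under normal order, but rewrites Ω to itself forever under applicative order.

```haskell
import Data.List (union, delete)

data Term = Var String | Lam String Term | App Term Term
  deriving (Eq, Show)

freeVars :: Term -> [String]
freeVars (Var x)   = [x]
freeVars (Lam x b) = delete x (freeVars b)
freeVars (App f a) = freeVars f `union` freeVars a

fresh :: String -> [String] -> String
fresh x avoid =
  head [n | n <- x : [x ++ show i | i <- [(1 :: Int) ..]], n `notElem` avoid]

-- Capture-avoiding substitution, same as the earlier sketch.
subst :: String -> Term -> Term -> Term
subst x s (Var y)
  | y == x    = s
  | otherwise = Var y
subst x s (App f a) = App (subst x s f) (subst x s a)
subst x s (Lam y b)
  | y == x              = Lam y b
  | y `elem` freeVars s =
      let y' = fresh y (freeVars s `union` freeVars b)
      in Lam y' (subst x s (subst y (Var y') b))
  | otherwise           = Lam y (subst x s b)

-- Normal order: fire the leftmost-outermost redex first.
stepNormal :: Term -> Maybe Term
stepNormal (App (Lam x b) a) = Just (subst x a b)
stepNormal (App f a) = case stepNormal f of
  Just f' -> Just (App f' a)
  Nothing -> App f <$> stepNormal a
stepNormal (Lam x b) = Lam x <$> stepNormal b
stepNormal (Var _)   = Nothing

-- Applicative order: fully reduce the argument before firing the redex.
stepApplicative :: Term -> Maybe Term
stepApplicative (App f a)
  | Just f' <- stepApplicative f = Just (App f' a)
  | Just a' <- stepApplicative a = Just (App f a')
  | Lam x b <- f                 = Just (subst x a b)
stepApplicative (Lam x b) = Lam x <$> stepApplicative b
stepApplicative _ = Nothing

omega :: Term  -- (λx. x x) (λx. x x): reduces to itself, forever
omega = App w w where w = Lam "x" (App (Var "x") (Var "x"))

main :: IO ()
main = do
  let t = App (Lam "x" (Var "z")) omega
  print (stepNormal t)                 -- Just (Var "z"): done in one step
  print (stepApplicative t == Just t)  -- True: the redex inside omega fires forever
```

So the same term either terminates or loops depending purely on which redex you fire first, which is exactly the ambiguity a benchmark grader has to pin down.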

If LamBench doesn’t pin that down, two models could both be “correct” but produce different intermediate steps, or even diverge vs. terminate on the same term, which makes the score feel kind of arbitrary. I’m not sure what LamBench currently assumes here, but I’d want the benchmark to state the strategy explicitly (and probably treat α-conversion/capture-avoidance as part of the required semantics, not an optional nicety).