Cool. But this makes me wonder, since this negates most of the advantages of C: is there a compiler-autograd "library"? Something that would compile into C specifically, to execute as fast as possible on CPUs with no indirection at all?
At best you'd be restricted to forward mode, which would still double stack pressure. If you needed reverse mode you'd need 2x the stack, and the back sweep over the stack-based tape would have just about the worst possible access "grain". If you allow the higher-order operators (both pushforward and pullback), you're going to end up with Jacobians & Hessians over nontrivial blocks. That's going to need the heap. It's still better than an unbounded loop tape, though.
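To make the stack-pressure point concrete, here's a minimal scalar forward-mode sketch in C (the names are mine, not from any particular library): every double in the primal program becomes a two-field struct, so the numeric state on the stack roughly doubles.

    #include <stdio.h>
    #include <math.h>

    /* Forward-mode "dual number": every scalar carries its derivative,
     * so stack usage for numeric state roughly doubles. */
    typedef struct { double val; double dot; } dual;

    static dual d_mul(dual a, dual b) {
        /* product rule: (ab)' = a'b + ab' */
        return (dual){ a.val * b.val, a.dot * b.val + a.val * b.dot };
    }

    static dual d_sin(dual a) {
        return (dual){ sin(a.val), cos(a.val) * a.dot };
    }

    int main(void) {
        dual x = { 2.0, 1.0 };          /* seed dx/dx = 1 */
        dual y = d_mul(x, d_sin(x));    /* f(x) = x * sin(x) */
        printf("f(2)  = %f\n", y.val);
        printf("f'(2) = %f\n", y.dot);  /* sin(2) + 2*cos(2) */
        return 0;
    }

Everything stays on the stack, but the program's whole numeric footprint has been doubled, and this is the cheap mode.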
We had all these issues back in 2006 when my group was implementing autograd for C++ and, later, a computer algebra system called Axiom. We knew it'd be ideal for NN; I was trying to build this out for my brother who was porting AI models to GPUs. (This did not work in 2006 for both HW & math reasons.)
Why not recompile every iteration? Weights are only updated at the end of a batch at the earliest, and for distributed training every n batches at the fastest, and generally only at the end of an iteration.
In either case the cost of recompiling would be negligible, no?
You'd pay the cost of the core computation O(n) times. Matrix products under the derivative fibration (jet, or whatever your algebra calls it) are just more matrix products. A good-sized NN already lives on the heap anyway. Also, the hard part is finding the ideal combination of fwd vs. rev transforms (optimal Jacobian accumulation, which is NP-hard). It's similar in complexity to finding the ideal sub-block matrix-multiply orchestration.
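To unpack "just more matrix products": pushing a tangent through C = A*B is the product rule at matrix granularity, d(AB) = dA*B + A*dB. A hedged sketch, with illustrative sizes and helper names:

    #include <stdio.h>

    #define N 2

    /* C = A * B for small N x N matrices */
    static void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }
    }

    static void matadd(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = A[i][j] + B[i][j];
    }

    /* JVP of the product: d(AB) = dA*B + A*dB
     * -- two extra matmuls plus an add, same shape as the primal work. */
    static void matmul_jvp(const double A[N][N], const double dA[N][N],
                           const double B[N][N], const double dB[N][N],
                           double dC[N][N]) {
        double t1[N][N], t2[N][N];
        matmul(dA, B, t1);
        matmul(A, dB, t2);
        matadd(t1, t2, dC);
    }

    int main(void) {
        double A[N][N]  = {{1, 2}, {3, 4}};
        double dA[N][N] = {{1, 0}, {0, 0}};  /* perturb A[0][0] only */
        double B[N][N]  = {{5, 6}, {7, 8}};
        double dB[N][N] = {{0}};             /* B held fixed */
        double dC[N][N];
        matmul_jvp(A, dA, B, dB, dC);
        printf("dC row 0: %g %g\n", dC[0][0], dC[0][1]);  /* row 0 of B: 5 6 */
        return 0;
    }

So the differentiated program has the same computational character as the primal one; the cost multiplier is the number of derivative sweeps, not a change in kind.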
So the killer cost is at compile time, not at runtime, and that cost is fundamental to the underlying autograd operation.
On the flip side, it's 2025, not 2006, so modern algorithms & heuristics can change this story quite a bit.
All of this is spelled out in Griewank's book, *Evaluating Derivatives*.
We would need to mirror JAX's architecture more, since JAX is essentially a JIT, architecture-wise. Basically, you need a good way to convert a computational graph to machine code while also performing a set of transformations on the graph at compile time.
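To make "graph in, straight-line code out" concrete, here's a hedged sketch of what such a pipeline might emit as C for a single neuron with a squared loss. A real system would generate this from the traced graph rather than writing it by hand, and all names here are illustrative:

    #include <stdio.h>

    /* What a graph-to-C emitter might produce for
     *   loss = (w*x + b - t)^2
     * after running reverse mode on the graph at compile time:
     * both sweeps are straight-line code, no tape, no indirection. */
    static void step(double x, double t, double *w, double *b, double lr) {
        /* forward pass (primal graph) */
        double y = (*w) * x + (*b);
        double r = y - t;

        /* reverse pass, unrolled by the "compiler" */
        double d_r = 2.0 * r;    /* d(loss)/dr */
        double d_y = d_r;
        double d_w = d_y * x;
        double d_b = d_y;

        *w -= lr * d_w;
        *b -= lr * d_b;
    }

    int main(void) {
        double w = 0.0, b = 0.0;
        for (int i = 0; i < 1000; i++)
            step(2.0, 9.0, &w, &b, 0.05);  /* fit w*2 + b = 9 */
        printf("w = %f, b = %f\n", w, b);
        return 0;
    }

The graph surgery all happens before codegen; at runtime there's nothing left but arithmetic.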
Do you mean the method Theano uses? Anyway, the performance bottleneck often lies in matrix multiplication or 2D CNNs (which can be reduced to matmul). Compiler autograd wouldn't save much time.
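For reference, that reduction is usually done via im2col: unroll each receptive field into a column, and the convolution becomes a single matmul against the flattened kernel. A minimal single-channel, stride-1, no-padding sketch (sizes are illustrative):

    #include <stdio.h>

    #define H 4
    #define W 4
    #define K 3
    #define OH (H - K + 1)
    #define OW (W - K + 1)

    /* im2col: each output position's K*K receptive field becomes a column,
     * so the conv is a (1 x K*K) * (K*K x OH*OW) matrix multiply. */
    static void im2col(const double img[H][W], double cols[K*K][OH*OW]) {
        for (int oy = 0; oy < OH; oy++)
            for (int ox = 0; ox < OW; ox++)
                for (int ky = 0; ky < K; ky++)
                    for (int kx = 0; kx < K; kx++)
                        cols[ky*K + kx][oy*OW + ox] = img[oy + ky][ox + kx];
    }

    int main(void) {
        double img[H][W];
        for (int i = 0; i < H; i++)
            for (int j = 0; j < W; j++)
                img[i][j] = i * W + j;

        double kernel[K*K] = {0,0,0, 0,1,0, 0,0,0};  /* picks the center pixel */
        double cols[K*K][OH*OW];
        im2col(img, cols);

        /* the convolution, now expressed as a matmul */
        for (int c = 0; c < OH*OW; c++) {
            double acc = 0.0;
            for (int r = 0; r < K*K; r++)
                acc += kernel[r] * cols[r][c];
            printf("%s%g", c ? " " : "", acc);
        }
        printf("\n");  /* prints the 2x2 output: 5 6 9 10 */
        return 0;
    }

Once everything funnels into matmul like this, the win comes from the GEMM kernel, not from how the surrounding gradient code was generated.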