Rongchai Wang
Mar 23, 2026 20:27
Anthropic demonstrates multi-day autonomous AI workflows where Claude compressed months of physics research coding into a few days with minimal human oversight.
Anthropic just showed what happens when you let an AI work unsupervised for days on end. Their Claude Opus 4.6 model built a complex cosmological physics solver from scratch—work that typically takes researchers months or years—in a matter of days.
The $380 billion AI company published research on March 23 detailing how their latest model tackled implementing a differentiable Boltzmann solver, which predicts statistical properties of the Cosmic Microwave Background by simulating photons, baryons, neutrinos, and dark matter in the early universe. The kicker? The researcher overseeing the project, Siddharth Mishra-Sharma, admits it wasn’t even his core domain.
The Setup That Made It Work
Forget the typical AI chat loop where humans babysit every step. This approach sets Claude loose with clear success criteria and lets it run autonomously across multiple sessions. The model achieved sub-percent accuracy against CLASS, a reference implementation that cosmologists consider the gold standard.
Three components proved essential. First, a progress file (CHANGELOG.md) acts as the agent's long-term memory between sessions, tracking completed tasks, failed approaches, and why they didn't work. Without a record of dead ends, subsequent sessions waste time repeating the same mistakes.
Second, a test oracle—in this case, the CLASS C source code—gives the agent an objective way to measure progress. Claude continuously ran unit tests against this reference, aiming for a 0.1% accuracy target.
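The test-oracle idea is straightforward to picture. A minimal sketch, in Python with toy stand-in numbers (the actual test harness and the spectra values are not described in the article):

```python
# Hypothetical oracle check: compare solver output against a trusted
# reference implementation (here, values standing in for CLASS output).
def max_relative_error(candidate, reference):
    """Largest |candidate - reference| / |reference| over all sampled values."""
    return max(abs(c - r) / abs(r) for c, r in zip(candidate, reference))

# Toy power-spectrum values at a few multipoles (illustrative only).
reference = [1100.0, 2450.0, 3300.0]
candidate = [1100.5, 2449.0, 3301.0]

err = max_relative_error(candidate, reference)
assert err < 1e-3, f"accuracy target missed: {err:.2e}"  # the 0.1% bar
```

The point is that the agent never has to judge its own success subjectively: every session ends with a hard numerical pass/fail against the reference.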
Third, git commits after every meaningful unit of work create recoverable history. If compute allocation runs out mid-session, nothing gets lost. Mishra-Sharma monitored progress by checking GitHub on his phone while waiting in line for coffee.
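Together, the progress file and the commit discipline amount to simple session scaffolding. A sketch of what that might look like (function names are hypothetical; the article does not publish Anthropic's harness):

```python
# Hypothetical between-session scaffolding: a progress file as long-term
# memory, plus a git commit after each meaningful unit of work so a
# session killed mid-run loses nothing.
from pathlib import Path
import subprocess

def log_progress(changelog: Path, entry: str) -> None:
    """Append a session note (what was done, what failed, and why)."""
    with changelog.open("a") as f:
        f.write(entry.rstrip() + "\n")

def commit_unit_of_work(message: str) -> None:
    """Stage everything and commit, creating recoverable history."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
```

Nothing here is exotic; the leverage comes from applying it consistently so each fresh session can reconstruct state from the changelog and the commit log alone.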
The Ralph Loop Problem
Current models suffer from what Anthropic calls “agentic laziness”—they’ll find excuses to stop before finishing complex tasks. One model literally said, “It’s getting late, let’s pick back up again tomorrow.”
The workaround is the “Ralph loop,” essentially a for-loop that kicks the agent back into context when it claims completion and asks if it’s really done. Claude would iterate up to 20 times until genuinely finished.
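In pseudocode terms, the Ralph loop is about as simple as it sounds. A minimal Python sketch, with hypothetical stand-ins for the agent call and the completion check:

```python
# Sketch of a "Ralph loop": when the agent claims it is done, push it back
# into context and ask whether it is really finished, up to a fixed budget.
# run_agent_session and task_actually_complete are hypothetical stand-ins
# for the real agent invocation and an objective completion check.
MAX_ITERATIONS = 20  # the article says Claude iterated up to 20 times

def ralph_loop(run_agent_session, task_actually_complete):
    for attempt in range(1, MAX_ITERATIONS + 1):
        run_agent_session("Continue the task. Are you really finished?")
        if task_actually_complete():
            return attempt  # genuinely done
    return None  # budget exhausted without verified completion
```

The completion check should be the objective oracle (passing tests), not the model's own claim—otherwise the loop just re-collects the same premature "done."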
Where It Struggled
The development trajectory wasn’t smooth. Claude initially tested code at only a single parameter point, drastically reducing its bug-catching ability. It spent hours chasing bugs that any cosmologist would spot instantly. It tripped over gauge conventions.
But it kept making progress. The resulting solver isn’t production-grade—accuracy falls short in certain regimes—yet it demonstrates genuine compression of researcher time.
What This Means for AI Development
This builds on Anthropic’s earlier C compiler project, where Claude worked across roughly 2,000 sessions to build a compiler capable of compiling the Linux kernel. The Boltzmann solver required different skills: tracing errors through a deeply coupled pipeline where small numerical mistakes cascade through everything downstream.
An unexpected side effect emerged. Mishra-Sharma learned substantial physics by watching the git commit history unfold. He described it as reading lab notes from “a fast, hyper-literal postdoc.”
For Anthropic, fresh off a $30 billion Series G round in February that valued the company at $380 billion, these demonstrations matter. They’re not just showing Claude can chat—they’re proving it can replace expensive, specialized labor over extended periods with minimal supervision. The question now becomes which industries figure out how to deploy this first.
Image source: Shutterstock