Autoresearch for agent harnesses
Andrej Karpathy recently released autoresearch, a project that showed you can give an AI agent a training script, let it experiment autonomously overnight, and wake up to a better model. It comes up with ideas, tests them against a single metric, keeps what works, and loops indefinitely without any human input.

I wondered if I could apply the same idea to my work developing agents. Building an agent harness is a loop too: you write prompts, design tools, run the agent, analyze what failed, tweak the system, repeat. Could I take myself out of that process entirely and have a meta-agent do it for me?

In theory, yes. You just need a configurable agent harness, a benchmark to score it against, and a meta-agent that can see what the inner agent did wrong and make changes. I set this up to test it.

[Diagram: the meta-agent optimization loop. The meta-agent analyzes results, modifies the harness, runs the benchmark with multiple inner agents, and receives scores and traces back.]
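The loop is simple enough to sketch. The stub functions below are hypothetical stand-ins (the real harness drives LLM agents against a real benchmark, and the prompt rule here is a made-up placeholder), but the accept-or-revert structure is the core of it:

```python
def run_benchmark(harness):
    """Stub scorer. In the real setup this runs the inner agent on each
    benchmark task and returns a score plus full conversation traces."""
    prompt = " ".join(harness["system_prompt"])
    solved = 1 if "Verify your output before finishing." in prompt else 0
    return solved / 6  # six tasks in the initial benchmark subset

def propose_change(harness):
    """Stub meta-agent step. Really an LLM that reads traces, diagnoses
    failures, and edits prompts, tools, or skill files."""
    return {**harness, "system_prompt": harness["system_prompt"] + [
        "Verify your output before finishing."  # placeholder rule
    ]}

harness = {"system_prompt": [], "tools": ["read", "write", "edit", "bash"]}
best_score = run_benchmark(harness)
for _ in range(17):
    candidate = propose_change(harness)
    score = run_benchmark(candidate)
    if score > best_score:  # keep what works,
        harness, best_score = candidate, score
    # otherwise discard the change and try again
```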

For the inner agent - the one being optimized - I used Claude Sonnet 4.6 with Pi, a minimal, unopinionated, and extensible coding agent. To start, the inner agent had an empty system prompt and four tools: read, write, edit, and bash.

The meta-agent (Claude Opus 4.6 with Pi) had the same tools plus web search, and a detailed system prompt that gave it full visibility into the inner agent - conversation traces, diffs, test results - and the ability to edit any part of the harness.
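As a concrete picture of the two configurations (the field names below are illustrative, not Pi's actual config schema):

```python
# Hypothetical configuration shapes recording the setup described above.
inner_agent = {
    "model": "claude-sonnet-4.6",
    "system_prompt": "",  # started empty; the meta-agent is free to grow it
    "tools": ["read", "write", "edit", "bash"],
}

meta_agent = {
    "model": "claude-opus-4.6",
    "tools": inner_agent["tools"] + ["web_search"],
    # Full visibility into each run of the inner agent:
    "inputs": ["conversation_traces", "diffs", "test_results"],
    # And permission to edit any part of the harness:
    "editable": ["system_prompt", "tools", "skills"],
}
```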

I initially chose SpreadsheetBench, a benchmark that evaluates an LLM's ability to manipulate spreadsheets, because it is applicable to my work at Brightwave. I picked a subset of six tasks and kicked it off.

In only a couple of experiments, the meta-agent had improved the inner agent's score on the benchmark from 17% to 50%. It was doing exactly what I do when I'm building agents: reading the agent's trace, building a mental model of what the inner agent was trying to do, diagnosing failures, and tweaking the system.

Some of the changes the meta-agent made to the harness were genuinely good. It noticed the agent was misinterpreting "answer position B2:F8" as "only write to cells B2:F8" rather than "we'll check B2:F8 to see if you got it right." The fix was one line in the system prompt: "Transform the entire spreadsheet, not just the answer range." A good fix that applied across tasks.

But when I looked more closely, I realized what the meta-agent was mostly doing: writing skill files with task-specific instructions just to get individual tasks to pass. One skill file, text-replacement.md, contained the exact strings and exact cell ranges from a single task.
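A skill file in this setup is just a markdown document whose contents get injected into the inner agent's context on the next run. A minimal sketch of the loading side, assuming a hypothetical skills/ directory convention rather than Pi's actual mechanism:

```python
from pathlib import Path

def load_skills(skills_dir: str, base_prompt: str) -> str:
    """Append every skill file's text to the system prompt so the inner
    agent sees it on the next run. (Sketch; Pi's real mechanism may differ.)"""
    prompt = base_prompt
    for path in sorted(Path(skills_dir).glob("*.md")):
        prompt += f"\n\n## Skill: {path.stem}\n{path.read_text()}"
    return prompt
```

The failure mode is visible here: nothing stops a skill like text-replacement.md from containing literal strings and cell ranges from one task, which scores well on that task and generalizes to nothing.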

In another instance, the meta-agent wrote a skill called force-excel-errors.md, which literally instructs the inner agent to create spreadsheets containing errors. Looking closer, I realized the problem wasn't just overfitting: the test suite itself had issues, and some of the ground truth answers were incorrect. The meta-agent was optimizing toward a broken target.

Since I couldn't trust SpreadsheetBench's answers, I switched to SWE-bench Lite, a standard benchmark for evaluating coding agents that is widely used by major labs and battle-tested enough that I could trust the signal. To mitigate overfitting, this time I used 25 tasks instead of six. Each experiment took about 50 minutes on my old MacBook, so I let it run overnight.

After 14 hours and 17 experiments, the meta-agent had not improved the score once. It tried prompt tweaks, skill files, new tools, and extensions; most changes caused regressions from the 68% baseline.
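For scale: the 68% baseline corresponds to 17 of 25 tasks resolved, and each experiment fanned those 25 tasks out to separate inner-agent runs. A stubbed sketch of the scoring, assuming the runs are parallelized as in the diagram above:

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task_id: int) -> bool:
    """Stub: a real run launches one inner agent on one SWE-bench Lite
    instance and checks whether its patch resolves the issue."""
    return task_id < 17  # stand-in for the observed 17/25 baseline

def run_experiment(task_ids):
    # Run inner agents concurrently; the score is the resolve rate.
    with ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(run_task, task_ids))
    return sum(results) / len(results)

score = run_experiment(range(25))  # 17/25 = 0.68
```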

It's possible that with more iterations the agent could have made a breakthrough. Seventeen experiments isn't many, especially compared to Karpathy's setup, which ran 100+ in a single night. Perhaps if the meta-agent had kept going it would have tried more radical approaches: searching the web for the latest research, or experimenting with fundamentally different architectures. But given that each experiment took about 50 minutes and consumed a non-trivial number of tokens across 25 separate agent runs, I stopped it there.

I might have gotten better results by giving it specific things to try, e.g. adding a linter or having the inner agent run tests before submitting patches. But I chose not to, because my hope was that the meta-agent would learn from its own mistakes.

In retrospect, coding may have been the wrong domain for this - it's what Claude is optimized for. There's less room for harness tweaks to matter. The meta-agent came to the same conclusion after its 17th straight failed experiment: "Baseline 0.68 is optimal and represents the agent's natural capability ceiling. ALL interventions degrade performance."

The meta-agent won't be building agents from scratch for me yet, but the infrastructure is worth keeping. Having an agent that can run a benchmark, read the inner agent's trace, and see exactly what failed makes it easy to toss it an idea and let it do the rest. I'll keep using it, just on a shorter leash.

Code and results are available at github.com/nvonpentz/meta-agent.