Autoresearch for agent harnesses
Andrej Karpathy recently released autoresearch, a project that showed you can give an AI agent a training script, let it experiment autonomously overnight, and wake up to a better model. It comes up with ideas, tests them against a single metric, keeps what works, and loops forever without any human input.
I wondered if I could apply the same idea to my work developing agents. Building an agent harness is a loop too: you write prompts, design tools, run the agent, analyze what failed, tweak the system, repeat. Could I take myself out of that process entirely and have a meta-agent do it for me?
In theory, yes. You need a configurable agent harness, a benchmark to score it against, and a meta-agent that can see what the inner agent did wrong and make changes. I set this up to test it.
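Concretely, the outer loop looks something like the sketch below. This is an illustration rather than the actual code in the repo: eval.py and meta_agent.py are hypothetical entry points, and I'm using git as the keep-or-revert mechanism.

```python
import subprocess
from pathlib import Path

def pass_rate(harness: Path) -> float:
    """Score the current harness. Hypothetical eval.py runs the inner
    agent on every benchmark task and prints the fraction solved."""
    out = subprocess.run(["python", "eval.py"], cwd=harness,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

def optimize(harness: Path, experiments: int) -> float:
    best = pass_rate(harness)  # score the unmodified baseline first
    for _ in range(experiments):
        # Meta-agent inspects traces, diffs, and results, then edits
        # the harness files (system prompt, tools, skills) in place.
        subprocess.run(["python", "meta_agent.py"], cwd=harness, check=True)
        score = pass_rate(harness)
        if score > best:
            best = score  # keep what works
            subprocess.run(["git", "commit", "-am", f"score={score:.2f}"],
                           cwd=harness, check=True)
        else:
            # Roll back regressions before the next experiment.
            subprocess.run(["git", "checkout", "--", "."],
                           cwd=harness, check=True)
    return best
```

Keeping the harness in version control means a failed experiment costs nothing but time.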
For the inner agent - the one being optimized - I used Pi, a minimal, unopinionated, and extensible coding agent. To start, the inner agent had an empty system prompt and four tools: read, write, edit, and bash.
The meta-agent (also Pi) had the same tools plus web search, and a detailed system prompt that gave it full visibility into the inner agent - conversation traces, diffs, test results - and the ability to edit the harness.
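To make the asymmetry concrete, the two configurations differ roughly as follows. This is a hand-written sketch, not Pi's actual configuration format:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    system_prompt: str
    tools: list[str] = field(default_factory=list)

# Inner agent: deliberately bare, so the meta-agent has room to improve it.
inner = AgentConfig(
    system_prompt="",  # starts empty; editing it is one of the meta-agent's levers
    tools=["read", "write", "edit", "bash"],
)

# Meta-agent: same tools plus web search, and a prompt describing its job.
meta = AgentConfig(
    system_prompt=(
        "You improve a coding agent's harness. Read its conversation "
        "traces, diffs, and test results; edit its system prompt, tools, "
        "and skill files to raise its benchmark score."
    ),
    tools=["read", "write", "edit", "bash", "web_search"],
)
```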
I initially chose SpreadsheetBench, a benchmark that evaluates an LLM's ability to manipulate spreadsheets, because it is applicable to my work at Brightwave. I picked a subset of six tasks and kicked it off.
In only a couple of experiments, the meta-agent had improved the inner agent's score on the benchmark from 17% to 50%. It was doing exactly what I do when I'm building agents: reading the agent's trace, building a mental model of what the inner agent was trying to do, diagnosing failures, and tweaking the system.
Some of the changes the meta-agent made to the harness were genuinely good. It noticed the agent was misinterpreting "answer position B2:F8" as "only write to cells B2:F8" rather than "we'll check B2:F8 to see if you got it right." The fix was one line in the system prompt: "Transform the entire spreadsheet, not just the answer range." A good fix that applied across tasks.
But when I looked more closely, I realized what the meta-agent was mostly doing: writing skill files with hard-coded instructions to make individual tasks pass. It wrote a skill file called text-replacement.md that contained the exact strings and cell ranges from a single task.
In another instance, the meta-agent wrote a skill called force-excel-errors.md - a skill that literally instructs the inner agent to create spreadsheets containing errors. I dug into that one and realized the problem wasn't just overfitting: the test suite itself was flawed, and some of the ground-truth answers were incorrect. The meta-agent was optimizing toward a broken target.
I concluded I needed a different benchmark and switched to SWE-bench Lite, a standard benchmark for evaluating coding agents - widely used by major labs and battle-tested enough that I could trust the signal. To mitigate overfitting, I also used 25 tasks this time instead of six. Each experiment took about 50 minutes on my old MacBook, so I let it run overnight.
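For a sense of scale, one experiment here means a full pass over the task set. A minimal sketch of that inner evaluation, with solve and check as hypothetical stand-ins for the SWE-bench Lite harness (one inner-agent run per task, then applying the patch and running the repo's tests):

```python
import time
from typing import Callable

def run_experiment(
    tasks: list[str],
    solve: Callable[[str], str],        # task_id -> patch (one inner-agent run)
    check: Callable[[str, str], bool],  # (task_id, patch) -> did the tests pass?
) -> float:
    """One experiment is a full pass over the task set; returns the resolve rate."""
    start = time.time()
    solved = sum(check(task, solve(task)) for task in tasks)
    print(f"{solved}/{len(tasks)} resolved in {(time.time() - start) / 60:.0f} min")
    return solved / len(tasks)
```

At roughly two minutes per task, 25 sequential runs is where the ~50 minutes per experiment comes from.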
After 14 hours and 17 experiments, the meta-agent never improved the score. It tried prompt tweaks, skill files, tools, and extensions; most changes caused regressions from the baseline score of 68%.
It's possible that with more iterations, the agent could have made a breakthrough. Seventeen experiments isn't a lot, especially compared to Karpathy's autoresearch, which runs 100+ overnight. Perhaps if the agent had kept going it would have tried more radical approaches: searching the web for the latest research, or experimenting with fundamentally different architectures. But given that each experiment took about 50 minutes, and each one consumed a non-trivial number of tokens across 25 separate agent runs, I stopped it there.
I might have gotten better results by giving it specific things to try - e.g. adding a linter, or having the inner agent run tests before submitting patches. But I chose not to, because my hope was that the meta-agent would learn from its own mistakes.
In retrospect, coding is probably the wrong domain for this - it's what these models are most heavily optimized for. There's less room for harness tweaks to matter. The meta-agent came to the same conclusion after its 17th straight failed experiment: "Baseline 0.68 is optimal and represents the agent's natural capability ceiling. ALL interventions degrade performance."
While I won't be using meta-agents to build agents from scratch, I will be using them as part of the development process. You can toss one an idea, let it iterate a few rounds, and come back to the results. At minimum, it handles the tedious optimization work. At best, it finds improvements I wouldn't have tried.
Code and results are available at github.com/nvonpentz/meta-agent.