Karpathy's AutoResearch: What Happens When AI Does the Research Overnight
Andrej Karpathy released AutoResearch in March 2026, an open-source framework that deploys AI coding agents to run machine learning experiments autonomously while you sleep. It reached 65,000 GitHub stars within weeks. Here is what it actually does and why it matters.
In March 2026, Andrej Karpathy, a founding member of OpenAI and former director of AI at Tesla, released a framework called AutoResearch. The premise is straightforward: you describe what you want to investigate in a text file, start the system before you go to sleep, and wake up to roughly 100 completed machine learning experiments with results ranked by performance. In three weeks, it reached 65,000 GitHub stars. The speed of adoption reflects something real about what the project represents, not just what it does.
What AutoResearch Actually Does
AutoResearch deploys an AI coding agent against a single training script. The agent modifies the script, runs a five-minute training experiment, measures the result using a validation metric called val_bpb (bits per byte, a measure of language model efficiency), and then decides whether to keep the change or discard it. If the change improves the score, it becomes the new baseline. If it does not, the agent rolls back and tries something else. This loop runs continuously, producing roughly 12 experiments per hour, or about 100 overnight.
The fixed five-minute time budget per experiment is a deliberate design choice. It makes results comparable across runs, prevents the agent from spending disproportionate time on any single hypothesis, and fits within the cost profile of a single H100 GPU running overnight. The constraint forces the system to work efficiently rather than exhaustively.
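The keep-or-rollback loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AutoResearch's actual implementation: `run_experiment` and `propose_change` are toy stand-ins for the real five-minute training run and the coding agent's edits to train.py.

```python
import copy
import random

# Sketch of the keep-or-rollback loop (assumed behavior, not the project's code).
# A "script" here is a toy stand-in for train.py; its "quality" plays the role
# of the val_bpb (bits per byte) metric, where lower is better.

def run_experiment(script):
    """Stand-in for a fixed five-minute training run returning val_bpb."""
    return script["quality"]

def propose_change(script):
    """Stand-in for the coding agent editing the training script."""
    candidate = copy.deepcopy(script)
    candidate["quality"] += random.gauss(0, 0.02)  # most edits hurt, a few help
    return candidate

def overnight_loop(baseline, iterations=100):
    best_score = run_experiment(baseline)
    for _ in range(iterations):
        candidate = propose_change(baseline)
        score = run_experiment(candidate)
        if score < best_score:
            # The change improved val_bpb: it becomes the new baseline.
            baseline, best_score = candidate, score
        # Otherwise: discard the candidate and stay on the current baseline.
    return baseline, best_score

random.seed(0)
final_script, final_score = overnight_loop({"quality": 1.0})
print(final_score)  # never worse than the starting baseline of 1.0
```

Because a change is only kept when the metric improves, the baseline's score is monotonically non-increasing over the night, which is what makes the morning result trustworthy.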
The Three-File Architecture
The system is organized around three files, each with a distinct role:
- prepare.py is fixed. It handles data preparation and never changes. This keeps the experimental substrate stable so that variations in results reflect actual model differences rather than data pipeline changes.
- train.py is the agent's canvas. It starts as a baseline training script and gets modified, extended, and refined by the agent over hundreds of iterations. By morning, it may look substantially different from what you started with.
- program.md is written by the human. This is where you describe your research strategy: what approaches to explore, what constraints to respect, what hypotheses to test. It is the only thing the human needs to write.
The simplicity is intentional. Keeping modifications to a single file (train.py) means every change is reviewable. You can look at the diff between the morning version and the starting point and understand what the agent actually did. This is harder to achieve when agents touch many files simultaneously.
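To make the human's role concrete, a program.md might look something like the following. The exact format is whatever the agent is prompted with, so treat the headings and wording here as illustrative, not as the project's required schema.

```markdown
# Research strategy (illustrative program.md)

## Goal
Reduce val_bpb on the baseline train.py within the fixed five-minute budget.

## Approaches to explore
- Learning-rate schedules: warmup length, cosine vs. linear decay
- Architectural tweaks that stay within the same parameter budget
- Data-loading and batching changes that improve throughput

## Constraints
- Modify train.py only; prepare.py is off-limits
- Keep each change small enough to review as a single diff
- Roll back anything that regresses val_bpb
```

Note what the file does not contain: no code, no hyperparameter values, no implementation detail. It is strategy, constraints, and a definition of success.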
You Are Writing the Research Strategy, Not the Code
Karpathy's framing of the human role is worth quoting directly. He describes it this way: "You are not writing the code directly 99% of the time. You are orchestrating agents." The human's job is to write program.md, which he calls the "research org code" - the high-level strategy that defines what the agent should pursue.
This is a meaningful shift from how most people currently think about AI coding tools. The popular framing positions AI as an assistant that helps you write code faster. AutoResearch inverts this: the agent writes the code, runs the experiments, and evaluates the results. The human writes the research direction. The work product of the human is the strategy document, not the implementation.
Whether this framing generalizes beyond ML research is an open question. But within the domain of iterative experimentation, where the goal is to search a large space of possible approaches and identify what works, it fits cleanly. The agent can search that space far faster than any human team.
What the Numbers Look Like
Karpathy ran AutoResearch on a personal project for two days and reported approximately 700 autonomous code changes. Of those, about 20 resulted in additive improvements that compounded into meaningful progress. The cumulative effect was an 11% efficiency gain on the Time to GPT-2 leaderboard, a benchmark that measures how efficiently a model can reach GPT-2-level performance.
The hit rate, roughly 3%, might sound low. But consider the alternative: a human researcher running 700 experiments manually would take months. The agent runs them overnight. The economics change completely when the cost of a failed experiment drops to five minutes of GPU time rather than days of human effort.
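The arithmetic behind these claims is easy to verify. The numbers below are the ones reported in the article, plugged into a back-of-the-envelope check; nothing here is newly measured.

```python
# Back-of-the-envelope check of the reported AutoResearch numbers.
minutes_per_experiment = 5
experiments_per_hour = 60 // minutes_per_experiment   # matches the reported 12/hour
overnight_experiments = experiments_per_hour * 8      # an 8-hour night

total_changes = 700   # reported over two days of running
kept_changes = 20     # additive improvements that compounded
hit_rate = kept_changes / total_changes

print(experiments_per_hour)       # 12
print(overnight_experiments)      # 96, i.e. the "roughly 100 overnight"
print(round(hit_rate * 100, 1))   # 2.9, the "roughly 3%" in the text
```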
A Fair Comparison Mechanism
The fixed five-minute budget also solves a subtle problem in ML research: how do you fairly compare approaches that vary in computational complexity? If one technique requires twice as much compute, a longer training run would make it look better than it is. By holding time constant, AutoResearch ensures that improvements reflect genuine algorithmic gains rather than just "spend more compute" strategies.
Design Decisions That Matter
Several choices in AutoResearch's design reflect lessons from production ML systems:
- A fixed five-minute budget per experiment, which keeps results comparable and bounds cost.
- A single modifiable file (train.py), so every agent change shows up in one reviewable diff.
- A frozen data pipeline (prepare.py), so differences in results reflect the model rather than the data.
- A single scalar metric (val_bpb), which gives the keep-or-rollback decision an unambiguous signal.
These constraints make the system legible. A more powerful agent with fewer restrictions might produce faster results but harder-to-understand ones. AutoResearch trades some raw capability for interpretability, which matters if you want to actually learn from what the agent discovers.
The Broader Signal: Self-Improving AI
Karpathy's description of what AutoResearch represents is more significant than the tool itself. He calls it the beginning of the "self-improvement loopy era of AI": systems where AI agents conduct the research that makes future AI systems better. The loop is: better agents run better experiments, find better training techniques, produce better models, which become better agents.
This is not new as a concept. Researchers have theorized about recursive self-improvement for decades. What is new is that the infrastructure to do it, at least in a limited domain, now fits on a single GPU and can be set up in an afternoon. AutoResearch is not the full self-improvement loop. But it demonstrates one concrete piece of it: AI-driven experimental search producing real, measurable improvements in AI training efficiency.
The implications extend beyond ML research. Any domain with a clear evaluation metric, a modifiable artifact, and a large search space of possible approaches is a candidate for this pattern: software optimization, drug discovery, materials science, financial modeling. The bottleneck in each case is the cost of running experiments; reducing that cost changes what is tractable.
Community Extensions
Within days of release, the community had extended AutoResearch to hardware that was not in the original design:
- macOS with Apple Silicon via MLX, making it accessible without cloud GPU costs for users already on M-series Macs
- Windows with RTX GPUs via community forks that adapt the training pipeline to CUDA on consumer hardware
- AMD GPUs via ROCm-based adaptations for users outside the NVIDIA ecosystem
The breadth of community adaptation reflects genuine interest beyond the ML research community. Developers who are not ML specialists but want to experiment with training optimization now have a path in, on hardware they already own.
What This Means for Teams Building with AI
AutoResearch is a research tool, not a production platform. But the pattern it demonstrates is directly relevant to how teams should think about AI-assisted work more broadly.
The Human Role Is Shifting
If the agent runs the experiments, the human's value is in asking the right questions. Writing a good program.md requires understanding what approaches are worth exploring, what constraints matter, and what success actually looks like. This is higher-level work than writing the code, but it is not easier. It requires domain knowledge and judgment. The shift is not from human work to no human work; it is from implementation to direction.
Overnight Compute Is Underutilized
Most teams running cloud infrastructure have idle GPU capacity overnight. AutoResearch makes the case that this capacity could be doing productive experimental work rather than sitting unused. The question for any team with a clear optimization target and a testable metric is whether the same pattern applies to their problem.
Legibility Has to Be Designed In
The single-file constraint in AutoResearch is not just a technical limitation; it is a legibility feature. When agents can touch anything, understanding what they did requires significant reverse engineering. Designing systems where agent actions are scoped and auditable is increasingly important as autonomy increases. The teams that will be able to trust and iterate on agent-produced work are those that built interpretability in from the start.
Getting Started
AutoResearch is available at github.com/karpathy/autoresearch. The repository includes setup instructions, example program.md files, and documentation on adapting it to different training tasks. If you have access to an H100 or a community-supported GPU, the barrier to running your first overnight experiment is low.
The more interesting question is what you would investigate. AutoResearch gives you the mechanism. The research direction, as always, comes from understanding what problems are worth solving.
At webvise, we work with teams integrating AI into their development and research workflows. If you are thinking about how autonomous agents fit into your processes, get in touch and we can talk through what actually makes sense for your context.