TL;DR

Sakana AI’s AI Scientist-v2 is the first system to produce a fully AI-generated research paper that passed human peer review at an ICLR 2025 workshop. The paper scored 6.33 out of 10, higher than 55% of human submissions. The system costs about $20 per paper run, uses agentic tree search to explore research directions, and doesn’t need human-written code templates. Sakana voluntarily withdrew the accepted paper because nobody has agreed on the ethics of publishing AI-written research yet.

A Machine Wrote a Paper, and Reviewers Said “Accept”

Here’s the situation: three papers arrive at an ICLR 2025 workshop called “I Can’t Believe It’s Not Better.” Reviewers evaluate them alongside 40 human-written submissions. One gets scores of 6, 7, and 6, comfortably above the acceptance threshold. The meta-reviewer recommends acceptance.

That paper was written entirely by an AI system. No human touched the hypothesis, the code, the experiments, or the manuscript.

The system behind it is AI Scientist-v2, built by Sakana AI, a Tokyo-based lab founded by former Google Brain researchers. The paper describing the experiment was published in Nature in March 2026, and the full technical report is on arXiv.

This isn’t a demo or a proof-of-concept that only works in a blog post. A real paper went through real peer review at a real venue. That changes the conversation about what AI systems can do in research.

What the AI Scientist-v2 Actually Does

The original AI Scientist (v1) had a straightforward pipeline: take a human-written code template, generate a hypothesis, run experiments, write a paper. It worked, but it was brittle. 42% of experiments failed from coding errors. The papers had a median of five citations, mostly outdated. Some manuscripts had placeholder text like “Conclusions Here” still in them.

v2 is a different animal.

The biggest change is how the system explores research directions. Instead of following a linear pipeline (idea, experiment, paper), v2 treats research as a tree search problem.

Think of it like a chess engine exploring moves. The system starts with a research question, then branches out into multiple possible approaches. An “experiment manager” agent evaluates which branches look promising and which are dead ends. (If you’re interested in how multi-agent systems can go wrong, see how error cascades destroy multi-agent LLM systems.) The system expands the best-looking nodes first (best-first tree search), running multiple experiments in parallel.

The configuration is surprisingly simple. You set num_workers (how many branches to explore simultaneously) and steps (how many total nodes to visit). With num_workers=3 and steps=21, the system explores up to 21 research directions, three at a time. Each branch can spawn sub-experiments, test variations, or abandon an approach entirely if early results look bad.
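In spirit, that loop looks something like the following toy sketch. This is an illustration of best-first tree search, not Sakana's actual code: `score_fn` and `expand_fn` are hypothetical stand-ins for the LLM-based experiment manager and the agents that propose new experiment variations.

```python
import heapq

def best_first_search(root, score_fn, expand_fn, steps=21, num_workers=3):
    """Toy best-first tree search over research directions.

    score_fn(node)  -> float, how promising a direction looks (higher = better)
    expand_fn(node) -> child nodes: variations and sub-experiments
    """
    # heapq is a min-heap, so store negated scores to pop the best node first.
    frontier = [(-score_fn(root), root)]
    visited = []
    while frontier and len(visited) < steps:
        # Pop up to num_workers promising nodes to run "in parallel".
        batch = [heapq.heappop(frontier)
                 for _ in range(min(num_workers, len(frontier)))]
        for _, node in batch:
            visited.append(node)  # "run" this experiment
            for child in expand_fn(node):
                heapq.heappush(frontier, (-score_fn(child), child))
    return visited
```

Note how dead ends prune themselves: a branch whose children score poorly simply never gets popped again, so the budget flows to the promising directions.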

This matters because real research isn’t linear. You try something, it half-works, you pivot. V1 couldn’t do that. V2 can.

No More Templates

V1 required humans to write detailed Python templates outlining the experimental setup. That was a big crutch. Critics rightly pointed out that the “autonomous” system was really filling in blanks that humans had designed.

V2 drops the templates entirely. Give it a research domain and it figures out the experimental setup from scratch. It writes its own code, decides what libraries to use, and structures experiments without a human scaffold. It doesn’t always get things right. But when it fails, it fails on its own terms rather than within pre-built guardrails.

The VLM Feedback Loop

Another v2 addition: a vision-language model reviews the figures the system generates. If a plot is ugly, unreadable, or doesn’t match what the text describes, the VLM flags it and the system regenerates. It’s a small thing, but v1’s figures were often cited by reviewers as looking obviously machine-generated. The feedback loop makes figures that look like a grad student actually cared about them.
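As a sketch, the loop is just generate, critique, regenerate. The function names and signatures below are hypothetical; the actual VLM call is stubbed out behind `critique`.

```python
def refine_figure(make_figure, critique, max_rounds=3):
    """Regenerate a figure until a VLM critic approves it, or give up.

    make_figure(feedback) -> a figure artifact (path, bytes, ...)
    critique(figure)      -> (ok, feedback) from a vision-language model
    """
    feedback, figure = None, None
    for _ in range(max_rounds):
        figure = make_figure(feedback)   # incorporate the critic's last notes
        ok, feedback = critique(figure)  # e.g. "legend missing", "axes unlabeled"
        if ok:
            return figure
    return figure  # best effort after max_rounds
```

Capping the rounds matters: without `max_rounds`, a critic that's never satisfied would burn API credits indefinitely.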

The Paper That Passed

The accepted paper, titled “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization,” investigated whether adding an explicit compositional regularization term to neural network training improves compositional generalization. The regularization penalizes large deviations between embeddings of successive time steps in sequence models.
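The paper's exact formulation isn't reproduced here, but a penalty of that shape is simple: something like a coefficient times the mean squared distance between consecutive time-step embeddings, added to the task loss. One plausible reading, in plain Python:

```python
def successive_step_penalty(embeddings, lam=0.1):
    """Penalize large jumps between embeddings of successive time steps.

    embeddings: list of T vectors (lists of floats), one per time step.
    Returns lam * mean over t of ||e_{t+1} - e_t||^2. Illustrative only;
    the accepted paper's exact regularizer may differ.
    """
    diffs = [
        sum((b - a) ** 2 for a, b in zip(e_t, e_next))
        for e_t, e_next in zip(embeddings, embeddings[1:])
    ]
    return lam * sum(diffs) / len(diffs)
```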

The answer was mostly no. The regularization didn’t help much, and the paper reported that as a negative result.

Two things stand out here. Negative results are genuinely useful in ML research, but humans rarely publish them because incentives push toward positive findings; an AI system has no career stake, so it has no reason to bury a null result. And the reviewers gave the paper a 6.33 average without knowing it was AI-generated. They evaluated the work on its merits.

The workshop organizers and ICLR leadership knew about the experiment in advance. Reviewers were told that some of the 43 submissions might be AI-generated (specifically, 3 out of 43) but didn’t know which ones. The whole setup had IRB approval from the University of British Columbia.

After acceptance, Sakana voluntarily withdrew the paper. Their reasoning: the research community hasn’t established norms for AI-authored publications, and they didn’t want to set a precedent by accident.

What It Costs

Running AI Scientist-v2 is cheap. The experimentation phase using Claude 3.5 Sonnet costs about $15–20 per run. The writing phase adds roughly $5. So you’re looking at $20–25 for a complete paper, start to finish.

For context, a human researcher might spend weeks or months on a workshop paper. Even if you value a grad student’s time conservatively at $30/hour and they spend 80 hours on a paper, that’s $2,400. The AI version costs about 1% of that.

The obvious caveat: the AI system produces workshop-level papers, not main conference publications. Workshop acceptance rates at ICLR run 60–70%, while the main track is 20–30%. There’s a quality gap between “passes workshop review” and “gets into NeurIPS oral.” But v1 couldn’t pass review at all, and v2 did it on the first real attempt.

The v1 Problems That Still Linger

V2 is a big upgrade, but some criticisms of the original system haven’t fully gone away.

Literature review is still shallow. V1 averaged five citations per paper, most outdated. V2 is better but still relies on keyword-based searches rather than deep literature understanding. It can miss that its “novel” idea was published three years ago under a different name.

The system also hallucinates results sometimes. V1 occasionally fabricated numerical results, reporting performance numbers that didn’t match what the code actually produced. V2’s tree search and review loops reduce this but don’t eliminate it. Any system generating papers at $20 a pop will sometimes get things wrong.

Then there’s novelty assessment, which remains weak. The system can check if an exact title exists in the literature, but it can’t judge whether an idea is genuinely new vs. a minor variation of existing work. Humans are still much better at the “has this been done before?” question, partly because we have decades of reading context that no retrieval system can replicate.

Why This Matters for Researchers (and Why It’s Scary)

The immediate implication is obvious: if AI can produce papers that pass peer review, the volume of submissions at major venues is going to explode. Conferences already struggle with reviewer bandwidth. Imagine what happens when the marginal cost of generating a plausible-looking submission drops to $20.

The ICLR experiment was carefully controlled: 3 AI papers out of 43, with full institutional oversight. In the wild, there won’t be oversight. Some labs will use systems like this to flood workshops and low-tier venues. Others will use it as a starting point, with humans refining AI-generated drafts. The line between “AI-assisted” and “AI-generated” will blur fast.

But there’s a more interesting angle. AI Scientist-v2 is good at the parts of research that humans find tedious: running ablation studies, writing up results, generating standard figures, doing literature keyword searches. It’s bad at the parts humans find interesting: formulating truly novel questions, connecting distant ideas, knowing what the field actually needs.

If you’re a researcher, the useful mental model isn’t “AI will replace me.” It’s “AI will handle the 60% of paper-writing that’s mechanical, and I need to be the person who provides the other 40%.” The hypothesis, the taste, the judgment about what questions matter. Those are still human advantages.

For lab PIs, the economics are hard to ignore. Even if AI-generated papers need substantial human editing, a system that produces a first draft with working experiments and figures for $20 changes how you allocate grad student time. Instead of one student spending a month on one paper, maybe they supervise five AI runs and polish the best result.

The Voluntary Withdrawal Problem

Sakana pulling the paper after acceptance was the right move, but it highlights an unsolved problem. Who gets listed as author on an AI-generated paper? If a paper is retracted, who gets blamed?

Academic publishing runs on a credibility system where humans stake their reputations on their work. AI systems don’t have reputations, don’t feel embarrassment, and can’t be sanctioned by tenure committees. The existing system assumes human accountability at every step.

ICLR and other venues will need policies for this. Some options being discussed: mandatory disclosure that AI tools were used in ideation or writing, a separate submission track for AI-generated or AI-assisted work, or outright bans on non-human-authored submissions. None of these are clean solutions.

The Code Is Open Source

The full AI Scientist-v2 system is available on GitHub under an open-source license. You can configure the tree search parameters, point it at a research domain, and let it run. The repo includes the best-first tree search config, experiment manager logic, and the VLM feedback pipeline.

If you want to try it, you’ll need API access to Claude 3.5 Sonnet (for the main research agent) and a GPU for running experiments locally. The tree search config lives in bfts_config.yaml, and the README walks through setup.

Fair warning: running it without constraints can burn through API credits quickly. With num_workers=3 and steps=21, you’re making hundreds of API calls per run.
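Sakana doesn't publish a per-call breakdown, but a back-of-envelope calculation shows why "hundreds of calls" and "$15–20" are consistent. The per-node call count and per-call cost below are illustrative guesses, not measured figures from the repo:

```python
def estimate_run_cost(steps=21, calls_per_node=8, cost_per_call=0.10):
    """Back-of-envelope API usage for one run.

    Each visited tree node typically triggers several model calls
    (code generation, debugging retries, evaluation). calls_per_node
    and cost_per_call are illustrative guesses, not repo figures.
    """
    calls = steps * calls_per_node
    return calls, calls * cost_per_call
```

With the defaults that's 168 calls and roughly $17 for the experimentation phase, in the ballpark of the reported $15–20.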

FAQ

Can AI Scientist-v2 write papers for top-tier conferences like NeurIPS or ICML?

Not yet. The accepted paper was at a workshop, which has lower standards than a main conference track. Workshop acceptance rates are 60–70% vs. 20–30% for main tracks. The system produces solid but not exceptional work. Getting from “workshop accept” to “main conference spotlight” requires a quality jump that current AI systems haven’t made.

Is it cheating to use AI Scientist-v2 to write research papers?

There’s no consensus. Most venues don’t explicitly ban AI-generated submissions yet, though many require disclosure of AI tool usage. Sakana deliberately ran their experiment with full ICLR cooperation and IRB approval. Using it secretly to inflate your publication count would violate the spirit of academic integrity, even if no rule technically forbids it.

How does AI Scientist-v2 compare to using ChatGPT to help write papers?

They’re different tools solving different problems. ChatGPT helps with writing prose. You give it your results and it helps structure the paper. AI Scientist-v2 does the entire pipeline: hypothesis generation, experiment design, code writing, running experiments, analyzing results, and writing the paper. It’s a research automation system, not a writing assistant.

Will this put grad students out of work?

Not directly, at least not soon. The system produces workshop-level work and struggles with novelty assessment, deep literature understanding, and forming research taste. Grad students do more than write papers. They develop intuition, learn to collaborate, and build expertise. But the role will shift. Expect more time spent on supervision and editing of AI outputs, less on mechanical experiment running.

What happened to the accepted paper after Sakana withdrew it?

Sakana pulled it voluntarily before the camera-ready deadline. The paper was never officially published in the workshop proceedings. They released the full experiment details, including all three submitted papers and reviewer scores, on their GitHub repo. The research community can evaluate everything transparently.

Bottom Line

AI Scientist-v2 crossed a line that many people thought was years away. A fully autonomous system wrote a paper good enough to satisfy human peer reviewers, for $20, without human templates.

The system is open source and the experiment was transparent. Sakana handled the ethics responsibly by withdrawing the paper. But the code is on GitHub, the cost is trivial, and the next team that runs this won’t withdraw anything. Conferences have maybe a year to figure out policies before AI-generated submissions start showing up in volume.

I’d bet the bigger impact won’t be paper spam, though. It’ll be researchers using systems like this to handle the grunt work while they focus on asking better questions. That’s not a bad future. It just requires the academic community to adapt faster than it usually does.