TL;DR
Recursive Language Models (RLMs) treat the whole long prompt as a Python variable the model can grep, slice, and recurse over instead of stuffing it into the context window. MIT’s paper by Zhang, Kraska, and Khattab shows an 8B model jumping 28.3% on average over its own baseline and approaching GPT-5 on three long-context tasks, with inputs up to two orders of magnitude beyond the nominal context limit. The cost is real: blocking recursive calls, no prefix caching, runtimes in minutes. For retrieval and multi-document reasoning, the results are hard to ignore.
The long-context problem hasn’t gone away
Model makers keep raising context limits. Gemini 1.5 hit 2M tokens. GPT-5.4 advertises a million. Claude’s sliding window keeps getting longer. Benchmarks keep saying the same thing anyway: quality rots somewhere around 30k to 100k tokens, and by the time you fill a 1M window most frontier models are guessing. Needle-in-haystack tests are generous because the needle is usually a single factual string. Real long-context tasks (reading five earnings calls, tracing an event across a huge codebase, comparing chapters of two books) collapse faster.
The usual workaround is a retrieval pipeline. You index chunks, embed a query, pull the top-k, stuff them in a prompt. It helps but forces you to decide up front what the model will see. The model never gets to ask “wait, let me read section 4 of that second document again” — you picked for it.
RLMs flip the control. Instead of choosing chunks in advance, you give the model a Python REPL, stash the whole prompt as a variable, and let it decide what to look at.
The idea in plain terms
Here’s the loop. The root model gets a small system prompt saying: “you have a variable prompt_text with N characters of context, a Python REPL, and a function rlm() you can call on any slice to spawn a sub-model with fresh context.” That covers the whole interface.
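A minimal version of that interface can be sketched as follows; the prompt wording and the `build_root_messages` helper are illustrative assumptions, not the paper's actual prompt:

```python
# Illustrative sketch of the root call's interface. The wording and helper
# name are assumptions, not the paper's exact system prompt.
SYSTEM_PROMPT = """You are given a variable `prompt_text` containing {n} characters
of context, a Python REPL, and a function `rlm(text, question)` that spawns a
sub-model whose fresh context window contains only `text` and `question`.
Inspect `prompt_text` with code, recurse where needed, and finish by calling
FINAL(answer) or FINAL_VAR(variable)."""

def build_root_messages(prompt_text: str, question: str) -> list[dict]:
    """Assemble the root model's messages; the context itself stays in the REPL."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(n=len(prompt_text))},
        {"role": "user", "content": question},
    ]
```

The point of the shape: the long context never appears in the messages at all, only its length does.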
Inside that REPL, the model runs as a programmer. It can:
- Run `len(prompt_text)` or `prompt_text[:2000]` to skim.
- Use regex to find candidate passages.
- Chunk the text into sections and map a recursive call over each chunk.
- Return variables from nested calls using `FINAL_VAR(...)`.
- Emit the final answer with `FINAL(answer)`.
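The skim-and-grep operations look like this against a toy context; the variable name `prompt_text` follows the paper, while the text and regex are made up:

```python
import re

# Toy stand-in for the stored prompt; in a real run this is the full long input.
prompt_text = ("Q3 revenue grew 12%. " * 500
               + "Cloud revenue fell 4%. "
               + "Filler. " * 500)

# Skim: check size and peek at the head without tokenizing the whole thing.
n = len(prompt_text)
head = prompt_text[:2000]

# Grep: find candidate passages by keyword instead of reading linearly.
hits = [m.start() for m in re.finditer(r"revenue", prompt_text)]
windows = [prompt_text[max(0, i - 100):i + 100] for i in hits[:3]]
```

Each entry in `windows` is a small snippet the model can pull back into its own prompt, or hand to a recursive call.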
Instead of reading the context in its own window, the model writes code that reads the context and pulls only relevant snippets back into its prompt. When it needs to look at a section in detail, it spawns a recursive instance whose window only contains that section plus the sub-question.
Mental picture: the root call behaves like a research assistant with a library card, a keyword index, and a habit of writing Python. It can page through a giant book by calling a junior assistant for each chapter and compile the answers. The junior never sees the whole book — only the chapter plus instructions.
What the paper measured
The authors test on four long-context tasks. The benchmark that makes the biggest impression is OOLONG, a multi-doc QA suite that stresses attention drift.
| Task | Model | Input | Result |
|---|---|---|---|
| OOLONG 132k | GPT-5 | full prompt | baseline |
| OOLONG 132k | RLM(GPT-5-mini) | REPL-chunked | +34 points, ~114% over GPT-5, cheaper per query |
| OOLONG 263k | RLM(GPT-5-mini) | REPL-chunked | +49% over GPT-5, still cheaper |
| BrowseComp-Plus 1000 docs | GPT-5 | full context | severe degradation |
| BrowseComp-Plus 1000 docs | RLM(GPT-5) | REPL-driven | near-perfect |
The 8B post-trained model, which the authors call RLM-Qwen3-8B, averages 28.3% over the base Qwen3-8B across the suite and reaches roughly GPT-5 quality on three of the four tasks. That detail is what caught people’s attention: an 8B model taught to operate the REPL can match a frontier 1M-context model on the exact use case those big contexts were sold for.
Cost is the counterintuitive piece. Because the REPL filters out irrelevant chunks before any sub-model reads them, sparse retrieval tasks spend very little compute on the big context. Most of the prompt gets discarded by code and is never tokenized by a language model. The authors report comparable or lower cost than vanilla long-context inference.
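A back-of-envelope sketch of why that filtering is cheap; every number here is an illustrative assumption, not a figure from the paper:

```python
# Back-of-envelope sketch with made-up but plausible numbers; none of these
# figures come from the paper.
total_chars = 132_000 * 4      # ~132k tokens of context at ~4 chars per token
window_chars = 20_000          # size of each code-scanned window
relevant_windows = 3           # windows that actually match the query

# Vanilla long-context call: the model tokenizes everything.
vanilla_tokens = total_chars // 4

# RLM: only matching windows (plus a small root REPL trace) reach a model.
rlm_tokens = relevant_windows * (window_chars // 4) + 2_000

savings = 1 - rlm_tokens / vanilla_tokens  # fraction of tokens never billed
```

Under these assumptions the model reads roughly 17k tokens instead of 132k; the ratio shifts with how sparse the task is, which is exactly why dense comparison tasks benefit less.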
Root call and recursive calls
```mermaid
flowchart TD
    U[User query] --> R[Root LM at depth 0]
    R --> P{Python REPL}
    P --> V[(prompt_text variable)]
    P -->|rlm on slice| S1[Sub-LM depth 1: chunk 1]
    P -->|rlm on slice| S2[Sub-LM depth 1: chunk 2]
    P -->|rlm on slice| S3[Sub-LM depth 1: chunk 3]
    S1 -->|summary| P
    S2 -->|summary| P
    S3 -->|summary| P
    R --> F[FINAL answer]
```
Depth is not fixed. A sub-call at depth 1 can itself spawn depth-2 calls if its share of the context is still too big. In practice the paper keeps trees shallow — two levels is enough for most benchmarks they ran.
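That shallow-tree behavior can be sketched as a depth-budgeted splitter; `rlm_call`, `answer_fn`, and the character budget are illustrative stand-ins, not the library's API:

```python
def rlm_call(text: str, question: str, answer_fn, max_depth: int = 2,
             fits: int = 20_000, depth: int = 0) -> str:
    """Recurse until a slice fits one sub-model window, then answer it.

    `answer_fn(text, question)` stands in for a real LM call; `fits` is an
    illustrative per-call character budget.
    """
    if len(text) <= fits or depth >= max_depth:
        return answer_fn(text, question)
    # Split into windows and map a depth+1 call over each one.
    parts = [text[i:i + fits] for i in range(0, len(text), fits)]
    summaries = [rlm_call(p, question, answer_fn, max_depth, fits, depth + 1)
                 for p in parts]
    # Aggregate sub-answers in a final call over the (much shorter) summaries.
    return answer_fn("\n".join(summaries), question)
```

A 50k-character input at a 20k budget splits into three leaf calls plus one aggregation call: two levels, matching the shallow trees the paper reports.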
What using it looks like
The authors ship a plug-in replacement API. Their library at github.com/alexzhang13/rlm lets you swap gpt5.completion(messages) for rlm.completion(messages) and it handles the REPL orchestration, recursive call budget, and summary aggregation.
Your application code stays the same. The difference is that the model, when handed a giant prompt, stops trying to read it and starts scripting against it. If you already call an Anthropic API-style completion endpoint, the concept ports over with a custom wrapper.
A toy run, sketched against the library’s README:
```python
from rlm import RLM

long_transcript = open("5_hour_earnings_call.txt").read()
question = "Which product line had the biggest YoY revenue swing?"

rlm = RLM(model="gpt-5-mini", max_depth=2)
result = rlm.completion(f"{long_transcript}\n\n{question}")
print(result.response)
```
Under the hood, the root model might write something like hits = [i for i in range(0, len(prompt_text), 20000) if "revenue" in prompt_text[i:i+20000].lower()] and then recurse into each matching 20k-char window.
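Expanded into a runnable sketch (`scan_and_recurse` and `sub_answer` are illustrative stand-ins for the model-written code and the `rlm()` sub-call):

```python
def scan_and_recurse(prompt_text: str, keyword: str, sub_answer,
                     window: int = 20_000) -> list[str]:
    """Keyword-filter fixed windows, then spawn a sub-call per matching window.

    `sub_answer(chunk)` stands in for rlm() on a slice; only matching windows
    ever reach a model.
    """
    hits = [i for i in range(0, len(prompt_text), window)
            if keyword in prompt_text[i:i + window].lower()]
    return [sub_answer(prompt_text[i:i + window]) for i in hits]
```

Non-matching windows are dropped by pure string code, which is where the cost savings on sparse retrieval come from.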
The real costs
The design has sharp edges. Some are called out in the paper; some fall out of how a REPL plus recursion works in practice.
- Every recursive call is synchronous and blocking. The root waits on each sub-call in order. A 100-chunk scan runs at chunk-latency times 100. The paper explicitly flags the synchronous design.
- No prefix caching across recursive calls. Each sub-call is a cold start, so cache hit rates you rely on with a single long prompt evaporate.
- Runtime is unpredictable. Simple factual retrieval might finish in seconds. A multi-doc comparison can run minutes. You cannot give an SLA up front.
- Cost varies widely per run. Median runs are cheaper than vanilla long-context calls, but outliers get much more expensive because the root model decides how many sub-calls to make. Adversarial prompts can trigger runaway recursion; the paper caps depth, but spend inside a depth budget is still a function of model behavior.
- Debugging is worse. When a final answer is wrong, you have to inspect a REPL trace plus whatever sub-call summaries fed back up. It is closer to debugging a program than a prompt.
Most of these are engineering problems: parallelize sibling sub-calls, add prefix caching across recursive shards, put a hard budget on total tokens. The authors explicitly leave these optimizations to future work.
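The first of those fixes is small in code terms. A sketch of fanning sibling sub-calls out over a thread pool, with `sub_call` standing in for an I/O-bound `rlm()` request:

```python
from concurrent.futures import ThreadPoolExecutor

def map_subcalls_parallel(chunks: list[str], sub_call,
                          max_workers: int = 8) -> list[str]:
    """Run sibling sub-calls concurrently instead of awaiting each in order.

    Sub-calls are I/O-bound API requests, so threads suffice; pool.map
    preserves result order, keeping downstream aggregation deterministic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sub_call, chunks))
```

With this shape a 100-chunk scan costs roughly chunk-latency times 100 divided by the worker count, instead of the fully serial version the paper describes.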
What this means for practitioners
If you work on retrieval pipelines, agent memory, or long-document QA, RLMs are worth a weekend. A few takes:
- RLMs are RAG with the roles flipped. RAG picks chunks before the model sees them. RLMs let the model pick its own chunks at inference time. The same idea keeps showing up across recent long-context research; agentic memory is an adjacent line of work.
- They complement compression work, not replace it. Methods like TriAttention’s KV cache compression or TurboQuant shrink the window you do send. RLMs sidestep the window by only sending the parts the question actually needs.
- Cheap models get punchy. The headline number is GPT-5-mini in an RLM wrapper beating GPT-5 at 132k tokens. If the orchestration is cheap enough, you can run a smaller model more often rather than paying premium for a larger context.
- It will not fix short-context tasks. If your prompt fits comfortably in 32k tokens and the model already handles it, adding a REPL layer only adds latency.
The harder question is whether post-training a model specifically for RLM use becomes a standard step. The RLM-Qwen3-8B result suggests a native RLM model handles the REPL protocol better than a general model asked to be clever about it. Expect more of that.
FAQ
What is a recursive language model in one sentence?
An inference setup where the LLM treats a long prompt as a Python variable and writes code to grep, slice, and recursively call itself over relevant pieces, instead of reading the whole thing in one window.
Who wrote the RLM paper?
Alex L. Zhang, Tim Kraska, and Omar Khattab. The arXiv ID is 2512.24601, with initial submission December 31, 2025 and a revised version January 28, 2026.
How is RLM different from RAG?
RAG picks which chunks the model sees before inference. RLM lets the model pick chunks during inference by running code against the raw prompt. RAG is still faster and cheaper for common cases; RLM wins when the right chunk is context-dependent.
Does RLM need a special model?
No. The paper demonstrates RLM with vanilla GPT-5-mini and GPT-5. The authors also post-train RLM-Qwen3-8B for a native version, which does better on the same tasks than a general 8B model asked to operate the REPL.
What are the main limitations?
Recursive calls are blocking, prefix caching is lost between calls, and runtime can stretch into minutes. Cost control is weaker than a single API call because the root model decides how many sub-calls to make.
Where can I try it?
The reference implementation is at github.com/alexzhang13/rlm. It supports multiple sandbox backends and drops in as a rlm.completion(...) replacement for the standard OpenAI-style call.
Bottom line
RLMs are a rare case of an inference-time trick that does not need bigger models, longer windows, or new attention math. The move is conceptual: stop pretending the model reads million-token prompts, and let it script against them instead. The numbers say a cheap model with a REPL outruns an expensive model with a giant window on the tasks those windows were built for. Whether your stack can afford minute-long synchronous calls is a different question, but the research direction is coming whether you opt in or not.