TL;DR
Recursive Language Models (RLMs) treat the whole long prompt as a Python variable the model can grep, slice, and recurse over instead of stuffing it into the context window. MIT’s paper by Zhang, Kraska, and Khattab shows an 8B model jumping 28.3% on average over its own baseline and approaching GPT-5 on three long-context tasks, with inputs up to two orders of magnitude beyond the nominal context limit. The cost is real: blocking recursive calls, no prefix caching, runtimes in minutes. For retrieval and multi-document reasoning, the results are hard to ignore.
The long-context problem hasn’t gone away
Model makers keep raising context limits. Gemini 1.5 hit 2M tokens. GPT-5.4 advertises a million. Claude’s sliding window keeps getting longer. Benchmarks keep saying the same thing anyway: quality rots somewhere around 30k to 100k tokens, and by the time you fill a 1M window most frontier models are guessing. Needle-in-haystack tests are generous because the needle is usually a single factual string. Real long-context tasks (reading five earnings calls, tracing an event across a huge codebase, comparing chapters of two books) collapse faster.
The usual workaround is a retrieval pipeline. You index chunks, embed a query, pull the top-k, stuff them in a prompt. It helps but forces you to decide up front what the model will see. The model never gets to ask “wait, let me read section 4 of that second document again” — you picked for it.
RLMs flip the control. Instead of choosing chunks in advance, you give the model a Python REPL, stash the whole prompt as a variable, and let it decide what to look at.
The idea in plain terms
Here’s the loop. The root model gets a small system prompt saying: “you have a variable prompt_text with N characters of context, a Python REPL, and a function rlm() you can call on any slice to spawn a sub-model with fresh context.” That covers the whole interface.
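A minimal version of that interface can be sketched as follows; the prompt wording and the `build_root_messages` helper are illustrative assumptions, not the paper's actual prompt:

```python
# Illustrative sketch of the root call's interface. The wording and helper
# name are assumptions, not the paper's exact system prompt.
SYSTEM_PROMPT = """You are given a variable `prompt_text` containing {n} characters
of context, a Python REPL, and a function `rlm(text, question)` that spawns a
sub-model whose fresh context window contains only `text` and `question`.
Inspect `prompt_text` with code, recurse where needed, and finish by calling
FINAL(answer) or FINAL_VAR(variable)."""

def build_root_messages(prompt_text: str, question: str) -> list[dict]:
    """Assemble the root model's messages; the context itself stays in the REPL."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT.format(n=len(prompt_text))},
        {"role": "user", "content": question},
    ]
```

The point of the shape: the long context never appears in the messages at all, only its length does.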
Inside that REPL, the model runs as a programmer. It can:
- Run `len(prompt_text)` or `prompt_text[:2000]` to skim.
- Use regex to find candidate passages.
- Chunk the text into sections and map a recursive call over each chunk.
- Return variables from nested calls using `FINAL_VAR(...)`.
- Emit the final answer with `FINAL(answer)`.
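The skim-and-grep operations look like this against a toy context; the variable name `prompt_text` follows the paper, while the text and regex are made up:

```python
import re

# Toy stand-in for the stored prompt; in a real run this is the full long input.
prompt_text = ("Q3 revenue grew 12%. " * 500
               + "Cloud revenue fell 4%. "
               + "Filler. " * 500)

# Skim: check size and peek at the head without tokenizing the whole thing.
n = len(prompt_text)
head = prompt_text[:2000]

# Grep: find candidate passages by keyword instead of reading linearly.
hits = [m.start() for m in re.finditer(r"revenue", prompt_text)]
windows = [prompt_text[max(0, i - 100):i + 100] for i in hits[:3]]
```

Each entry in `windows` is a small snippet the model can pull back into its own prompt, or hand to a recursive call.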
Instead of reading the context in its own window, the model writes code that reads the context and pulls only relevant snippets back into its prompt. When it needs to look at a section in detail, it spawns a recursive instance whose window only contains that section plus the sub-question.
Mental picture: the root call behaves like a research assistant with a library card, a keyword index, and a habit of writing Python. It can page through a giant book by calling a junior assistant for each chapter and compile the answers. The junior never sees the whole book — only the chapter plus instructions.
What the paper measured
The authors test on four long-context tasks. The benchmark that makes the biggest impression is OOLONG, a multi-doc QA suite that stresses attention drift.
| Task | Model | Input | Result |
|---|---|---|---|
| OOLONG 132k | GPT-5 | full prompt | baseline |
| OOLONG 132k | RLM(GPT-5-mini) | REPL-chunked | +34 points, ~114% over GPT-5, cheaper per query |
| OOLONG 263k | RLM(GPT-5-mini) | REPL-chunked | +49% over GPT-5, still cheaper |
| BrowseComp-Plus 1000 docs | GPT-5 | full context | severe degradation |
| BrowseComp-Plus 1000 docs | RLM(GPT-5) | REPL-driven | near-perfect |
The 8B post-trained model, which the authors call RLM-Qwen3-8B, averages 28.3% over the base Qwen3-8B across the suite and reaches roughly GPT-5 quality on three of the four tasks. That detail is what caught people’s attention: an 8B model taught to operate the REPL can match a frontier 1M-context model on the exact use case those big contexts were sold for.
Cost is the counterintuitive piece. Because the REPL filters out irrelevant chunks before any sub-model reads them, sparse retrieval tasks spend very little compute on the big context. Most of the prompt gets discarded by code and is never tokenized by a language model. The authors report comparable or lower cost than vanilla long-context inference.
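A back-of-envelope sketch of why that filtering is cheap; every number here is an illustrative assumption, not a figure from the paper:

```python
# Back-of-envelope sketch with made-up but plausible numbers; none of these
# figures come from the paper.
total_chars = 132_000 * 4      # ~132k tokens of context at ~4 chars per token
window_chars = 20_000          # size of each code-scanned window
relevant_windows = 3           # windows that actually match the query

# Vanilla long-context call: the model tokenizes everything.
vanilla_tokens = total_chars // 4

# RLM: only matching windows (plus a small root REPL trace) reach a model.
rlm_tokens = relevant_windows * (window_chars // 4) + 2_000

savings = 1 - rlm_tokens / vanilla_tokens  # fraction of tokens never billed
```

Under these assumptions the model reads roughly 17k tokens instead of 132k; the ratio shifts with how sparse the task is, which is exactly why dense comparison tasks benefit less.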
Root call and recursive calls
```mermaid
flowchart TD
    U[User query] --> R[Root LM at depth 0]
    R --> P{Python REPL}
    P --> V[(prompt_text variable)]
    P -->|rlm on slice| S1[Sub-LM depth 1: chunk 1]
    P -->|rlm on slice| S2[Sub-LM depth 1: chunk 2]
    P -->|rlm on slice| S3[Sub-LM depth 1: chunk 3]
    S1 -->|summary| P
    S2 -->|summary| P
    S3 -->|summary| P
    R --> F[FINAL answer]
```
Depth is not fixed. A sub-call at depth 1 can itself spawn depth-2 calls if its share of the context is still too big. In practice the paper keeps trees shallow — two levels is enough for most benchmarks they ran.
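That shallow-tree behavior can be sketched as a depth-budgeted splitter; `rlm_call`, `answer_fn`, and the character budget are illustrative stand-ins, not the library's API:

```python
def rlm_call(text: str, question: str, answer_fn, max_depth: int = 2,
             fits: int = 20_000, depth: int = 0) -> str:
    """Recurse until a slice fits one sub-model window, then answer it.

    `answer_fn(text, question)` stands in for a real LM call; `fits` is an
    illustrative per-call character budget.
    """
    if len(text) <= fits or depth >= max_depth:
        return answer_fn(text, question)
    # Split into windows and map a depth+1 call over each one.
    parts = [text[i:i + fits] for i in range(0, len(text), fits)]
    summaries = [rlm_call(p, question, answer_fn, max_depth, fits, depth + 1)
                 for p in parts]
    # Aggregate sub-answers in a final call over the (much shorter) summaries.
    return answer_fn("\n".join(summaries), question)
```

A 50k-character input at a 20k budget splits into three leaf calls plus one aggregation call: two levels, matching the shallow trees the paper reports.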
What using it looks like
The authors ship a plug-in replacement API. Their library at github.com/alexzhang13/rlm lets you swap gpt5.completion(messages) for rlm.completion(messages) and it handles the REPL orchestration, recursive call budget, and summary aggregation.
Your application code stays the same. The difference is that the model, when handed a giant prompt, stops trying to read it and starts scripting against it. If you already call an Anthropic API-style completion endpoint, the concept ports over with a custom wrapper.
A toy run, sketched against the library’s README:
```python
from rlm import RLM

long_transcript = open("5_hour_earnings_call.txt").read()
question = "Which product line had the biggest YoY revenue swing?"

rlm = RLM(model="gpt-5-mini", max_depth=2)
result = rlm.completion(f"{long_transcript}\n\n{question}")
print(result.response)
```
Under the hood, the root model might write something like hits = [i for i in range(0, len(prompt_text), 20000) if "revenue" in prompt_text[i:i+20000].lower()] and then recurse into each matching 20k-char window.
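Expanded into a runnable sketch (`scan_and_recurse` and `sub_answer` are illustrative stand-ins for the model-written code and the `rlm()` sub-call):

```python
def scan_and_recurse(prompt_text: str, keyword: str, sub_answer,
                     window: int = 20_000) -> list[str]:
    """Keyword-filter fixed windows, then spawn a sub-call per matching window.

    `sub_answer(chunk)` stands in for rlm() on a slice; only matching windows
    ever reach a model.
    """
    hits = [i for i in range(0, len(prompt_text), window)
            if keyword in prompt_text[i:i + window].lower()]
    return [sub_answer(prompt_text[i:i + window]) for i in hits]
```

Non-matching windows are dropped by pure string code, which is where the cost savings on sparse retrieval come from.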
The real costs
The design has sharp edges. Some are called out in the paper; some fall out of how a REPL plus recursion works in practice.
- Every recursive call is synchronous and blocking. The root waits on each sub-call in order. A 100-chunk scan runs at chunk-latency times 100. The paper explicitly flags the synchronous design.
- No prefix caching across recursive calls. Each sub-call is a cold start, so cache hit rates you rely on with a single long prompt evaporate.
- Runtime is unpredictable. Simple factual retrieval might finish in seconds. A multi-doc comparison can run minutes. You cannot give an SLA up front.
- Cost varies widely per run. Median runs are cheaper than vanilla long-context calls, but outliers get much more expensive because the root model decides how many sub-calls to make. Adversarial prompts can trigger runaway recursion; the paper caps depth, but spend inside a depth budget is still a function of model behavior.
- Debugging is worse. When a final answer is wrong, you have to inspect a REPL trace plus whatever sub-call summaries fed back up. It is closer to debugging a program than a prompt.
Most of these are engineering problems: parallelize sibling sub-calls, add prefix caching across recursive shards, put a hard budget on total tokens. The authors explicitly leave these optimizations to future work.
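The first of those fixes is small in code terms. A sketch of fanning sibling sub-calls out over a thread pool, with `sub_call` standing in for an I/O-bound `rlm()` request:

```python
from concurrent.futures import ThreadPoolExecutor

def map_subcalls_parallel(chunks: list[str], sub_call,
                          max_workers: int = 8) -> list[str]:
    """Run sibling sub-calls concurrently instead of awaiting each in order.

    Sub-calls are I/O-bound API requests, so threads suffice; pool.map
    preserves result order, keeping downstream aggregation deterministic.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(sub_call, chunks))
```

With this shape a 100-chunk scan costs roughly chunk-latency times 100 divided by the worker count, instead of the fully serial version the paper describes.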
What this means for practitioners
If you work on retrieval pipelines, agent memory, or long-document QA, RLMs are worth a weekend. A few takes:
- RLMs are RAG with the roles flipped. RAG picks chunks before the model sees them. RLMs let the model pick its own chunks at inference time. The same idea keeps showing up across recent long-context research; agentic memory is an adjacent line of work.
- They complement compression work, not replace it. Methods like TriAttention’s KV cache compression or TurboQuant shrink the window you do send. RLMs sidestep the window by only sending the parts the question actually needs.
- Cheap models get punchy. The headline number is GPT-5-mini in an RLM wrapper beating GPT-5 at 132k tokens. If the orchestration is cheap enough, you can run a smaller model more often rather than paying premium for a larger context.
- It will not fix short-context tasks. If your prompt fits comfortably in 32k tokens and the model already handles it, adding a REPL layer only adds latency.
The harder question is whether post-training a model specifically for RLM use becomes a standard step. The RLM-Qwen3-8B result suggests a native RLM model handles the REPL protocol better than a general model asked to be clever about it. Expect more of that.
FAQ
What is a recursive language model in one sentence?
An inference setup where the LLM treats a long prompt as a Python variable and writes code to grep, slice, and recursively call itself over relevant pieces, instead of reading the whole thing in one window.
Who wrote the RLM paper?
Alex L. Zhang, Tim Kraska, and Omar Khattab. The arXiv ID is 2512.24601, with initial submission December 31, 2025 and a revised version January 28, 2026.
How is RLM different from RAG?
RAG picks which chunks the model sees before inference. RLM lets the model pick chunks during inference by running code against the raw prompt. RAG is still faster and cheaper for common cases; RLM wins when the right chunk is context-dependent.
Does RLM need a special model?
No. The paper demonstrates RLM with vanilla GPT-5-mini and GPT-5. The authors also post-train RLM-Qwen3-8B for a native version, which does better on the same tasks than a general 8B model asked to operate the REPL.
What are the main limitations?
Recursive calls are blocking, prefix caching is lost between calls, and runtime can stretch into minutes. Cost control is weaker than a single API call because the root model decides how many sub-calls to make.
Where can I try it?
The reference implementation is at github.com/alexzhang13/rlm. It supports multiple sandbox backends and drops in as a rlm.completion(...) replacement for the standard OpenAI-style call.
Bottom line
RLMs are a rare case of an inference-time trick that does not need bigger models, longer windows, or new attention math. The move is conceptual: stop pretending the model reads million-token prompts, and let it script against them instead. The numbers say a cheap model with a REPL outruns an expensive model with a giant window on the tasks those windows were built for. Whether your stack can afford minute-long synchronous calls is a different question, but the research direction is coming whether you opt in or not.