arxiv:2512.24601

Recursive Language Models

Published on Dec 31, 2025 · Submitted by Rajkumar rawal on Jan 6

Abstract

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.

Community

Some of the key observations they report:
-- LLMs interact with their own prompts as objects.

-- In their approach, a prompt isn’t “run” directly; instead, it is stored as a variable in an external Python REPL. The language model writes code to inspect, slice, and decompose that long string, observes the execution outputs, and constructs sub-tasks in which it recursively invokes an LLM on just the relevant snippets, stitching the results together when the recursive process ends (a minimal sketch of this loop follows the list below). This lets it solve 10M+ token tasks with far less “context rot” and often lower cost than summarization or RAG, turning long-context scaling into an inference-time algorithm rather than just a bigger context window.

-- The ability to search the prompt is what enables handling long-context inputs; recursive sub-calls are what handle information-dense inputs.

-- The inference cost of RLMs remains comparable to a base model call, but it is high-variance because the model can keep making sub-calls or iterating if it cannot solve the problem initially.

-- The key insight is that long prompts should not be fed into the LLM directly, but should instead be treated as part of the environment that the LLM can search, read and interact with as needed for the task.
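
To make the mechanism concrete, here is a minimal Python sketch of that loop. It is an illustrative simplification, not the authors' implementation: `call_llm` is a placeholder for whatever model client you supply, and the `result` / `FINAL:` convention for how the model reports back is an assumption made for this example.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM client; supply your own implementation."""
    raise NotImplementedError("plug in your model client here")


def recursive_llm(query: str, long_prompt: str, max_steps: int = 8) -> str:
    """Answer `query` about `long_prompt` without feeding the full prompt to the model."""
    transcript = []
    for _ in range(max_steps):
        # The model sees only the query, metadata about the prompt, and prior
        # REPL observations -- never the raw long prompt itself.
        action = call_llm(
            f"Query: {query}\n"
            f"A variable `prompt` holds the full input ({len(long_prompt)} chars).\n"
            "Prior REPL steps:\n" + "\n".join(transcript) + "\n"
            "Reply with Python that sets `result`, or with `FINAL: <answer>`."
        )
        if action.startswith("FINAL:"):
            return action[len("FINAL:"):].strip()
        # Run the model-written code in a scope where the long prompt is just a
        # string object it can search, slice, or hand to recursive sub-calls.
        scope = {"prompt": long_prompt, "call_llm": call_llm, "recursive_llm": recursive_llm}
        try:
            exec(action, scope)  # sandbox this properly in real use
            observation = str(scope.get("result", ""))[:2000]  # cap what re-enters context
        except Exception as exc:
            observation = f"error: {exc!r}"
        transcript.append(f">>> {action}\n{observation}")
    # Step budget exhausted: answer from whatever was observed so far.
    return call_llm(f"Query: {query}\nObservations:\n" + "\n".join(transcript))
```

The property that matters is that `long_prompt` never enters the model's context wholesale; only model-chosen slices and the truncated outputs of sub-calls do, which is why inputs can grow far beyond the native context window.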

arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/recursive-language-models-6610-16b3d94b

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

Not bad. Almost there. If you make it a graph of symbolic concepts, though, you can create something much larger.

https://signal-zero.ai
