arxiv:2502.17416

Reasoning with Latent Thoughts: On the Power of Looped Transformers

Published on Feb 24, 2025
Abstract

Looped transformer models can match the performance of deeper non-looped models for reasoning tasks by iteratively applying shallow layers, enabling efficient reasoning with reduced parameters while maintaining competitive performance on language modeling tasks.

AI-generated summary

Large language models have shown remarkable reasoning abilities, and scaling laws suggest that a large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. First, we show that for many synthetic reasoning problems like addition, p-hop induction, and math problems, a k-layer transformer looped L times nearly matches the performance of a kL-layer non-looped model, and is significantly better than a k-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with k layers looped L times can be competitive with, if not better than, a kL-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate T steps of CoT with T loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts.
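The core idea -- a k-layer block applied L times with shared weights, giving effective depth kL while storing only k layers of parameters -- can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `layer` here is a stand-in affine-plus-tanh map rather than a real transformer block, and all names are hypothetical.

```python
import math

def layer(x, w):
    # Toy "layer": affine map followed by tanh (stand-in for a transformer block).
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def looped_forward(x, block_weights, loops):
    # Apply the same k-layer block `loops` times. The weights are reused on
    # every loop, so the effective depth is k * loops while the parameter
    # count stays at k layers' worth.
    for _ in range(loops):
        for w in block_weights:
            x = layer(x, w)
    return x

# k = 2 layers of 2x2 weights; looping 3 times gives effective depth 6.
block = [[[0.5, -0.2], [0.1, 0.3]],
         [[0.4, 0.0], [-0.3, 0.2]]]
out = looped_forward([1.0, -1.0], block, loops=3)
```

A non-looped kL-layer baseline would instead store 6 distinct weight matrices; the paper's claim is that on many reasoning tasks the shared-weight loop above recovers most of that deeper model's performance.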


Get this paper in your agent:

hf papers read 2502.17416
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
