arxiv:2604.01411

Test-Time Scaling Makes Overtraining Compute-Optimal

Published on Apr 1 · Submitted by Nicholas Roberts on Apr 6

Abstract

Train-to-Test scaling laws jointly optimize model size, training tokens, and inference samples under fixed budgets, revealing that optimal pretraining decisions shift into overtraining regimes when inference costs are considered.

AI-generated summary

Modern LLMs scale at test time, e.g. via repeated sampling, where inference cost grows with both model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test (T^2) scaling laws that jointly optimize model size, training tokens, and the number of inference samples under fixed end-to-end budgets. T^2 modernizes pretraining scaling laws with the pass@k modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from T^2 are robust across two distinct modeling approaches: measuring the joint scaling effect on task loss and modeling the impact on task accuracy. Across eight downstream tasks, we find that when inference cost is accounted for, optimal pretraining decisions shift radically into the overtraining regime, well outside the range of standard pretraining scaling suites. We validate these results by pretraining heavily overtrained models in the optimal region that T^2 scaling forecasts, confirming substantially stronger performance than models chosen by pretraining scaling alone. Finally, since frontier LLMs are post-trained, we show that our findings survive the post-training stage, making T^2 scaling meaningful in modern deployments.
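For intuition, here is a minimal sketch of how a fixed end-to-end budget couples pretraining and repeated-sampling inference, and why extra samples compete with a larger model for the same FLOPs. The 6ND pretraining cost, the 2N-FLOPs-per-token inference cost, the independent-sample pass@k formula, and the toy accuracy curve are all illustrative assumptions, not the paper's fitted T^2 laws.

```python
# Illustrative sketch only: the 6*N*D pretraining cost, 2*N FLOPs/token
# inference cost, independent-sample pass@k, and the toy accuracy curve below
# are standard approximations and stand-ins, not the fitted T^2 scaling laws.

def total_flops(n_params, train_tokens, k_samples, gen_tokens, n_queries):
    """End-to-end compute: pretraining plus serving n_queries with k samples each."""
    pretrain = 6 * n_params * train_tokens                        # ~6ND rule of thumb
    inference = 2 * n_params * k_samples * gen_tokens * n_queries
    return pretrain + inference

def pass_at_k_iid(p_single, k_samples):
    """Success probability of k independent samples, given per-sample accuracy p."""
    return 1.0 - (1.0 - p_single) ** k_samples

# Toy grid search under one fixed end-to-end budget: FLOPs spent on extra
# samples (k > 1) or a bigger model compete with FLOPs spent on more training
# tokens, which pushes the optimum toward smaller, longer-trained
# ("overtrained") models.
BUDGET = 1e22
best = None
for n_params in (3e8, 1e9, 3e9, 7e9):
    for tokens_per_param in (20, 100, 500, 2000):                 # 20 ~ Chinchilla-optimal
        for k in (1, 4, 16, 64):
            train_tokens = tokens_per_param * n_params
            cost = total_flops(n_params, train_tokens, k,
                               gen_tokens=1024, n_queries=1e6)
            if cost > BUDGET:
                continue
            # Hypothetical accuracy model, only to make the sketch runnable.
            p_single = 1.0 - 0.5 * (n_params * train_tokens) ** -0.05
            score = pass_at_k_iid(p_single, k)
            if best is None or score > best[0]:
                best = (score, n_params, tokens_per_param, k)

print("pass@k, params, tokens/param, k =", best)
```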

Community

That new LFM2.5-350M is super overtrained, and everyone was shocked by how far they pushed it. As it turns out, we have a scaling law for that: T² (Train-to-Test) scaling combines Chinchilla-style pretraining scaling with test-time scaling via repeated sampling, and finds that once you account for inference compute, radical overtraining becomes compute-optimal.
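For reference, repeated-sampling evaluations typically estimate pass@k with the standard unbiased estimator sketched below (draw n samples per problem, count c correct, estimate the chance that a random size-k subset contains at least one correct sample). Whether the paper uses exactly this estimator is an assumption here; it is shown only to make "test-time scaling via repeated sampling" concrete.

```python
# Standard unbiased pass@k estimator commonly used in repeated-sampling
# evaluation; shown for illustration, not taken verbatim from the paper.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled completions is correct) from n samples with c correct."""
    if n - c < k:
        return 1.0                    # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 samples per problem with 5 correct gives a pass@8 estimate of roughly 0.5.
print(pass_at_k(n=64, c=5, k=8))
```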


Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/test-time-scaling-makes-overtraining-compute-optimal-6527-8b212681
Covers the executive summary, detailed methodology, and practical applications.


Get this paper in your agent:

hf papers read 2604.01411
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0 · Datasets citing this paper: 0 · Spaces citing this paper: 0 · Collections including this paper: 0

Cite arxiv.org/abs/2604.01411 in a model, dataset, or Space README.md, or add this paper to a collection, to link it from this page.