Many Pretraining Experiments for the Cost of One
¹University of Tübingen & Tübingen AI Center, ²University of Vienna
ICLR 2026
We propose to conduct multiple independent pretraining experiments in a single training run. Top: Previous research performs one experiment per training run. Bottom: We conduct multiple experiments simultaneously, answering many research questions while training only once.
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior. However, the computational cost of pretraining imposes a significant constraint. To overcome it, we propose a new approach in which multiple independent experiments are conducted simultaneously during a single training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although the models are trained only once, we replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking; for example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the experiments have minimal influence on the model's training dynamics and overall performance. However, interactions between experiments could act as a confounder in our approach. We therefore propose Continual Pretraining Dependence Testing (CPDT), a novel technique that uses continual pretraining to test for such interactions, and find them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a limited compute budget.
Together, the experiments modify 3.7B tokens or 1.8% of the pretraining data.
| Experiment | Abbr. | Modified Tokens | Replication | Category |
|---|---|---|---|---|
| Knowledge Acquisition | KA | 26M | — | Learning & Generalization |
| Mathematical Reasoning | MR | 180M | — | Learning & Generalization |
| Benchmark Contamination | BC | 106M | Yes | Learning & Generalization |
| Memorization Patterns | MemP | 246M | Yes | Memorization & Privacy |
| Verbatim Memorization | MemV | 1.1B | Yes | Memorization & Privacy |
| Gaussian Watermarks | GW | 209.7M | — | Memorization & Privacy |
| Pretraining Poisoning | PP | 235M | Yes | Memorization & Privacy |
| Forgetting Curves | FC | 19M | Yes | Forgetting & Unlearning |
| MUSE-News | MUSE | 152M | — | Forgetting & Unlearning |
| IID Replacements | IID | 1.5B | — | Forgetting & Unlearning |
(a) Knowledge Acquisition. A control algorithm successfully maintains the value of the knowledge probe close to the target.
(b) Mathematical Reasoning. The model exhibits length generalization to more complex mathematical reasoning problems.
(c) Gaussian Watermarks. Gaussian Pretraining Watermarks are detectable over the course of training.
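The knowledge-acquisition controller in panel (a) can be pictured as a simple feedback loop: whenever the knowledge probe drifts below the target, more copies of the fact are injected into the upcoming training data, and vice versa. The sketch below is a minimal illustration with a proportional update rule; the function name, gain, and clamping bounds are our own assumptions, not the paper's actual algorithm.

```python
def update_copies(n_copies, probe_value, target, gain=2.0, max_copies=50.0):
    """Proportional controller (hypothetical parameters): raise the number of
    injected fact copies when the knowledge probe is below the target value,
    lower it when the probe overshoots, clamped to [0, max_copies]."""
    error = target - probe_value
    return max(0.0, min(max_copies, n_copies + gain * error))

# Usage inside a (mock) training loop: re-decide the injection rate each time
# the probe is evaluated, so the probe value is steered toward the target.
copies, target = 10.0, 0.6
for probe_value in [0.1, 0.3, 0.55, 0.62, 0.60]:  # mock probe readings
    copies = update_copies(copies, probe_value, target)
```

Any update rule that is monotone in the probe error would serve the same purpose; the proportional form is just the simplest choice to write down.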
All replication experiments were successful, faithfully reproducing the conceptual results from prior studies. Figures from three of the five replicated experiments are shown below.
(a) Benchmark Contamination. Minor benchmark contamination is almost completely forgotten during training.
(b) Memorization Patterns. Rare tokens provide the most powerful canaries, replicating prior findings.
(c) Pretraining Poisoning. The poisoned model allows for prompt extraction with the trigger string.
We propose Continual Pretraining Dependence Testing (CPDT), a method for identifying dependencies between experiments before pretraining. CPDT measures how the outcome of one experiment changes when training on data from another experiment, producing an n × n dependence matrix.
(a) Benchmarks. Positive off-diagonal entries indicate significant dependencies between language modeling benchmarks.
(b) Experiments. In contrast, our controlled experiments show no evidence of such dependencies.
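The dependence matrix described above can be sketched as follows: for each experiment j, continually pretrain the base model on experiment j's data, then record how every experiment i's metric shifts relative to its baseline. The interfaces below (`continual_pretrain`, `evaluate`, the `Experiment` container, and the mock model as a set of seen data) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    data: str  # stands in for the experiment's modified pretraining tokens

def cpdt_matrix(base_model, experiments, continual_pretrain, evaluate):
    """Entry D[i][j] is the change in experiment i's outcome after continually
    pretraining the base model on experiment j's data (hypothetical interfaces).
    Nonzero off-diagonal entries indicate a dependence between experiments."""
    baseline = [evaluate(base_model, e) for e in experiments]
    n = len(experiments)
    D = [[0.0] * n for _ in range(n)]
    for j, e_j in enumerate(experiments):
        model_j = continual_pretrain(base_model, e_j.data)
        for i, e_i in enumerate(experiments):
            D[i][j] = evaluate(model_j, e_i) - baseline[i]
    return D

# Mock setup: a "model" is the frozenset of data it has seen, and each metric
# responds only to its own experiment's data, so off-diagonal entries are zero.
exps = [Experiment("KA", "facts"), Experiment("GW", "watermarks")]
pretrain = lambda model, data: model | {data}
metric = lambda model, exp: 1.0 if exp.data in model else 0.0
D = cpdt_matrix(frozenset(), exps, pretrain, metric)
```

In this toy case the matrix comes out diagonal, mirroring the finding that the controlled experiments show no cross-dependencies.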
The training dynamics of the model with experiments are remarkably similar to the baseline model. The validation loss on 200M held-out tokens closely follows the original OLMo-2-1B training run.
Validation loss comparison between the model with experiments and the baseline. The trajectories are nearly identical.