Train Once, Answer All

Many Pretraining Experiments for the Cost of One

Sebastian Bordt1  ·  Martin Pawelczyk2

1University of Tübingen & Tübingen AI Center    2University of Vienna

ICLR 2026
Teaser figure showing the Train Once, Answer All paradigm

We propose to conduct multiple independent pretraining experiments in a single training run. Top: Previous research performs one experiment per training run. Bottom: We conduct multiple experiments simultaneously, answering many research questions while training only once.

Abstract

Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior. However, the computational cost of pretraining presents a significant constraint. To overcome this constraint, we propose a new approach where multiple experiments are conducted simultaneously during a single training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although models are trained only once, we can replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking. For example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the influence of the experiments on the model's training dynamics and overall performance is minimal. However, interactions between experiments may act as a confounder in our approach. We propose continual pretraining dependence testing (CPDT), a novel technique that uses continual pretraining experiments to test for such interactions, and find them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a compute budget.
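In practice, this amounts to splicing each experiment's modified documents into the shared pretraining corpus so that all interventions are trained on in the same run. The following sketch illustrates the idea under simplified assumptions; the Experiment class, the per-experiment injection rates, and build_training_stream are illustrative placeholders, not the actual training pipeline.

import random
from dataclasses import dataclass


@dataclass
class Experiment:
    """One controlled intervention: documents to splice into the base corpus."""
    name: str
    documents: list          # modified or injected documents
    rate: float = 0.01       # fraction of positions at which this experiment may substitute a document


def build_training_stream(base_corpus, experiments, seed=0):
    """Yield training documents, occasionally substituting experiment documents.

    Each experiment replaces a small fraction of base documents, so the overall
    data mixture is only slightly perturbed (roughly 1-2% in this setup).
    """
    rng = random.Random(seed)
    pools = {e.name: list(e.documents) for e in experiments}
    for doc in base_corpus:
        for exp in experiments:
            if pools[exp.name] and rng.random() < exp.rate:
                yield pools[exp.name].pop()   # emit an experiment document instead
                break
        else:
            yield doc                         # unmodified base document


# Toy usage: two experiments sharing a single training run.
base = (f"ordinary web document {i}" for i in range(1000))
exps = [
    Experiment("benchmark_contamination", [f"contaminated eval item {i}" for i in range(5)]),
    Experiment("canaries", [f"canary sequence {i}" for i in range(5)]),
]
stream = list(build_training_stream(base, exps))
print(len(stream))   # 1000 documents, a handful of them replaced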

Key Contributions

The Ten Experiments

Together, the experiments modify 3.7B tokens, or 1.8% of the pretraining data.

Experiment                 Abbr.   Modified Tokens   Replication   Category
Knowledge Acquisition      KA      26M               -             Learning & Generalization
Mathematical Reasoning     MR      180M              -             Learning & Generalization
Benchmark Contamination    BC      106M              Yes           Learning & Generalization
Memorization Patterns      MemP    246M              Yes           Memorization & Privacy
Verbatim Memorization      MemV    1.1B              Yes           Memorization & Privacy
Gaussian Watermarks        GW      209.7M            -             Memorization & Privacy
Pretraining Poisoning      PP      235M              Yes           Memorization & Privacy
Forgetting Curves          FC      19M               Yes           Forgetting & Unlearning
MUSE-News                  MUSE    152M              -             Forgetting & Unlearning
IID Replacements           IID     1.5B              -             Forgetting & Unlearning
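The totals can be checked directly from the per-experiment counts in the table:

# Modified tokens per experiment, in millions (values from the table above).
modified_tokens_M = {
    "KA": 26, "MR": 180, "BC": 106, "MemP": 246, "MemV": 1100,
    "GW": 209.7, "PP": 235, "FC": 19, "MUSE": 152, "IID": 1500,
}

total_B = sum(modified_tokens_M.values()) / 1000   # ~3.77B tokens in total
share = total_B / 210                              # of the 210B-token training run
print(f"{total_B:.2f}B modified tokens, {share:.1%} of the pretraining data")
# -> 3.77B modified tokens, 1.8% of the pretraining data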

Three Novel Pretraining Experiments

Knowledge Acquisition experiment results

(a) Knowledge Acquisition. A control algorithm successfully maintains the value of the knowledge probe close to the target.
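One way to picture the control loop in (a): after each probe evaluation, the rate at which fact-bearing documents are injected into the upcoming training data is adjusted toward the target. The sketch below is a hypothetical proportional controller; update_injection_rate, the gain, and the toy learning dynamics are illustrative assumptions, not the exact algorithm used in the paper.

def update_injection_rate(rate, probe_value, target, gain=0.5, max_rate=0.05):
    """Proportional control of the injection rate of fact-bearing documents:
    inject more when the knowledge probe is below target, fewer when above."""
    error = target - probe_value
    return min(max(rate * (1.0 + gain * error), 0.0), max_rate)


# Toy closed loop: the probe value responds (with lag) to the injection rate.
rate, probe, target = 0.01, 0.0, 0.6
for step in range(10):
    probe = 0.9 * probe + 2.0 * rate   # stand-in for the model's learning response
    rate = update_injection_rate(rate, probe, target)
    print(f"step {step}: probe={probe:.2f}, rate={rate:.4f}")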

Mathematical Reasoning experiment results

(b) Mathematical Reasoning. The model exhibits length generalization to more complex mathematical reasoning problems.

Gaussian Watermarks experiment results

(c) Gaussian Watermarks. Gaussian Pretraining Watermarks are detectable over the course of training.
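Detecting such a watermark at a checkpoint typically comes down to a hypothesis test. The snippet below is a generic sketch, assuming each watermarked document yields a statistic that is approximately standard normal under the no-watermark null; the actual watermark construction and test statistic in the paper may differ.

import math


def watermark_z_score(statistics):
    """Aggregate per-document watermark statistics into a single z-score.

    Assumes each statistic is roughly N(0, 1) for a model that never saw the
    watermark, so their sum divided by sqrt(n) is standard normal under the null.
    """
    n = len(statistics)
    z = sum(statistics) / math.sqrt(n)
    p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided p-value
    return z, p


# Statistics measured at one training checkpoint; a small p-value means the
# watermark is detectable at that point in training.
z, p = watermark_z_score([0.8, 1.3, 0.4, 1.1, 0.9])
print(f"z = {z:.2f}, p = {p:.3f}")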

Five Experiments from Previous Work

All replication experiments were successful, faithfully reproducing the conceptual results from prior studies. Figures from three of the five replicated experiments are shown below.

Benchmark Contamination experiment results

(a) Benchmark Contamination. Minor benchmark contamination is almost completely forgotten during training.

Memorization Patterns experiment results

(b) Memorization Patterns. Rare tokens provide the most powerful canaries, replicating prior findings.
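For context, canary memorization of this kind is commonly quantified with an exposure-style metric (Carlini et al., 2019): the inserted canary's loss is ranked against held-out reference canaries the model never saw. The sketch below follows that standard recipe; the exact metric used in the replicated experiment may differ.

import math


def exposure(canary_loss, reference_losses):
    """Exposure of an inserted canary: how unusually well the model scores it
    relative to held-out reference canaries. Higher means stronger memorization."""
    n = len(reference_losses) + 1                        # candidates including the canary
    rank = 1 + sum(ref < canary_loss for ref in reference_losses)
    return math.log2(n) - math.log2(rank)


# A rare-token canary typically ends up with a much lower loss than its
# references after training, i.e. rank 1 and maximal exposure.
print(exposure(canary_loss=2.1, reference_losses=[5.3, 4.9, 6.0, 5.7]))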

Pretraining Poisoning experiment results

(c) Pretraining Poisoning. The poisoned model allows for prompt extraction with the trigger string.

Are the Experiments Independent?

We propose Continual Pretraining Dependence Testing (CPDT), a method for identifying dependencies between experiments before pretraining. CPDT measures how the outcome of one experiment changes when a model is continually pretrained on data from another experiment, producing an n × n dependence matrix.
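A minimal sketch of how such a dependence matrix could be assembled; continual_pretrain and evaluate_outcome are placeholders for the actual training and evaluation code, not a specific API.

def cpdt_matrix(base_model, experiments, continual_pretrain, evaluate_outcome):
    """Continual Pretraining Dependence Testing (sketch).

    Entry (i, j) is the change in experiment i's outcome after continually
    pretraining the base model on experiment j's data, relative to the base
    model alone. Large off-diagonal entries indicate interactions.
    """
    n = len(experiments)
    baseline = [evaluate_outcome(base_model, exp) for exp in experiments]
    matrix = [[0.0] * n for _ in range(n)]
    for j, exp_j in enumerate(experiments):
        model_j = continual_pretrain(base_model, exp_j)
        for i, exp_i in enumerate(experiments):
            matrix[i][j] = evaluate_outcome(model_j, exp_i) - baseline[i]
    return matrix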

Benchmark dependence matrix

(a) Benchmarks. Positive off-diagonal entries indicate significant dependencies between language modeling benchmarks.

Experiment dependence matrix

(b) Experiments. In contrast, our controlled experiments show no evidence of such dependencies.

Minimal Impact on Training

The training dynamics of the model with experiments are remarkably similar to the baseline model. The validation loss on 200M held-out tokens closely follows the original OLMo-2-1B training run.

Validation loss comparison

Validation loss comparison between the model with experiments and the baseline. The trajectories are nearly identical.

Citation

@inproceedings{bordt2025train,
  title     = {Train Once, Answer All: Many Pretraining Experiments for the Cost of One},
  author    = {Bordt, Sebastian and Pawelczyk, Martin},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2509.23383}
}