Many Pretraining Experiments for the Cost of One
¹University of Tübingen & Tübingen AI Center, ²University of Vienna
ICLR 2026
We propose to conduct multiple independent pretraining experiments in a single training run. Top: Previous research performs one experiment per training run. Bottom: We conduct multiple experiments simultaneously, answering many research questions while training only once.
Recent work has demonstrated that controlled pretraining experiments are a powerful tool for studying the relationship between training data and large language model (LLM) behavior. However, the computational cost of pretraining imposes a significant constraint. To overcome it, we propose a new approach in which multiple independent experiments are conducted simultaneously during a single training run. We validate our approach by performing ten experiments while training on 210B tokens, with models of up to 2.7B parameters. Although the models are trained only once, we replicate the results of multiple previous works on data contamination, poisoning, and memorization. We also conduct novel investigations into knowledge acquisition, mathematical reasoning, and watermarking; for example, we dynamically update the training data until a model acquires a particular piece of knowledge. Remarkably, the experiments have minimal influence on the model's training dynamics and overall performance. However, interactions between experiments could act as a confounder in our approach. We therefore propose Continual Pretraining Dependence Testing (CPDT), a novel technique that uses continual pretraining to test for such interactions, and find them to be negligible in our setup. Overall, our results suggest that performing multiple pretraining experiments within a single training run can enable rigorous scientific experimentation with large models on a limited compute budget.
Together, the experiments modify 3.7B tokens or 1.8% of the pretraining data.
| Experiment | Abbr. | Modified Tokens | Replication | Category |
|---|---|---|---|---|
| Knowledge Acquisition | KA | 26M | — | Learning & Generalization |
| Mathematical Reasoning | MR | 180M | — | Learning & Generalization |
| Benchmark Contamination | BC | 106M | Yes | Learning & Generalization |
| Memorization Patterns | MemP | 246M | Yes | Memorization & Privacy |
| Verbatim Memorization | MemV | 1.1B | Yes | Memorization & Privacy |
| Gaussian Watermarks | GW | 209.7M | — | Memorization & Privacy |
| Pretraining Poisoning | PP | 235M | Yes | Memorization & Privacy |
| Forgetting Curves | FC | 19M | Yes | Forgetting & Unlearning |
| MUSE-News | MUSE | 152M | — | Forgetting & Unlearning |
| IID Replacements | IID | 1.5B | — | Forgetting & Unlearning |
(a) Knowledge Acquisition. A control algorithm successfully maintains the value of the knowledge probe close to the target.
(b) Mathematical Reasoning. The model exhibits length generalization to more complex mathematical reasoning problems.
(c) Gaussian Watermarks. Gaussian Pretraining Watermarks are detectable over the course of training.
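The knowledge-acquisition controller in panel (a) can be pictured as a simple feedback loop: whenever the knowledge probe drifts below the target, more copies of the fact are injected into the upcoming training data, and vice versa. The sketch below is a minimal illustration with a proportional update rule; the function name, gain, and clamping bounds are our own assumptions, not the paper's actual algorithm.

```python
def update_copies(n_copies, probe_value, target, gain=2.0, max_copies=50.0):
    """Proportional controller (hypothetical parameters): raise the number of
    injected fact copies when the knowledge probe is below the target value,
    lower it when the probe overshoots, clamped to [0, max_copies]."""
    error = target - probe_value
    return max(0.0, min(max_copies, n_copies + gain * error))

# Usage inside a (mock) training loop: re-decide the injection rate each time
# the probe is evaluated, so the probe value is steered toward the target.
copies, target = 10.0, 0.6
for probe_value in [0.1, 0.3, 0.55, 0.62, 0.60]:  # mock probe readings
    copies = update_copies(copies, probe_value, target)
```

Any update rule that is monotone in the probe error would serve the same purpose; the proportional form is just the simplest choice to write down.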
All replication experiments were successful, faithfully reproducing the conceptual results from prior studies. Figures from three of the five replicated experiments are shown below.
(a) Benchmark Contamination. Minor benchmark contamination is almost completely forgotten during training.
(b) Memorization Patterns. Rare tokens provide the most powerful canaries, replicating prior findings.
(c) Pretraining Poisoning. The poisoned model allows for prompt extraction with the trigger string.
We propose Continual Pretraining Dependence Testing (CPDT), a method for identifying dependencies between experiments before pretraining. CPDT measures how the outcome of one experiment changes when training on data from another experiment, producing an n × n dependence matrix.
(a) Benchmarks. Positive off-diagonal entries indicate significant dependencies between language modeling benchmarks.
(b) Experiments. In contrast, our controlled experiments show no evidence of such dependencies.
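The dependence matrix described above can be sketched as follows: for each experiment j, continually pretrain the base model on experiment j's data, then record how every experiment i's metric shifts relative to its baseline. The interfaces below (`continual_pretrain`, `evaluate`, the `Experiment` container, and the mock model as a set of seen data) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    data: str  # stands in for the experiment's modified pretraining tokens

def cpdt_matrix(base_model, experiments, continual_pretrain, evaluate):
    """Entry D[i][j] is the change in experiment i's outcome after continually
    pretraining the base model on experiment j's data (hypothetical interfaces).
    Nonzero off-diagonal entries indicate a dependence between experiments."""
    baseline = [evaluate(base_model, e) for e in experiments]
    n = len(experiments)
    D = [[0.0] * n for _ in range(n)]
    for j, e_j in enumerate(experiments):
        model_j = continual_pretrain(base_model, e_j.data)
        for i, e_i in enumerate(experiments):
            D[i][j] = evaluate(model_j, e_i) - baseline[i]
    return D

# Mock setup: a "model" is the frozenset of data it has seen, and each metric
# responds only to its own experiment's data, so off-diagonal entries are zero.
exps = [Experiment("KA", "facts"), Experiment("GW", "watermarks")]
pretrain = lambda model, data: model | {data}
metric = lambda model, exp: 1.0 if exp.data in model else 0.0
D = cpdt_matrix(frozenset(), exps, pretrain, metric)
```

In this toy case the matrix comes out diagonal, mirroring the finding that the controlled experiments show no cross-dependencies.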
The training dynamics of the model with experiments are remarkably similar to the baseline model. The validation loss on 200M held-out tokens closely follows the original OLMo-2-1B training run.
Validation loss comparison between the model with experiments and the baseline. The trajectories are nearly identical.