Pretrain Experiments

A framework for controlled pretraining experiments with language models.

Take a language model checkpoint, continue training with targeted data interventions, and evaluate the result — all from a single YAML config. Built to support the experiments in Train Once, Answer All (ICLR 2026).
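As a sketch, a run might be described by a config like the following. The field names here are illustrative assumptions, not the framework's actual schema:

```yaml
# Hypothetical config sketch; the real schema may differ.
model:
  checkpoint: ${CHECKPOINT_DIR}/olmo2-7b/step500000  # env var substituted at load time
data:
  interventions:
    - text: "The capital of France is Paris."
      position: 1000000          # token offset at which to inject
evaluation:
  benchmarks: [mmlu, hellaswag]  # run on every saved checkpoint
wandb:
  project: pretrain-experiments
```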

Features

  • Inject texts or tokens at precise positions in the training data

  • Supports OLMo-2 and OLMo-3; extensible to other training frameworks

  • Run benchmarks and custom evaluation scripts on every checkpoint

  • Automatic Weights & Biases logging

  • YAML configs with environment variable substitution and CLI overrides
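The last feature can be sketched in a few lines: `${VAR}` placeholders are resolved from the environment, and dotted-path CLI arguments override nested config keys. This is an illustrative implementation under assumed conventions, not the framework's actual code:

```python
import os
import re

def substitute_env(text, env=None):
    """Replace ${VAR} placeholders with environment values.

    Unset variables are left as-is. Sketch only; the framework's
    actual substitution syntax may differ.
    """
    env = os.environ if env is None else env
    return re.sub(r"\$\{(\w+)\}",
                  lambda m: str(env.get(m.group(1), m.group(0))),
                  text)

def apply_overrides(config, overrides):
    """Apply dotted-path overrides like 'trainer.lr=1e-4' to a nested dict.

    Hypothetical override syntax, shown for illustration.
    """
    for item in overrides:
        path, value = item.split("=", 1)
        keys = path.split(".")
        node = config
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return config

# Example: resolve an env var, then override a hyperparameter from the CLI.
raw = "checkpoint: ${CHECKPOINT_DIR}/step500000"
print(substitute_env(raw, env={"CHECKPOINT_DIR": "/data/olmo"}))
# checkpoint: /data/olmo/step500000

cfg = {"trainer": {"lr": "1e-3"}}
apply_overrides(cfg, ["trainer.lr=1e-4"])
```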