Concepts¶
This page gives a high-level overview of the core abstractions in pretrain-experiments.
Frameworks¶
A framework is a training backend that knows how to load a checkpoint, run training with torchrun, and save results. Each framework is registered via a @register_framework decorator and selected in the config.
| Framework | Config value | Models |
|---|---|---|
| OLMo-2 | | OLMo-2 family |
| OLMo-3 (OLMo-Core) | | OLMo-3 family |
| HuggingFace | | Generic HuggingFace models |
Each framework requires its own modified fork with data insertion support (see Installation).
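As a rough illustration of decorator-based registration, the sketch below keeps a name-to-class registry and looks a framework up by its config value. It is a minimal sketch, not the project's actual code: the decorator signature (taking a name string), the `_FRAMEWORKS` dict, `get_framework`, and the `"example"` name are all assumptions made for this example.

```python
from typing import Callable, Dict

# Hypothetical registry; the real project may store frameworks differently.
_FRAMEWORKS: Dict[str, type] = {}


def register_framework(name: str) -> Callable[[type], type]:
    """Class decorator that records a framework class under a config name."""
    def decorator(cls: type) -> type:
        if name in _FRAMEWORKS:
            raise ValueError(f"framework {name!r} is already registered")
        _FRAMEWORKS[name] = cls
        return cls
    return decorator


def get_framework(name: str) -> type:
    """Return the framework class that the config selects by name."""
    try:
        return _FRAMEWORKS[name]
    except KeyError:
        known = sorted(_FRAMEWORKS)
        raise KeyError(f"unknown framework {name!r}; known: {known}") from None


@register_framework("example")  # placeholder name, not a real config value
class ExampleFramework:
    def train(self, num_steps: int) -> str:
        return f"would train for {num_steps} steps"
```

With a registry like this, selecting a backend reduces to a single lookup such as `get_framework(config_value)()`, and an unknown config value fails early with a list of the known names.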
Checkpoints¶
A checkpoint is a snapshot of model weights at a particular training step. The Checkpoint abstraction provides a uniform interface across frameworks:
- `get_step()`: returns the training step number
- `to_hf()`: converts to HuggingFace format (for evaluation)
- `as_hf_temporary()`: context manager that provides a temporary HuggingFace conversion
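One way to picture this interface is as an abstract base class. The sketch below uses the three method names from the list above; everything else is an assumption for illustration, including the `output_dir` parameter of `to_hf()`, the temporary-directory cleanup, and the `DummyCheckpoint` subclass.

```python
import shutil
import tempfile
from abc import ABC, abstractmethod
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


class Checkpoint(ABC):
    """Hypothetical base class; real signatures may differ."""

    @abstractmethod
    def get_step(self) -> int:
        """Return the training step number."""

    @abstractmethod
    def to_hf(self, output_dir: Path) -> Path:
        """Write a HuggingFace-format copy into output_dir and return its path."""

    @contextmanager
    def as_hf_temporary(self) -> Iterator[Path]:
        """Yield a temporary HuggingFace conversion, deleted on exit."""
        tmp = Path(tempfile.mkdtemp(prefix="hf_ckpt_"))
        try:
            yield self.to_hf(tmp)
        finally:
            shutil.rmtree(tmp, ignore_errors=True)


class DummyCheckpoint(Checkpoint):
    """Stand-in implementation that writes an empty config file."""

    def __init__(self, step: int) -> None:
        self.step = step

    def get_step(self) -> int:
        return self.step

    def to_hf(self, output_dir: Path) -> Path:
        (output_dir / "config.json").write_text("{}")
        return output_dir
```

Under these assumptions, evaluation code can use `with checkpoint.as_hf_temporary() as hf_path:` and never has to clean up the converted files itself.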
Checkpoint naming conventions differ by framework:
- OLMo-2: `step<N>-unsharded`
- OLMo-3: `step<N>`
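Both naming patterns carry the step number, so it can be recovered from a directory name. The helper below is a hypothetical sketch (`parse_step` is not a function from the project) that accepts either convention:

```python
import re

# Matches "step<N>" (OLMo-3 style) and "step<N>-unsharded" (OLMo-2 style).
_STEP_PATTERN = re.compile(r"step(\d+)(-unsharded)?")


def parse_step(checkpoint_name: str) -> int:
    """Extract the training step from a checkpoint directory name."""
    match = _STEP_PATTERN.fullmatch(checkpoint_name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {checkpoint_name!r}")
    return int(match.group(1))
```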
The training loop¶
An experiment follows this flow:
1. Load: download or locate the initial checkpoint
2. Insert: build the insertion dictionary (texts/tokens to inject into training data)
3. Train: run `torchrun` via subprocess for the configured number of steps
4. Evaluate: run evaluation scripts on the resulting checkpoint
5. Repeat: if training is split into segments (via `checkpoint_interval`), repeat from step 2
Training failures trigger automatic retries with exponential backoff (up to 10 attempts).
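The retry behavior can be sketched as a small wrapper around `subprocess.run`. Only the 10-attempt limit and the use of exponential backoff come from the description above; the base delay, the doubling factor, the function name `run_with_retries`, and the injectable `sleep` parameter are assumptions for this example.

```python
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_with_retries(
    command: Sequence[str],
    max_attempts: int = 10,
    base_delay: float = 1.0,  # assumed value; the real delay is not documented here
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run command until it exits with code 0; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Delay doubles after every failure: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(
        f"command failed after {max_attempts} attempts: {list(command)}"
    )


# A trivial command stands in for the real torchrun invocation.
DEMO_COMMAND = [sys.executable, "-c", "print('pretend training')"]
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting; a caller would normally leave it at the default.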
Data insertion¶
Insertions modify the training data stream by splicing in custom token sequences. The pipeline works in three stages:
1. InsertionBuilder reads the config and builds an `insert_dict`, a mapping from global token positions to token sequences
2. The framework wraps its memmap dataset to inject the tokens at the specified positions during training
3. Positions are chosen randomly (default), within a range, or at explicit positions
Each insertion can be repeated multiple times to increase exposure. Insertions never overlap with each other.
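To make the `insert_dict` idea concrete, the sketch below picks random, non-overlapping start positions and then applies the mapping to a token list. It is an illustration under stated assumptions, not the project's implementation: the function names, the rejection-sampling strategy, and the choice to overwrite tokens in place (rather than lengthen the stream) are all hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_insert_dict(
    sequences: Sequence[Sequence[int]],
    total_tokens: int,
    repetitions: int = 1,
    seed: int = 0,
    max_tries: int = 10_000,
) -> Dict[int, List[int]]:
    """Map random global start positions to token sequences, without overlap."""
    rng = random.Random(seed)
    insert_dict: Dict[int, List[int]] = {}
    occupied: List[Tuple[int, int]] = []  # half-open (start, end) spans
    for seq in sequences:
        if not seq:
            raise ValueError("cannot insert an empty sequence")
        for _ in range(repetitions):
            for _ in range(max_tries):
                start = rng.randrange(0, total_tokens - len(seq) + 1)
                end = start + len(seq)
                # Accept only if the span is disjoint from all earlier spans.
                if all(end <= s or start >= e for s, e in occupied):
                    occupied.append((start, end))
                    insert_dict[start] = list(seq)
                    break
            else:
                raise RuntimeError("could not place insertion without overlap")
    return insert_dict


def apply_insertions(
    tokens: Sequence[int], insert_dict: Dict[int, List[int]]
) -> List[int]:
    """Return a copy of tokens with each sequence written at its position."""
    out = list(tokens)
    for start, seq in insert_dict.items():
        out[start:start + len(seq)] = seq  # assumed: overwrite, length unchanged
    return out
```

For example, `build_insert_dict([[1, 2, 3]], total_tokens=1000, repetitions=5)` would yield five disjoint placements of the same three-token sequence.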
For full details on insertion types and modes, see Data Insertion.
Evaluation¶
After each training segment, evaluation scripts run on the resulting checkpoint. Each script receives a HuggingFace checkpoint path and writes metrics to a YAML file. Results are automatically logged to Weights & Biases.
Built-in scripts cover benchmarks (via OLMES), perplexity, fictional knowledge, verbatim memorization, and more. You can also write custom scripts.
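A custom script might look roughly like the sketch below: it takes a HuggingFace checkpoint path, computes metrics, and writes them to a YAML file. The argument names (`--checkpoint-path`, `--output`), the placeholder metric, and the hand-written flat YAML output are assumptions for this example; the actual script interface is described on the Evaluation page.

```python
import argparse
from pathlib import Path
from typing import Dict, List, Optional


def evaluate(checkpoint_path: Path) -> Dict[str, float]:
    """Placeholder metric: the number of files in the checkpoint directory."""
    num_files = sum(1 for p in checkpoint_path.iterdir() if p.is_file())
    return {"num_checkpoint_files": float(num_files)}


def write_metrics_yaml(metrics: Dict[str, float], output_path: Path) -> None:
    """Write a flat name-to-number mapping as simple `key: value` YAML lines."""
    lines = [f"{name}: {value}" for name, value in sorted(metrics.items())]
    output_path.write_text("\n".join(lines) + "\n")


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Example evaluation script")
    # Hypothetical argument names; check the Evaluation page for the real ones.
    parser.add_argument("--checkpoint-path", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args(argv)
    write_metrics_yaml(evaluate(args.checkpoint_path), args.output)
```

A real script would replace `evaluate` with model loading and scoring, and would call `main()` from its entry point.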
For full details, see Evaluation.