Concepts

This page gives a high-level overview of the core abstractions in pretrain-experiments.

Frameworks

A framework is a training backend that knows how to load a checkpoint, run training with torchrun, and save results. Each framework is registered via a @register_framework decorator and selected in the config.

Framework	Config value	Models
OLMo-2	`olmo`	OLMo-2 family
OLMo-3 (OLMo-Core)	`olmo_core`	OLMo-3 family
HuggingFace	`huggingface`	Generic HuggingFace models

Each framework requires its own modified fork with data insertion support (see Installation).
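To illustrate how decorator-based registration and config-driven selection can fit together, here is a minimal sketch. The registry dictionary, the `get_framework` helper, the decorator's argument, and the placeholder class are all assumptions made for illustration; the actual `@register_framework` signature in pretrain-experiments may differ.

```python
from typing import Callable, Dict, Type

# Hypothetical registry: maps a config value such as "olmo_core" to a class.
_FRAMEWORKS: Dict[str, Type] = {}


def register_framework(name: str) -> Callable[[Type], Type]:
    """Class decorator that records a framework under its config value.

    The signature is assumed for this sketch, not taken from the library.
    """
    def decorator(cls: Type) -> Type:
        if name in _FRAMEWORKS:
            raise ValueError(f"framework {name!r} is already registered")
        _FRAMEWORKS[name] = cls
        return cls  # return the class unchanged so it can be used normally
    return decorator


def get_framework(name: str) -> Type:
    """Look up the framework class named in the config (hypothetical helper)."""
    try:
        return _FRAMEWORKS[name]
    except KeyError:
        known = ", ".join(sorted(_FRAMEWORKS))
        raise KeyError(f"unknown framework {name!r}; known: {known}") from None


@register_framework("olmo_core")
class OlmoCoreFramework:
    """Placeholder standing in for a real training backend."""
```

With a registry like this, adding a backend means defining one decorated class; the code that reads the config only needs the string value.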

Checkpoints

A checkpoint is a snapshot of model weights at a particular training step. The Checkpoint abstraction provides a uniform interface across frameworks:

get_step() — returns the training step number
to_hf() — converts to HuggingFace format (for evaluation)
as_hf_temporary() — context manager that provides a temporary HuggingFace conversion

Checkpoint naming conventions differ by framework:

OLMo-2: step<N>-unsharded
OLMo-3: step<N>
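The interface and naming conventions above could look roughly like the following sketch. Only the three method names come from this page; the constructor, the `output_dir` parameter, the `parse_step` helper, and the choice to derive the step from the directory name are assumptions.

```python
import re
import shutil
import tempfile
from abc import ABC, abstractmethod
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator

# Matches both conventions: "step<N>-unsharded" (OLMo-2) and "step<N>" (OLMo-3).
STEP_PATTERN = re.compile(r"^step(\d+)(?:-unsharded)?$")


def parse_step(checkpoint_name: str) -> int:
    """Extract the training step from a checkpoint directory name."""
    match = STEP_PATTERN.match(checkpoint_name)
    if match is None:
        raise ValueError(f"unrecognized checkpoint name: {checkpoint_name!r}")
    return int(match.group(1))


class Checkpoint(ABC):
    """Assumed shape of a framework-independent checkpoint wrapper."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def get_step(self) -> int:
        # Assumption: the step is encoded in the checkpoint directory name.
        return parse_step(self.path.name)

    @abstractmethod
    def to_hf(self, output_dir: Path) -> Path:
        """Convert to HuggingFace format and return the resulting path."""

    @contextmanager
    def as_hf_temporary(self) -> Iterator[Path]:
        """Yield a temporary HuggingFace conversion and delete it afterwards."""
        tmp_dir = Path(tempfile.mkdtemp(prefix="hf_ckpt_"))
        try:
            yield self.to_hf(tmp_dir)
        finally:
            shutil.rmtree(tmp_dir, ignore_errors=True)
```

The context manager form is convenient for evaluation: the converted weights exist only for the duration of the `with` block, so repeated evaluations do not accumulate converted copies on disk.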

The training loop

An experiment follows this flow:

1. Load: download or locate the initial checkpoint
2. Insert: build the insertion dictionary (texts/tokens to inject into training data)
3. Train: run torchrun via subprocess for the configured number of steps
4. Evaluate: run evaluation scripts on the resulting checkpoint
5. Repeat: if training is split into segments (via checkpoint_interval), repeat from step 2

Training failures trigger automatic retries with exponential backoff (up to 10 attempts).
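The retry behaviour can be sketched as follows. The function name, the one-second base delay, and the doubling factor are assumptions; only the limit of 10 attempts comes from the text above. In a real run, `run_once` would wrap something like a `subprocess.run` call to torchrun and report whether the exit code was zero.

```python
import time
from typing import Callable


def run_with_retries(
    run_once: Callable[[], bool],
    max_attempts: int = 10,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it succeeds; return the number of attempts used.

    The wait doubles after every failure: base_delay, 2 * base_delay, ...
    The sleep function is injectable so the logic can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once():
            return attempt
        if attempt < max_attempts:
            # No point in waiting after the final failed attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"training failed after {max_attempts} attempts")
```

Exponential backoff suits transient failures such as a busy GPU node or a flaky filesystem, since each retry gives the environment more time to recover.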

Data insertion

Insertions modify the training data stream by splicing in custom token sequences. The pipeline works in three stages:

InsertionBuilder reads the config and builds an insert_dict — a mapping from global token positions to token sequences
The framework wraps its memmap dataset to inject the tokens at the specified positions during training
Positions are chosen randomly (default), within a range, or at explicit positions

Each insertion can be repeated multiple times to increase exposure. Insertions never overlap with each other.
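As a rough illustration of random placement without overlap, the sketch below builds a position-to-tokens mapping by rejection sampling and then applies it to a token stream. The function names and parameters are hypothetical, and the sketch assumes that inserted tokens overwrite the tokens already at those positions (so the stream length is unchanged); the real InsertionBuilder and dataset wrapper may work differently.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_insert_dict(
    sequences: Sequence[Sequence[int]],
    total_tokens: int,
    repeats: int = 1,
    seed: int = 0,
    max_tries: int = 10_000,
) -> Dict[int, List[int]]:
    """Map global start positions to token sequences, with no overlaps."""
    rng = random.Random(seed)
    insert_dict: Dict[int, List[int]] = {}
    occupied: List[Tuple[int, int]] = []  # half-open spans [start, end)
    for tokens in sequences:
        if not 0 < len(tokens) <= total_tokens:
            raise ValueError("each sequence must be non-empty and fit the stream")
        for _ in range(repeats):
            for _ in range(max_tries):
                start = rng.randrange(0, total_tokens - len(tokens) + 1)
                end = start + len(tokens)
                # Accept the position only if it is disjoint from every span.
                if all(end <= s or start >= e for s, e in occupied):
                    occupied.append((start, end))
                    insert_dict[start] = list(tokens)
                    break
            else:
                raise RuntimeError("could not place insertion without overlap")
    return insert_dict


def apply_insertions(stream: Sequence[int], insert_dict: Dict[int, List[int]]) -> List[int]:
    """Overwrite the stream at each position (assumed semantics)."""
    out = list(stream)
    for start, tokens in insert_dict.items():
        out[start:start + len(tokens)] = tokens
    return out
```

Rejection sampling is simple and works well when insertions cover a small fraction of the stream; a denser insertion budget would call for a placement scheme that tracks free intervals explicitly.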

For full details on insertion types and modes, see Data Insertion.

Evaluation

After each training segment, evaluation scripts run on the resulting checkpoint. Each script receives a HuggingFace checkpoint path and writes metrics to a YAML file. Results are automatically logged to Weights & Biases.

Built-in scripts cover benchmarks (via OLMES), perplexity, fictional knowledge, verbatim memorization, and more. You can also write custom scripts.
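A custom evaluation script might be shaped like the sketch below: it takes a HuggingFace checkpoint path, computes metrics, and writes them to a YAML file. The `--checkpoint` and `--output` argument names, the metric name, and the placeholder `compute_metrics` function are assumptions; see Evaluation for the interface the project actually expects. The writer handles only flat metrics with simple names and numeric values, which lets it emit valid YAML without a third-party dependency.

```python
import argparse
from pathlib import Path
from typing import Dict, List, Optional


def compute_metrics(checkpoint_path: Path) -> Dict[str, float]:
    """Placeholder: a real script would load the model and score it."""
    return {"example_metric": 0.0}


def write_metrics_yaml(metrics: Dict[str, float], output_path: Path) -> None:
    """Write flat scalar metrics as 'name: value' lines, one per metric."""
    lines = [f"{name}: {value}" for name, value in sorted(metrics.items())]
    output_path.write_text("\n".join(lines) + "\n")


def main(argv: Optional[List[str]] = None) -> None:
    parser = argparse.ArgumentParser(description="Example evaluation script")
    # Hypothetical argument names, chosen for this sketch.
    parser.add_argument("--checkpoint", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args(argv)
    write_metrics_yaml(compute_metrics(args.checkpoint), args.output)
```

Calling `main(["--checkpoint", "path/to/hf_checkpoint", "--output", "metrics.yaml"])` would produce a one-line YAML file containing the placeholder metric.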

For full details, see Evaluation.