Data Insertion¶
This document describes how to configure text and token insertions into training data.
Experiment Types¶
add-texts-from-file¶
Insert text strings from a JSONL file. Texts are automatically tokenized.
experiments:
  seed: 42
  experiments:
    - name: my-injection
      type: add-texts-from-file
      file: path/to/texts.jsonl
      key: "prompt"    # Field containing the text (default: "prompt")
      repetitions: 1   # How many times to repeat each text (default: 1)
      mode: random     # Insertion mode (default: "random")
JSONL file format:
{"prompt": "This is the first text to insert."}
{"prompt": "This is the second text to insert."}
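A file in this format can be produced with a few lines of Python. The following is a minimal sketch, not part of the tool itself; the helper name `write_texts_jsonl` is hypothetical, and the `key` argument is assumed to match the `key` setting in the experiment config.

```python
import json
from pathlib import Path


def write_texts_jsonl(texts, path, key="prompt"):
    """Write one JSON object per line, storing each text under `key`.

    Hypothetical helper for illustration; `key` should match the `key`
    value used in the experiment config.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines and quotes,
            # so every record occupies exactly one line of the file.
            f.write(json.dumps({key: text}, ensure_ascii=False) + "\n")
    return path
```

Serializing with `json.dumps` rather than formatting strings by hand keeps multi-line texts and texts containing quotes valid JSONL.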
add-tokens-from-file¶
Insert pre-tokenized sequences from a JSONL file.
experiments:
  seed: 42
  experiments:
    - name: my-injection
      type: add-tokens-from-file
      file: path/to/tokens.jsonl
      key: "tokens"    # Field containing token list
      repetitions: 1
      mode: random
JSONL file format:
{"tokens": [100257, 1212, 374, 264, 1296, 13, 100257]}
{"tokens": [100257, 14364, 1917, 0, 100257]}
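Because pre-tokenized input bypasses the tokenizer, it can help to sanity-check the file before a run. The sketch below is an illustration only: the function name `load_token_sequences` is hypothetical, and the rule it enforces (each record holds a non-empty list of non-negative integers) is an assumption about what a reasonable token file looks like, not documented behavior of the tool.

```python
import json


def load_token_sequences(path, key="tokens"):
    """Read a JSONL file and return its token sequences.

    Raises ValueError on records that do not hold a non-empty list of
    non-negative ints under `key` (an assumed validity rule).
    """
    sequences = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            tokens = record.get(key) if isinstance(record, dict) else None
            ok = (
                isinstance(tokens, list)
                and len(tokens) > 0
                # bool is a subclass of int, so exclude it explicitly
                and all(isinstance(t, int) and not isinstance(t, bool) and t >= 0
                        for t in tokens)
            )
            if not ok:
                raise ValueError(
                    f"line {line_no}: {key!r} must be a non-empty list of non-negative ints"
                )
            sequences.append(tokens)
    return sequences
```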
Insertion Modes¶
random (default)¶
Inserts content at random positions across the entire training run.
- name: my-injection
  type: add-texts-from-file
  file: texts.jsonl
  mode: random
Positions chosen randomly within the training token range
EOS tokens automatically added at boundaries
Insertions never overlap with each other
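The non-overlap guarantee above can be pictured with a small rejection-sampling sketch. This is not the tool's actual placement algorithm, which is not described here; `sample_positions` and its parameters are hypothetical, and each entry of `lengths` is assumed to already include any EOS tokens added at the boundaries.

```python
import random


def sample_positions(lengths, total_tokens, seed=42, max_tries=10_000):
    """Pick a start offset for each insertion so that no two spans overlap.

    Illustrative only: uses simple rejection sampling over [0, total_tokens).
    """
    rng = random.Random(seed)  # fixed seed makes the placement reproducible
    placed = []  # half-open (start, end) spans already taken
    for length in lengths:
        if length > total_tokens:
            raise ValueError("insertion is longer than the training token range")
        for _ in range(max_tries):
            start = rng.randrange(0, total_tokens - length + 1)
            end = start + length
            # Accept only if the candidate span is disjoint from every placed span.
            if all(end <= s or start >= e for s, e in placed):
                placed.append((start, end))
                break
        else:
            raise RuntimeError("could not place insertion without overlap")
    return [start for start, _ in placed]
```

Rejection sampling is adequate when insertions cover a small fraction of the range, which is the typical case for a training run of many millions of tokens.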
random-range¶
Inserts content at random positions within a specified token range.
- name: early-injection
  type: add-texts-from-file
  file: texts.jsonl
  mode: random-range
  start_token: 0
  end_token: 100000000
Same behavior as random, but constrained to [start_token, end_token)
Useful for concentrating insertions in specific training phases
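Training phases are often described in optimizer steps rather than tokens, so a conversion may be needed before filling in start_token and end_token. The sketch below assumes every step consumes a fixed `global_batch_size * seq_len` tokens; that assumption, and the helper name `step_range_to_tokens`, are illustrative and not taken from the tool.

```python
def step_range_to_tokens(start_step, end_step, global_batch_size, seq_len):
    """Convert a half-open step range into a half-open token range.

    Assumes each step consumes exactly global_batch_size * seq_len tokens.
    """
    if not 0 <= start_step < end_step:
        raise ValueError("expected 0 <= start_step < end_step")
    tokens_per_step = global_batch_size * seq_len
    return start_step * tokens_per_step, end_step * tokens_per_step


# Example: the first 100 steps at batch size 256 and sequence length 4096.
start_token, end_token = step_range_to_tokens(0, 100, 256, 4096)
print(start_token, end_token)  # 0 104857600
```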
explicit¶
Insert content at exact positions specified in the JSONL file.
- name: precise-injection
  type: add-texts-from-file
  file: texts_with_positions.jsonl
  key: "text"
  mode: explicit
  position_key: "position"   # Field containing the position (default: "position")
  add_eos: false             # Whether to add EOS tokens (default: false)
JSONL file format for explicit mode:
{"text": "Insert at position 12345", "position": 12345}
{"text": "Insert at position 67890", "position": 67890}
Positions are used exactly as specified
No automatic EOS wrapping unless add_eos: true
The repetitions parameter is ignored (positions are fixed)
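Since explicit positions are used exactly as given, it can be worth checking them while building the file. The following sketch generates records in the format shown above; the helper name `explicit_records` is hypothetical, and rejecting negative or duplicate positions is a conservative assumption rather than documented behavior of the tool.

```python
import json


def explicit_records(items, key="text", position_key="position"):
    """Turn (text, position) pairs into JSONL lines for explicit mode.

    Rejects negative, non-integer, or duplicate positions (an assumed
    safety check, not a rule stated by the tool).
    """
    seen = set()
    lines = []
    for text, position in items:
        if not isinstance(position, int) or isinstance(position, bool) or position < 0:
            raise ValueError(f"invalid position: {position!r}")
        if position in seen:
            raise ValueError(f"duplicate position: {position}")
        seen.add(position)
        lines.append(json.dumps({key: text, position_key: position}))
    return lines
```

The `key` and `position_key` arguments are meant to mirror the settings of the same names in the experiment config.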
Configuration Reference¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| file | string | required | Path to JSONL file |
| key | string | "prompt" | Field name containing text/tokens |
| repetitions | float | 1 | Repetition multiplier (ignored for explicit mode) |
| mode | string | "random" | One of: "random", "random-range", "explicit" |
| start_token | int | - | Start of range (random-range mode only) |
| end_token | int | - | End of range (random-range mode only) |
| position_key | string | "position" | Field name containing position (explicit mode only) |
| add_eos | bool | false | Add EOS tokens (explicit mode only) |
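The defaults and mode-specific requirements in the table can be summarized as a small validation routine. This is a sketch for illustration; `normalize_experiment` is a hypothetical name, and requiring `start_token < end_token` in random-range mode is an assumption that follows from the half-open range, not a check the tool is known to perform.

```python
MODES = {"random", "random-range", "explicit"}

# Defaults as listed in the configuration reference table.
DEFAULTS = {
    "key": "prompt",
    "repetitions": 1,
    "mode": "random",
    "position_key": "position",
    "add_eos": False,
}


def normalize_experiment(cfg):
    """Return a copy of one experiment entry with defaults filled in."""
    if "file" not in cfg:
        raise ValueError("'file' is required")
    out = {**DEFAULTS, **cfg}  # explicit settings override defaults
    if out["mode"] not in MODES:
        raise ValueError(f"unknown mode: {out['mode']!r}")
    if out["mode"] == "random-range":
        if "start_token" not in out or "end_token" not in out:
            raise ValueError("random-range needs start_token and end_token")
        if not out["start_token"] < out["end_token"]:
            raise ValueError("start_token must be less than end_token")
    return out
```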
Examples¶
Basic random insertion¶
experiments:
  seed: 42
  experiments:
    - name: knowledge-injection
      type: add-texts-from-file
      file: ${RESOURCE_PATH}/knowledge.jsonl
      repetitions: 4
Early training injection¶
experiments:
  seed: 42
  experiments:
    - name: early-knowledge
      type: add-texts-from-file
      file: knowledge.jsonl
      mode: random-range
      start_token: 0
      end_token: 50000000
Precise positioning¶
experiments:
  seed: 42
  experiments:
    - name: precise-injection
      type: add-tokens-from-file
      file: positioned_tokens.jsonl
      key: "tokens"
      mode: explicit
      position_key: "pos"
      add_eos: true