Procedural Pretraining

Pretraining on procedurally generated, abstract structured data improves performance and data efficiency across language, code, math, and vision.

Procedural pretraining overview: a randomly initialized model undergoes procedural pretraining on formal languages, simple algorithms, and cellular automata, producing a procedurally-pretrained model that then undergoes standard pretraining on text, code, and math to produce the final base model.

Overview

Current pretraining paradigms expose models directly to web-scale data (text, code, images), expecting them to learn world knowledge and reasoning mechanisms simultaneously. We show that a brief initial exposure to procedural data, i.e. sequences generated by formal grammars and simple algorithms that are entirely devoid of semantic content, can dramatically improve subsequent training.

Much as infants learn simple logic and pattern matching before higher reasoning, procedural pretraining builds general computational mechanisms into transformers before they encounter real-world data. This scaffolding accelerates convergence, improves final performance, and reduces data requirements across diverse domains.

We demonstrate these benefits for large language models (on natural language, code, and mathematics) and for vision transformers (on image classification), showing that procedural data injects useful modality-agnostic priors that complement standard training.

45% fewer language tokens needed to reach equivalent loss on C4
33% fewer code tokens needed on CodeParrot
+1.7% ImageNet-1K accuracy gain with only 1% procedural budget
28% fewer real images needed for equivalent ImageNet accuracy

Papers & Code

ICML 2026 Spotlight

Procedural Pretraining: Warming Up Language Models with Abstract Data

Procedural pretraining pipeline and per-domain results for language models
Liangze Jiang*, Zachary Shinnick*, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

* Equal contribution

CVPR 2026

Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers

Procedural warm-up pipeline for vision transformers
Zachary Shinnick, Liangze Jiang, Hemanth Saratchandran, Damien Teney, Anton van den Hengel

Talks

ELLIS Reading Group

Procedural Pretraining

Talk given at the ELLIS Reading Group DLMath&Efficiency.

Watch Video

Method

What is procedural data?

Procedural data is generated by explicit algorithms, not by trained models. It is infinite, controllable, and verifiable. We use several families of data-generating algorithms spanning different levels of structural complexity:


Formal Languages

k-Dyck: Balanced parentheses with hierarchical, stack-based dependencies (context-free).

k-Dyck Shuffle: Crossing and interleaved dependencies (context-sensitive).

WW: A string concatenated with a copy of itself (context-sensitive).


Simple Algorithms

Sort, Set (deduplication), Union, Reverse, Delete, Identity: sequence transformations that require precise symbol manipulation.

Stack: Simulating push/pop operations on a stack memory.


Cellular Automata

ECA Rule 110: A binary sequence evolving via deterministic Markovian dynamics. The model predicts the next state of the automaton.
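
To make these generators concrete, the sketch below gives minimal illustrative implementations of one member from each family: a k-Dyck sampler, the Sort task, and one step of ECA Rule 110. These are our own sketches of the recipes described above, not the papers' released code; details such as the '>' separator in the Sort task are assumptions.

import random

def k_dyck(length, k=3):
    """Sample a balanced k-Dyck sequence of exactly `length` tokens (length must be even)."""
    pairs = [("(", ")"), ("[", "]"), ("{", "}")][:k]
    out, stack = [], []
    while len(out) + len(stack) < length:
        can_open = len(out) + len(stack) + 2 <= length  # room left for a new pair?
        if stack and (not can_open or random.random() < 0.5):
            out.append(stack.pop())       # close the most recently opened bracket
        else:
            opener, closer = random.choice(pairs)
            out.append(opener)
            stack.append(closer)          # remember the matching closer
    out.extend(reversed(stack))           # close everything still open
    return "".join(out)

def sort_task(n=8, vocab=16):
    """Input-output pair for the Sort task, formatted 'x1 ... xn > sorted'."""
    xs = [random.randrange(vocab) for _ in range(n)]
    return " ".join(map(str, xs)) + " > " + " ".join(map(str, sorted(xs)))

def eca110_step(state):
    """One synchronous update of ECA Rule 110 on a circular binary state."""
    n = len(state)
    return tuple(
        (110 >> (4 * state[(i - 1) % n] + 2 * state[i] + state[(i + 1) % n])) & 1
        for i in range(n)
    )

print(k_dyck(12))                             # e.g. '([]{()})[{}]'
print(sort_task())                            # e.g. '7 2 2 9 0 5 1 3 > 0 1 2 2 3 5 7 9'
print(eca110_step((0, 1, 1, 0, 1, 0, 0, 1)))  # next automaton state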

The procedural pretraining pipeline

Procedural pretraining is a two-stage process: a brief warm-up on procedural data, followed by standard training on the target domain. The warm-up is lightweight, typically 0.1–1% of the total training budget.

Stage 1: Procedural warm-up. Train on procedural sequences (e.g. bracket strings) with next-token or masked-token prediction.

Stage 2: Standard training. Continue on target-domain data (text, code, math, or images) with standard objectives.

For Language Models

GPT-2–style decoder-only transformers trained with next-token prediction. Procedural sequences use a character-level vocabulary. Token embeddings are reinitialized before standard pretraining since the vocabularies differ.
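
As a rough illustration of this hand-off, the sketch below uses Hugging Face's GPT-2 classes; the model sizes, the 64-symbol warm-up vocabulary, and the elided training loops are placeholders, not the papers' actual configuration.

import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Stage 1: a GPT-2-style model with a small character-level vocabulary
# for procedural sequences (sizes are illustrative).
config = GPT2Config(vocab_size=64, n_positions=512, n_layer=6, n_head=8, n_embd=256)
model = GPT2LMHeadModel(config)
# ...brief next-token-prediction warm-up on procedural data goes here...

# Stage 2: switch to the real BPE vocabulary. The transformer blocks are
# kept, but token embeddings are reinitialized since the vocabularies differ.
model.resize_token_embeddings(50257)                           # GPT-2 BPE size
torch.nn.init.normal_(model.transformer.wte.weight, std=0.02)  # fresh embeddings
model.tie_weights()   # re-tie input embeddings and LM head after the reinit
# ...standard pretraining on text/code/math continues from here...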

For Vision Transformers

Standard ViTs with their visual patch embeddings bypassed: abstract symbols are mapped to random, frozen embeddings instead. The model is trained for masked-token prediction. After warm-up, the procedural embeddings and prediction head are discarded, and standard image training proceeds normally.
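
A minimal sketch of this warm-up objective, using a generic PyTorch encoder as a stand-in for the ViT blocks (the real architecture, mask handling, and hyperparameters differ; the reserved mask symbol and 15% mask rate are assumptions):

import torch
import torch.nn as nn

num_symbols, dim, mask_id = 17, 768, 16     # symbol 16 reserved as [MASK]
encoder = nn.TransformerEncoder(            # stand-in for the ViT blocks
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=12)
symbol_emb = nn.Embedding(num_symbols, dim)
symbol_emb.weight.requires_grad_(False)     # random and frozen: never trained
head = nn.Linear(dim, num_symbols)          # temporary prediction head

def warmup_step(tokens, mask_prob=0.15):
    """One masked-token-prediction step on a batch of procedural sequences."""
    masked = torch.rand(tokens.shape) < mask_prob
    logits = head(encoder(symbol_emb(tokens.masked_fill(masked, mask_id))))
    return nn.functional.cross_entropy(logits[masked], tokens[masked])

# After the warm-up, symbol_emb and head are discarded; only the encoder
# weights carry over to initialize the ViT for standard image training.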

Evaluation settings

Additive Setting

Standard training budget is held fixed. Procedural data is added on top to measure absolute performance gains, testing whether procedural data provides a training signal that standard data alone does not impart.

Substitutive Setting

Total budget is fixed. Some standard data is replaced with procedural data to measure data savings, quantifying how efficiently procedural tokens can substitute for real data without degrading performance.

Key Results: Language Models

Procedural pretraining improves standard pretraining

A small amount of procedural data front-loaded before standard pretraining consistently accelerates training and improves final performance across natural language, code, and mathematics.

[Training curves on C4 (language), CodeParrot (code), and DeepMind-Math (mathematics), comparing Sort, Set, and Union warm-ups with no procedural pretraining.]
Procedural pretraining accelerates and improves standard pretraining across all three domains.

Remarkable data efficiency

In the substitutive setting, procedural pretraining can dramatically reduce the amount of standard data needed: adding as little as 0.1–0.3% procedural tokens enables equivalent performance with far less real data.

Semantic tokens needed to match the baseline's final performance, without vs. with a procedural warm-up:

Domain Baseline With procedural Saved
C4 (natural language) 655M 367M 288M (45% less data)
CodeParrot (code) 983M 664M 319M (33% less data)
DeepMind-Math (mathematics) 1.64B 1.41B 0.23B (14% less data)

Scales with model size

The benefits of procedural pretraining persist when scaling up to 350M- and 1.3B-parameter models trained on up to 10.5B tokens. Larger models continue to show clear improvements from the procedural warm-up.

Perplexity (↓) without procedural pretraining vs. ours (Union warm-up):

Model size C4 (language) CodeParrot (code)
350M 40.3 → 39.0 4.97 → 4.62
1.3B 28.8 → 27.3 3.45 → 3.36

These improvements also persist after downstream fine-tuning on WikiText-103, GLUE, and PY150, confirming that procedural pretraining provides lasting benefits not washed away by subsequent training.

Key Results: Vision Transformers

Learning to see without images

A ViT-B/16 warmed up on procedural k-Dyck data, using just 1% of the total training budget, then trained on ImageNet-1K as usual, achieves a +1.72% improvement in top-1 accuracy over default initialization. The procedural data contains no visual or semantic content whatsoever.

Procedural warm-up leads to a distinct and stronger optimization trajectory on ImageNet-1K: the model initialized with procedural warm-up reaches 79.2% top-1 accuracy vs. 77.5% for default initialization.

Accuracy across benchmarks

Procedural warm-up consistently improves downstream performance across all benchmarks, with an average +3.4% absolute improvement over default initialization. It outperforms both the Mimetic structured initialization and FractalDB visual warm-up.

Accuracy (%) on image classification benchmarks. Parenthesized values are absolute improvements over default initialization.
Method ImageNet-1K Tiny-ImageNet Food-101 CIFAR-10 CIFAR-100 STL-10
Default initialization 77.49 55.42 74.52 91.29 68.52 60.52
Mimetic initialization 78.68 (+1.19) 57.20 (+1.78) 79.21 (+4.69) 92.89 (+1.60) 70.72 (+2.20) 65.37 (+4.85)
FractalDB warm-up 78.06 (+0.57) 55.17 (-0.25) 74.25 (-0.27) 88.98 (-2.31) 64.61 (-3.91) 58.62 (-1.90)
Procedural warm-up (ours) 79.21 (+1.72) 58.20 (+2.78) 79.47 (+4.95) 92.81 (+1.52) 71.98 (+3.46) 66.48 (+5.96)

Complementary to ImageNet pretraining

The benefits of procedural warm-up persist even when combined with large-scale ImageNet-1K pretraining and subsequent fine-tuning. This confirms that procedural data provides a qualitatively different, complementary training signal, not merely a head-start on standard visual pretraining.

Accuracy (%) of ViT-B models pretrained on ImageNet-1K and fine-tuned on target datasets. Parenthesized values are absolute improvements over default initialization; the gains persist through large-scale pretraining and fine-tuning.
Method Tiny-ImageNet Food-101 CIFAR-10 CIFAR-100 STL-10
Default init. + ImageNet 86.59 89.64 98.59 87.54 98.55
Mimetic init. + ImageNet 87.29 (+0.70) 90.74 (+1.10) 98.68 (+0.09) 88.78 (+1.24) 98.81 (+0.26)
FractalDB + ImageNet 88.42 (+1.83) 90.13 (+0.49) 98.41 (-0.18) 88.35 (+0.81) 98.46 (-0.09)
Procedural warm-up + ImageNet 87.93 (+1.34) 90.79 (+1.15) 98.68 (+0.09) 89.20 (+1.66) 98.66 (+0.11)

Data efficiency for vision

In the substitutive regime, replacing only 1% of the total pretraining budget with procedural data allows the model to match the accuracy of full ImageNet-1K pretraining while using 28% fewer image samples (approximately 108 million fewer images).

ImageNet-1K (vision): baseline 380M image samples; with procedural warm-up, 264M samples (116M samples saved, 28% fewer images).

Analysis & Insights

Structure matters, not statistics

Shuffling the token order within procedural sequences, preserving the token distribution but destroying the structural dependencies, eliminates all benefits and can even hurt performance. This confirms that the gains arise from the algorithmic structure in the data, not from token frequency or co-occurrence statistics.
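
This control is simple to state in code; a minimal version of the shuffling ablation (our own illustration) is:

import random

def shuffle_control(seq):
    """Keep each sequence's token multiset (unigram statistics) but
    destroy its structural dependencies by permuting the token order."""
    tokens = list(seq)
    random.shuffle(tokens)
    return "".join(tokens)

print(shuffle_control("([{}])[]"))  # e.g. ')]([}{][': same symbols, no longer balanced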

LLMs on algorithmic tasks

Different types of procedural data significantly improve specific algorithmic skills. Shuffling the sequences (removing structure) drops performance back to baseline.

Accuracy (%) on algorithmic tasks:

Task No pretraining Best shuffled Best procedural
Context Recall 11.3 10.3 96.9 (Dyck)
Reversed Addition 76.4 65.0 91.0 (ECA)
Multiplication 42.7 48.4 63.5 (Union)

ViTs on image classification

Shuffling k-Dyck sequences preserves token frequencies but removes hierarchical structure. Performance drops below even random initialization.

Method CIFAR-100 (%)
Default initialization 68.52
k-Dyck warm-up 71.98 (+3.46)
k-Dyck (shuffled sequences) 67.22 (-1.30)

Skills localize in specific components

The useful information from procedural pretraining is distributed across both attention and MLP layers, but the importance of each depends on the target domain:

Attention layers

Most important for structured domains like pure code (JavaCorpus). Attention-only transfer from procedural pretraining outperforms full-model transfer in some cases.

MLP layers

Most important for natural language (WikiText, C4). MLP-only transfer can require even fewer tokens than full-model transfer in the substitutive setting.

Both components

For domains mixing language with structure (documented code, informal math), full-model transfer combines benefits from both types of layers.
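
These component-wise transfers amount to filtering a checkpoint's state dict by parameter name. A hypothetical sketch (assuming GPT-2-style parameter names containing 'attn' and 'mlp'; not the papers' code):

import torch

def transfer(target_model, warmup_ckpt_path, components=("attn",)):
    """Copy only the chosen components (attention and/or MLP layers)
    from a procedurally warmed-up checkpoint into target_model."""
    warmup_state = torch.load(warmup_ckpt_path)
    state = target_model.state_dict()
    for name, weight in warmup_state.items():
        if any(c in name for c in components):  # e.g. 'h.3.attn.c_attn.weight'
            state[name] = weight
    target_model.load_state_dict(state)

# transfer(model, "dyck_warmup.pt", components=("attn",))  # attention-only
# transfer(model, "dyck_warmup.pt", components=("mlp",))   # MLP-only
# Weight mixture (see below): attention from one warm-up, MLPs from another:
# transfer(model, "set_warmup.pt", components=("attn",))
# transfer(model, "eca_warmup.pt", components=("mlp",))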

Late layers benefit most (ViTs)

In vision transformers, the procedural warm-up primarily affects the late (deep) layers, which account for most of the performance gains. This is a surprising finding, since standard visual pretraining is known to act primarily on early layers that capture low-level features. It suggests that procedural warm-up provides a qualitatively different type of signal.

Layerwise transfer from procedurally pretrained ViT to CIFAR-100.
Transferred layers CIFAR-100 (%)
Default initialization 68.52
First 4 layers 68.91 (+0.39)
Middle 4 layers 70.19 (+1.67)
Final 4 layers 71.66 (+3.14)
All layers 71.98 (+3.46)

Combining multiple types of procedural data

The benefits of different types of procedural data are additive. Two combination strategies show promise:

Data mixtures

Pretraining on a mixture of multiple types of procedural data (e.g., Set + Union) can outperform either type alone, achieving better perplexity on both language and code.

Weight mixtures

Assembling a model from the attention layers of one procedurally-pretrained model and the MLP layers of another yields strong performance across all evaluation tasks, combining complementary skills modularly.

Language (WikiText): mixing procedural data types improves perplexity over single-type baselines.
Code (JavaCorpus): data mixtures can outperform single-type procedural pretraining on code.
Weight-level combination: SET attention + ECA MLP layers achieve the best average performance.
Configuration Haystack Addition Rev. Add. Sort Avg.
No procedural pretraining 11.3 59.1 76.4 82.7 57.4
SET (attention-only) 88.9 81.1 54.4 98.1 80.6
ECA (full-model) 10.5 69.6 91.0 76.9 62.0
SET (attn.) + ECA (MLP) 94.4 80.3 82.9 99.4 89.3

BibTeX

Paper 1: Procedural Pretraining (LLMs)

@article{jiang2025proceduralpretraining,
  title={Procedural Pretraining: Warming Up Language Models with Abstract Data},
  author={Jiang, Liangze and Shinnick, Zachary and van den Hengel, Anton and Saratchandran, Hemanth and Teney, Damien},
  journal={arXiv preprint arXiv:2601.21725},
  year={2026},
}

Paper 2: Procedural Warm-Up for ViTs (CVPR 2026)

@inproceedings{shinnick2026canyoulearn,
  title={Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers},
  author={Shinnick, Zachary and Jiang, Liangze and Saratchandran, Hemanth and Teney, Damien and van den Hengel, Anton},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026},
}