Pretraining on procedurally generated, abstract, structured data to improve performance and data efficiency across language, code, math, and vision.
Current pretraining paradigms expose models directly to web-scale data (text, code, images), expecting them to learn world knowledge and reasoning mechanisms simultaneously. We show that a brief initial exposure to procedural data (sequences generated by formal grammars and simple algorithms, entirely devoid of semantic content) can dramatically improve subsequent training.
Much as infants learn simple logic and pattern matching before higher reasoning, procedural pretraining builds general computational mechanisms into transformers before they encounter real-world data. This scaffolding accelerates convergence, improves final performance, and reduces data requirements across diverse domains.
We demonstrate these benefits for large language models (on natural language, code, and mathematics) and for vision transformers (on image classification), showing that procedural data injects useful modality-agnostic priors that complement standard training.
Talk given at ELLIS Reading Group DLMath&Efficiency.
Procedural data is generated by explicit algorithms, not by trained models. It is infinite, controllable, and verifiable. We use several families of data-generating algorithms spanning different levels of structural complexity:
k-Dyck: Balanced parentheses with hierarchical, stack-based dependencies (context-free).
k-Dyck Shuffle: Crossing and interleaved dependencies (context-sensitive).
WW: A string concatenated with a copy of itself (context-sensitive; the copy language is beyond context-free power).
Sort, Set (deduplication), Union, Reverse, Delete, Identity: sequence transformations that require precise symbol manipulation.
Stack: Simulating push/pop operations on a stack memory.
ECA Rule 110: An elementary cellular automaton whose binary state evolves via deterministic, Markovian dynamics; the model predicts the next state of the automaton.
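For concreteness, here is a minimal sketch of two of these generators, a k-Dyck sampler and a Rule 110 evolver. The sampling scheme and function names are illustrative, not taken from the papers' code.

```python
import random

BRACKETS = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")]

def k_dyck(n_pairs: int, k: int = 4) -> str:
    """Sample a balanced string over k bracket types (context-free, stack-based)."""
    stack, out, opens_left = [], [], n_pairs
    while opens_left or stack:
        # Must open if the stack is empty; must close if no opens remain; else coin flip.
        if opens_left and (not stack or random.random() < 0.5):
            o, c = random.choice(BRACKETS[:k])
            out.append(o)
            stack.append(c)  # remember which closer matches this opener
            opens_left -= 1
        else:
            out.append(stack.pop())
    return "".join(out)

def rule110(width: int = 32, steps: int = 8) -> str:
    """Evolve ECA Rule 110 from a random row (periodic boundary); each row
    deterministically fixes the next, which is what the model must predict."""
    row = [random.randint(0, 1) for _ in range(width)]
    rows = [row]
    for _ in range(steps):
        row = [(110 >> (row[i - 1] * 4 + row[i] * 2 + row[(i + 1) % width])) & 1
               for i in range(width)]
        rows.append(row)
    return "\n".join("".join(map(str, r)) for r in rows)
```

Note that `row[i - 1]` wraps around via negative indexing at i = 0. Data like this is infinite in supply, and every sample can be verified against its generating rule.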
Procedural pretraining is a two-stage process: a brief warm-up on procedural data, followed by standard training on the target domain. The warm-up is lightweight, typically 0.1–1% of the total training budget.
Stage 1 (warm-up): Train on procedural sequences with next-token or masked-token prediction.
Stage 2 (main training): Continue on target-domain data with standard objectives.
Language models: GPT-2–style decoder-only transformers trained with next-token prediction. Procedural sequences use a character-level vocabulary; token embeddings are reinitialized before standard pretraining since the vocabularies differ.
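A minimal sketch of this two-stage recipe in PyTorch; the attribute name `tok_embed`, the data loaders, and the schedule are assumptions for illustration, not the papers' code.

```python
import torch.nn as nn

def warm_up_then_pretrain(model, proc_loader, web_loader, opt, warmup_steps, main_steps):
    """Stage 1: next-token prediction on procedural sequences (char-level vocab).
    Stage 2: reinitialize token embeddings (the vocabularies differ), then continue
    with the same objective on real web-scale data."""
    loss_fn = nn.CrossEntropyLoss()

    def run(loader, n_steps):
        for _, tokens in zip(range(n_steps), loader):      # tokens: (batch, seq)
            logits = model(tokens[:, :-1])                 # predict each next token
            loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()

    run(proc_loader, warmup_steps)                     # ~0.1-1% of the total budget
    nn.init.normal_(model.tok_embed.weight, std=0.02)  # fresh embeddings, new vocab
    run(web_loader, main_steps)
```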
Vision transformers: Standard ViTs with their visual patch embeddings bypassed; abstract symbols are mapped to random, frozen embeddings instead. The model is trained for masked-token prediction. After warm-up, the procedural embeddings and prediction head are discarded, and standard image training proceeds normally.
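A sketch of the bypassed front end, assuming a standard PyTorch ViT whose blocks consume (batch, seq, dim) token sequences; the class name is hypothetical.

```python
import torch.nn as nn

class SymbolFrontEnd(nn.Module):
    """Stands in for the ViT patch embedding during warm-up: each abstract
    symbol id maps to a random, frozen vector of the model's hidden size."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.table = nn.Embedding(vocab_size, dim)
        self.table.weight.requires_grad_(False)    # random and frozen throughout

    def forward(self, symbol_ids):                 # (batch, seq) -> (batch, seq, dim)
        return self.table(symbol_ids)
```

After the warm-up this module and the masked-prediction head are thrown away; only the transformer blocks carry the procedural prior into standard image training.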
Additive setting: The standard training budget is held fixed and procedural data is added on top, to measure absolute performance gains and test whether procedural data provides a training signal that standard data alone does not impart.
Substitutive setting: The total budget is held fixed and some standard data is replaced with procedural data, to measure data savings and quantify how efficiently procedural tokens can substitute for real data without degrading performance.
A small amount of procedural data front-loaded before standard pretraining consistently accelerates training and improves final performance across natural language, code, and mathematics.
In the substitutive setting, procedural pretraining can dramatically reduce the amount of standard data needed: substituting as little as 0.1–0.3% of the token budget with procedural data yields equivalent performance with far less real data.
The benefits of procedural pretraining persist and remain consistent when scaling up to 350M and 1.3B parameter models, trained on up to 10.5B tokens. Larger models continue to show clear improvements from the procedural warm-up.
These improvements also persist after downstream fine-tuning on WikiText-103, GLUE, and PY150, confirming that procedural pretraining provides lasting benefits not washed away by subsequent training.
A ViT-B/16 warmed up on procedural k-Dyck data, using just 1% of the total training budget, and then trained on ImageNet-1K as usual achieves a +1.72% improvement in top-1 accuracy over default initialization. This procedural data contains no visual or semantic content whatsoever.
Procedural warm-up consistently improves downstream performance across all benchmarks, with an average +3.4% absolute improvement over default initialization, outperforming both Mimetic structured initialization and the FractalDB visual warm-up. (All numbers are top-1 accuracy in %; parenthesized deltas are relative to default initialization.)
| Method | ImageNet-1K | Tiny-ImageNet | Food-101 | CIFAR-10 | CIFAR-100 | STL-10 |
|---|---|---|---|---|---|---|
| Default initialization | 77.49 | 55.42 | 74.52 | 91.29 | 68.52 | 60.52 |
| Mimetic initialization | 78.68 (+1.19) | 57.20 (+1.78) | 79.21 (+4.69) | 92.89 (+1.60) | 70.72 (+2.20) | 65.37 (+4.85) |
| FractalDB warm-up | 78.06 (+0.57) | 55.17 (-0.25) | 74.25 (-0.27) | 88.98 (-2.31) | 64.61 (-3.91) | 58.62 (-1.90) |
| Procedural warm-up (ours) | 79.21 (+1.72) | 58.20 (+2.78) | 79.47 (+4.95) | 92.81 (+1.52) | 71.98 (+3.46) | 66.48 (+5.96) |
The benefits of procedural warm-up persist even when combined with large-scale ImageNet-1K pretraining and subsequent fine-tuning. This confirms that procedural data provides a qualitatively different, complementary training signal, not merely a head-start on standard visual pretraining.
| Method | Tiny-ImageNet | Food-101 | CIFAR-10 | CIFAR-100 | STL-10 |
|---|---|---|---|---|---|
| Default init. + ImageNet | 86.59 | 89.64 | 98.59 | 87.54 | 98.55 |
| Mimetic init. + ImageNet | 87.29 (+0.70) | 90.74 (+1.10) | 98.68 (+0.09) | 88.78 (+1.24) | 98.81 (+0.26) |
| FractalDB + ImageNet | 88.42 (+1.83) | 90.13 (+0.49) | 98.41 (-0.18) | 88.35 (+0.81) | 98.46 (-0.09) |
| Procedural warm-up + ImageNet | 87.93 (+1.34) | 90.79 (+1.15) | 98.68 (+0.09) | 89.20 (+1.66) | 98.66 (+0.11) |
In the substitutive regime, replacing only 1% of the total pretraining budget with procedural data allows the model to match the accuracy of full ImageNet-1K pretraining while using 28% fewer image samples (approximately 108 million fewer images).
Shuffling the token order within procedural sequences, preserving the token distribution but destroying the structural dependencies, eliminates all benefits and can even hurt performance. This confirms that the gains arise from the algorithmic structure in the data, not from token frequency or co-occurrence statistics.
Different types of procedural data improve specific algorithmic skills, and shuffling the sequences (removing their structure) drops performance back to baseline. The same holds in vision: shuffling k-Dyck sequences preserves token frequencies but removes the hierarchical structure, and accuracy falls below even the default initialization.
| Method | CIFAR-100 (%) |
|---|---|
| Default initialization | 68.52 |
| k-Dyck warm-up | 71.98 (+3.46) |
| k-Dyck (shuffled sequences) | 67.22 (-1.30) |
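The shuffle control is straightforward to reproduce; a sketch assuming sequences stored as lists of token ids:

```python
import random

def shuffle_within_sequence(seq):
    """Control condition: keep each sequence's token multiset (its unigram
    statistics) while destroying the algorithmic structure by permuting order."""
    seq = list(seq)       # copy so the original sequence is untouched
    random.shuffle(seq)   # in-place Fisher-Yates permutation
    return seq
```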
The useful information from procedural pretraining is distributed across both attention and MLP layers, but the importance of each depends on the target domain:
Attention layers: Most important for structured domains like pure code (JavaCorpus); attention-only transfer from procedural pretraining outperforms full-model transfer in some cases.
MLP layers: Most important for natural language (WikiText, C4); MLP-only transfer can require even fewer tokens than full-model transfer in the substitutive setting.
For domains mixing language with structure (documented code, informal math), full-model transfer combines benefits from both types of layers.
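A minimal sketch of selective transfer, assuming parameter names contain identifiable substrings (as in GPT-2-style models, e.g. "attn" and "mlp"); the key lists are model-specific assumptions.

```python
ATTN_KEYS = ("attn",)   # substrings identifying attention parameters
MLP_KEYS = ("mlp",)     # substrings identifying MLP parameters

def transfer_sublayers(target, warmup_state, keys):
    """Copy only parameters whose names match `keys` (attention-only or
    MLP-only transfer) from a warmed-up checkpoint into a fresh model."""
    own = target.state_dict()
    picked = {name: p for name, p in warmup_state.items()
              if name in own and any(k in name for k in keys)}
    own.update(picked)
    target.load_state_dict(own)
    return sorted(picked)   # names actually transferred, for sanity checking
```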
In vision transformers, the procedural warm-up primarily affects the late (deep) layers, which account for most of the performance gains. This is a surprising finding, since standard visual pretraining is known to act primarily on early layers that capture low-level features. It suggests that procedural warm-up provides a qualitatively different type of signal.
| Transferred Layers | CIFAR-100 (%) |
|---|---|
| Default initialization | 68.52 |
| First 4 layers | 68.91 (+0.39) |
| Middle 4 layers | 70.19 (+1.67) |
| Final 4 layers | 71.66 (+3.14) |
| All layers | 71.98 (+3.46) |
The benefits of different types of procedural data are additive. Two combination strategies show promise:
Data mixing: Pretraining on a mixture of multiple types of procedural data (e.g., Set + Union) can outperform either type alone, achieving better perplexity on both language and code.
Model stitching: Assembling a model from the attention layers of one procedurally pretrained model and the MLP layers of another yields strong performance across all evaluation tasks, combining complementary skills modularly (see the sketch after the table below).
| Configuration | Haystack | Addition | Rev. Add. | Sort | Avg. |
|---|---|---|---|---|---|
| No procedural pretraining | 11.3 | 59.1 | 76.4 | 82.7 | 57.4 |
| SET (attention-only) | 88.9 | 81.1 | 54.4 | 98.1 | 80.6 |
| ECA (full-model) | 10.5 | 69.6 | 91.0 | 76.9 | 62.0 |
| SET (attn.) + ECA (MLP) | 94.4 | 80.3 | 82.9 | 99.4 | 89.3 |
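A minimal sketch of the stitching step, assuming two warmed-up checkpoints with identical architectures and the same parameter-name substrings as above:

```python
def stitch(fresh, set_state, eca_state):
    """Assemble one model from the attention layers of a Set-pretrained
    checkpoint and the MLP layers of an ECA-pretrained one; all remaining
    parameters keep their fresh initialization."""
    sd = fresh.state_dict()
    for name in sd:
        if "attn" in name and name in set_state:
            sd[name] = set_state[name]
        elif "mlp" in name and name in eca_state:
            sd[name] = eca_state[name]
    fresh.load_state_dict(sd)
    return fresh
```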
@article{jiang2025proceduralpretraining,
  title={Procedural Pretraining: Warming Up Language Models with Abstract Data},
  author={Jiang, Liangze and Shinnick, Zachary and van den Hengel, Anton and Saratchandran, Hemanth and Teney, Damien},
  journal={arXiv preprint arXiv:2601.21725},
  year={2026}
}

@inproceedings{shinnick2026canyoulearn,
  title={Can You Learn to See Without Images? Procedural Warm-Up for Vision Transformers},
  author={Shinnick, Zachary and Jiang, Liangze and Saratchandran, Hemanth and Teney, Damien and van den Hengel, Anton},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}