EGGROLL — Evolution Strategies at the Hyperscale
Low-Rank Evolution Strategies Achieving 100x Speedup for Billion-Parameter Model Training
Organization: University of Oxford, MILA, NVIDIA
Published: November 20, 2025
Type: Paper (arXiv:2511.16652) + Blog + Code
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Evolution Strategies at the Hyperscale
Algorithm Name: EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning
ArXiv: 2511.16652 (cs.LG / cs.AI)
Project Page: eshyperscale.github.io
Code Repositories:
- Main library: ESHyperscale/HyperscaleES (JAX-based)
- Single-file EGG training: ESHyperscale/nano-egg
- RWKV-7 in JAX: bsarkar321/jaxrwkv
AlphaXiv Discussion: alphaxiv.org/abs/2511.16652
Submission History:
- v1: November 20, 2025
- v2: February 16, 2026 (revised with extended experiments)
First Public Commit: August 13, 2025 — jaxrwkv commit 6d92566 (early EGGROLL prototype)
Lineage: Builds on OpenAI's Evolution Strategies (Salimans et al., 2017), Noise-Reuse ES (Vicol et al., 2023), and structural insights from LoRA (Hu et al., 2022). Concurrent with and compared against ES-LLM (arXiv:2509.24372).
2 Authors and Team
Core Authors
| Author | Affiliation | Role | Marker |
|---|---|---|---|
| Bidipta Sarkar | Oxford / Stanford | Co-lead, algorithm design, RWKV integration | * (equal) |
| Mattie Fellows | Oxford | Co-lead, theoretical analysis | * (equal) |
| Juan Agustin Duque | MILA | Co-lead, implementation | * (equal) |
| Shimon Whiteson | Oxford / Waymo | Senior advisor | * (equal) |
| Jakob Nicolaus Foerster | Oxford | Senior advisor, FORL group lead | * (equal) |
| Aaron Courville | MILA | Senior advisor | |
| Karin Sevegnani | NVIDIA | Industry advisor | |
| Alexander David Goldie | Oxford | Contributing author | |
Contributing Authors (†)
| Author | Contribution Area |
|---|---|
| Alistair Letcher | Theoretical convergence analysis |
| Antonio León Villares | Implementation, benchmarking |
| Anya Sims | EGG architecture design |
| Clarisse Wibault | Experiments |
| Dmitry Samsonov | GPU kernel optimization |
| Dylan Cope | RL experiments |
| Jarek Liesen | Infrastructure |
| Kang Li | Throughput benchmarking |
| Lukas Seier | Integer arithmetic experiments |
| Theo Wolf | RWKV integration |
| Uljad Berdica | Experiments |
| Valentin Mohl | Theory |
Research Group Context
The work originates primarily from the Foundations of Reinforcement Learning (FORL) group at Oxford, led by Jakob Foerster. The group has a track record in multi-agent RL, zero-shot coordination, and policy gradient methods. EGGROLL represents a deliberate pivot toward gradient-free optimization methods as a complement (and potential alternative) to backpropagation-based training. The collaboration with MILA (Aaron Courville's group) and NVIDIA brings scalability expertise and hardware acceleration knowledge.
Bidipta Sarkar's prior work on Social Deduction LLM (using RWKV for multi-agent Among Us) directly motivated the choice of RWKV as the LLM architecture for EGGROLL experiments.
3 Core Contribution
Key Novelty: EGGROLL eliminates the computational barrier between inference and training for evolution strategies by structuring perturbations as rank-r matrices, achieving a hundredfold speedup over naive ES for billion-parameter models while preserving full-rank parameter updates through population aggregation.
The Fundamental Problem
Standard Evolution Strategies (ES) perturb each parameter independently:
θ_perturbed = θ + σ · ε, where ε ~ N(0, I_d)
For a weight matrix W ∈ R^{m×n}, this requires:
- Storing m × n random numbers per population member
- Computing a batched matrix multiplication: x @ (W + σ·ε)^T
- GPU inefficiency: Batched matmuls with unstructured perturbations have low arithmetic intensity on modern GPUs (many random memory accesses, poor cache utilization)
At scale (billions of parameters, millions of population members), this becomes prohibitively slow — the batched matmuls dominate runtime and achieve far less than peak GPU FLOPS.
EGGROLL's Solution
Replace unstructured perturbations with rank-r structured perturbations:
θ_perturbed = θ + σ · A · B^T, where A ∈ R^{m×r}, B ∈ R^{n×r}, entries sampled i.i.d. Gaussian
This transforms the computation:
NAIVE ES: y = x @ (W + σ·ε)^T → Batched matmul (slow)
EGGROLL: y = x @ W^T + σ · (x @ B) @ A^T → Standard matmul + batched outer product (fast)
The key insight: x @ W^T is a standard (non-batched) matrix multiplication, shared by all population members, while (x @ B) @ A^T is a batched product of thin rank-r factors (at r = 1, a batched outer product), which has much higher arithmetic intensity.
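The algebraic identity behind this decomposition is easy to verify numerically. A minimal NumPy check, with small hypothetical shapes (the m, n, r, and batch values below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch = 8, 6, 2, 4          # hypothetical small shapes
W = rng.normal(size=(m, n))
x = rng.normal(size=(batch, n))
A = rng.normal(size=(m, r))          # left low-rank factor
B = rng.normal(size=(n, r))          # right low-rank factor
sigma = 0.1

# Naive ES: materialize the full perturbation, then do the perturbed matmul
naive = x @ (W + sigma * A @ B.T).T

# EGGROLL: shared matmul plus a cheap low-rank correction
eggroll = x @ W.T + sigma * (x @ B) @ A.T

assert np.allclose(naive, eggroll)
```

Both paths produce identical outputs; only the second avoids ever materializing the m×n perturbation.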
Why This Works (Theoretical Guarantee)
Although individual perturbations are rank-r, the aggregated update across the population is full-rank:
Update = (1/σ) · E[F(θ + σ·AB^T) · AB^T]
Each AB^T is rank-r, but the sum of N rank-r matrices (one per population member)
has rank up to min(N·r, min(m,n)) — full-rank whenever N·r ≥ min(m,n).
The paper proves a consistency theorem: as the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate. The convergence rate is O(1/r), meaning even rank-1 perturbations provide useful gradient information.
Relationship to Prior Work
| System | Year | Approach | Population Size | Scale | Speed |
|---|---|---|---|---|---|
| OpenAI ES | 2017 | Full-rank perturbations | ~1,000 | MuJoCo (small NNs) | Baseline |
| Uber ES | 2018 | Novelty search + ES | ~1,000 | Atari (small NNs) | ~1x baseline |
| ES on LLMs | 2025 | Small population, many rollouts | ~10 | 1-7B LLMs | ~1x (avoids batched matmul) |
| LoRA + ES | 2025 | Low-rank adapters only | ~100 | 1-7B LLMs | Moderate |
| EGGROLL | 2025 | Low-rank perturbations, full-rank updates | ~1,000,000 | 1-7B LLMs | ~100x over naive ES |
Critical Distinction: EGGROLL vs. LoRA + ES
Using ES to optimize LoRA adapters directly restricts the update to low-rank forever. EGGROLL uses low-rank perturbations but accumulates them into full-rank parameter updates, making it strictly more expressive:
LoRA + ES: W' = W + A·B^T (always rank ≤ r)
EGGROLL: W' = W + Σ_i fitness_i · (A_i · B_i^T) (rank up to N·r)
This distinction is critical for pretraining (requires full-rank updates) vs. fine-tuning (where low-rank may suffice).
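This rank argument can be checked empirically. A small NumPy sketch with hypothetical sizes (m = n = 16, N = 64 rank-1 members, random stand-in fitness values):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 16
r, N = 1, 64                          # rank-1 perturbations, 64 population members

# LoRA + ES: a single rank-r product — the update can never exceed rank r
lora_update = rng.normal(size=(m, r)) @ rng.normal(size=(n, r)).T
assert np.linalg.matrix_rank(lora_update) == r

# EGGROLL: fitness-weighted sum of N rank-r products
update = np.zeros((m, n))
for _ in range(N):
    A = rng.normal(size=(m, r))
    B = rng.normal(size=(n, r))
    fitness = rng.normal()            # stand-in for a real fitness score
    update += fitness * A @ B.T

# With N·r >= min(m, n), the aggregated update is full-rank
assert np.linalg.matrix_rank(update) == min(m, n)
```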
4 Supported Solutions
EGGROLL is a general-purpose optimization algorithm applicable to any differentiable or non-differentiable objective. The paper demonstrates four solution domains:
| Solution Domain | Description | Model | Task |
|---|---|---|---|
| Pure integer pretraining | Train a nonlinear RNN entirely in int8 | EGG (Evolved Generative GRU) | Character-level LM on MiniPile |
| LLM reasoning (fine-tuning) | Post-training for mathematical reasoning | RWKV-7 (1.5B, 7B) | Countdown, GSM8K |
| Tabula rasa RL | Standard RL benchmark training | Small NNs | MuJoCo-like environments |
| Non-differentiable optimization | Systems with discrete/non-differentiable components | Any architecture | Any fitness function |
What EGGROLL Enables That Backprop Cannot
| Capability | Backprop | EGGROLL |
|---|---|---|
| Integer-only training | No (requires float gradients) | Yes (only needs forward pass) |
| Non-differentiable activations | No | Yes |
| Black-box fitness functions | No (needs differentiable loss) | Yes |
| Training without activation functions | Impractical (model stays linear; int8 overflow nonlinearity is non-differentiable) | Yes (demonstrated with EGG) |
| Optimizing discrete components | Requires relaxation (Gumbel-Softmax, etc.) | Direct optimization |
| Hardware-in-the-loop training | No (non-differentiable hardware) | Yes (just needs input → output) |
| Multi-agent end-to-end optimization | Limited (credit assignment) | Natural (fitness = team outcome) |
What EGGROLL Does NOT Target
- Supervised learning at scale — Backprop remains more sample-efficient for standard supervised tasks
- Single-example gradient computation — EGGROLL requires a population, minimum batch size is the population
- Low-latency training — EGGROLL's strength is throughput, not latency per update
- Tasks where backprop works well — No reason to replace backprop for standard differentiable objectives
5 LLM Integration
EGGROLL as an LLM Training Method
Unlike systems that use LLMs as mutation operators (AlphaEvolve, FunSearch), EGGROLL trains LLMs directly via evolution strategies. The LLM is the object being optimized, not the optimizer.
AlphaEvolve: LLM ──generates──► Code mutations ──evaluates──► Fitness
EGGROLL: Random ──perturbs──► LLM weights ──evaluates──► Fitness ──updates──► LLM weights
LLM Architecture: RWKV-7
EGGROLL's primary LLM experiments use RWKV-7 ("Goose"), a linear-attention recurrent model:
| Property | Value | Why Chosen |
|---|---|---|
| Architecture | Linear RNN (RWKV-7) | Constant memory per token during generation |
| Sizes tested | 1.5B, 7B parameters | Demonstrates billion-scale feasibility |
| Base model | Pre-trained RWKV-7 Goose | Reasoning traces already in pretraining data |
| Framework | JAX | Efficient vmap for population parallelism |
| KV cache | Fixed size (unlike Transformers) | No dynamic memory allocation during generation |
Why not Transformers? The growing KV-cache in Transformer architectures creates memory management challenges when running thousands of parallel population members. RWKV's fixed-size state means memory is predictable and constant regardless of sequence length. This is a practical engineering constraint, not an algorithmic limitation — EGGROLL's math works with any architecture.
Fitness Functions for LLM Training
| Task | Fitness Function | Details |
|---|---|---|
| Countdown | Correctness of countdown sequence | Binary reward: correct final answer or not |
| GSM8K | Mathematical answer accuracy | Binary reward: correct numerical answer |
| Pretraining (EGG) | Cross-entropy loss | Bits per byte on MiniPile test set |
| General RL | Environment reward | Task-specific cumulative reward |
Comparison with GRPO
EGGROLL is positioned as a competitor to GRPO (Group Relative Policy Optimization) for LLM reasoning:
| Method | Gradient Type | Requirements | Population |
|---|---|---|---|
| GRPO | Policy gradient (backprop) | Differentiable loss, float arithmetic | Group size ~16-64 |
| EGGROLL | ES gradient (forward-pass only) | Any fitness function, any arithmetic | Population up to ~10^6 |
6 Key Results
6.1 Throughput: 100x Speedup
The headline result — EGGROLL achieves up to 91% of pure batch inference throughput:
Throughput vs. Population Size (Billion-parameter model, H100)
Tokens/sec
(millions)
│
10 │ ●──●──●──●──●──●──●──● Pure inference (upper bound)
│ ○──○──○──○──○──○──○──○ EGGROLL (91% of inference)
8 │
│
6 │
│
4 │
│
2 │ ×
│ ×
0 │ ×──×──× Naive ES (100x slower at large pop sizes)
└─────────────────────────── Population size
2^10 2^12 2^14 2^16 2^18 2^20
The 100x speedup comes from replacing batched matrix multiplications (low arithmetic intensity) with standard matrix multiplications plus batched vector operations (high arithmetic intensity).
6.2 EGG: Pure Integer Language Model Pretraining
EGGROLL enables a previously impossible experiment: training a nonlinear RNN entirely in int8:
| Property | Value |
|---|---|
| Architecture | EGG (Evolved Generative GRU) |
| Dimensions | D=256, L=6 layers |
| Datatypes | All weights int8, computations int8 with int32 accumulation |
| Activation functions | None (int8 overflow provides implicit nonlinearity) |
| Dataset | MiniPile (character-level) |
| Tokens per second | 10 million (single H100) |
| Population size | 2^20 = 1,048,576 |
| Test loss | 3.40 bits/byte |
| Sequence sharing | 16 sequences shared across population |
Training Loss (EGG, bits/byte)
│
5.0│ ●
│ ●
4.5│ ●
│ ●
4.0│ ●●
│ ●●
3.5│ ●●●●●●
│ ●●●●●●●●●●●● → 3.40 bits/byte
3.0│
└──────────────────────────── Training steps
0 500 1000 1500 2000
Key insight: The int8 tensor multiplication followed by int32 accumulation and cast back to int8 introduces implicit nonlinearity through overflow/saturation. This means no explicit activation functions are needed — the arithmetic format itself provides nonlinearity.
6.3 LLM Reasoning: Countdown Task
On the countdown task, EGGROLL outperforms GRPO and matches or exceeds concurrent ES-LLM results:
| Model | Method | Accuracy |
|---|---|---|
| LLaMA-3.2 1B | GRPO (ES-LLM paper) | ~60% |
| RWKV 1.5B | EGGROLL | ~65% |
| Qwen 2.5 1.5B | GRPO (ES-LLM paper) | ~70% |
| RWKV 7B | EGGROLL | ~80% |
| All 7B models | GRPO (ES-LLM paper) | ~70% |
EGGROLL with RWKV-7B outperforms all reported 7B results from the concurrent ES-LLM paper despite using a weaker base model (RWKV vs. LLaMA/Qwen).
6.4 LLM Reasoning: GSM8K
On GSM8K (grade school math), EGGROLL also outperforms GRPO:
| Model | Method | Accuracy |
|---|---|---|
| RWKV 1.5B | GRPO | Baseline |
| RWKV 1.5B | EGGROLL | Outperforms GRPO |
6.5 Data Efficiency
EGGROLL demonstrates surprising data efficiency at large population sizes:
Population Size vs. Data Sharing
Solid lines: 512 population members share each sequence
Dashed lines: Only 2 members share each sequence (paired)
At large population sizes (2^20), both strategies achieve similar
performance — suggesting that ES can extract useful gradient signal
even when many population members evaluate the same data.
6.6 Tabula Rasa RL
In standard RL settings, EGGROLL matches naive ES performance without speed compromise:
"EGGROLL does not compromise performance compared to ES in tabula rasa RL settings, despite being faster."
7 Reproducibility
Open-Source Status
| Component | Available | Repository | License |
|---|---|---|---|
| EGGROLL algorithm (JAX) | Yes | HyperscaleES | Open |
| Nano-EGG (single file) | Yes | nano-egg | Open |
| RWKV JAX implementation | Yes | jaxrwkv | Open |
| Pre-trained RWKV-7 weights | Yes | HuggingFace (BlinkDL) | Apache 2.0 |
| MiniPile dataset | Yes | HuggingFace | Open |
| Countdown task | Yes | Standard benchmark | N/A |
| GSM8K dataset | Yes | HuggingFace | MIT |
Reproducibility Assessment
Verdict: Highly reproducible. All core components are open-source, the algorithm is described in full mathematical detail, code is provided in multiple repositories, and the base models (RWKV-7) are freely available. The primary barrier is hardware: the headline experiments require H100 GPUs.
What Can Be Reproduced
- The complete EGGROLL algorithm (JAX implementation provided)
- EGG (int8 RNN) training via nano-egg single-file codebase
- RWKV-7 fine-tuning on countdown and GSM8K
- Throughput benchmarks on H100 GPUs
- Theoretical convergence analysis (proofs in paper appendix)
Hardware Requirements for Reproduction
| Experiment | Minimum Hardware | Ideal Hardware |
|---|---|---|
| Nano-EGG training | 1x A100 80GB | 1x H100 |
| RWKV 1.5B fine-tuning | 1x A100 80GB | 4x H100 |
| RWKV 7B fine-tuning | 4x A100 80GB | 8x H100 |
| Throughput benchmarks | 1x H100 | 8x H100 |
| Full paper reproduction | 8x H100 | 64x H100 |
Community Contribution Model
The nano-egg repository explicitly encourages community contributions, modeled after the nanogpt speedrun:
"We highly encourage community contributions, similar to the nanogpt speedrun, to see how efficient we can make pure evolution pretraining in integer formats!"
8 Compute and API Costs
Throughput Economics
EGGROLL's key economic insight is that ES training becomes nearly as cheap as inference. This has profound implications:
Cost Model (EGGROLL vs. Backprop for LLM Training)
Backprop EGGROLL
──────── ───────
Forward pass: 1x FLOPS 1x FLOPS (same)
Backward pass: ~2x FLOPS 0 FLOPS (no backprop)
Optimizer state: ~3x memory 0 memory (no Adam state)
Population: 1 (or micro-batch) N members (parallelized)
──────── ───────
Total FLOPS: ~3x per sample ~1x per sample × N population
Total memory: Model + optimizer Model + perturbation keys
+ activations
H100 Throughput Numbers
| Configuration | Tokens/Second | % of Inference | GPU Utilization |
|---|---|---|---|
| Pure batch inference | ~10M tok/s | 100% | ~95% |
| EGGROLL (rank-1) | ~9.1M tok/s | 91% | ~87% |
| Naive ES | ~0.1M tok/s | 1% | ~5% |
| Backprop training | ~3.3M tok/s | 33% | ~90% |
Memory Costs
| Component | Naive ES | EGGROLL (rank-1) |
|---|---|---|
| Perturbation storage per member | O(m × n) per layer | O(m + n) per layer |
| Total perturbation memory | ~4x model size | ~2/min(m,n) × model size |
| RNG state | Single seed per member | Single seed per member |
| Practical impact (7B model) | ~28 GB per member | ~0.01 GB per member |
Cost Comparison for LLM Reasoning Training
| Method | Hardware | Time | Estimated Cost (cloud) |
|---|---|---|---|
| GRPO (backprop, 1.5B) | 4x A100 | ~4 hours | ~$50 |
| Naive ES (1.5B, pop=1K) | 4x A100 | ~400 hours | ~$5,000 |
| EGGROLL (1.5B, pop=1M) | 4x H100 | ~4 hours | ~$80 |
| GRPO (backprop, 7B) | 8x A100 | ~12 hours | ~$300 |
| EGGROLL (7B, pop=1M) | 8x H100 | ~8 hours | ~$320 |
EGGROLL makes ES cost-competitive with backprop-based methods for the first time at billion-parameter scale.
9 Architecture Solution
System Architecture
┌──────────────────────────────────────────────────────────────┐
│ EGGROLL Training System │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Population Manager │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ RNG Key → fold_in(key, thread_id) → (A_i, B_i) │ │ │
│ │ │ For each population member i = 1..N: │ │ │
│ │ │ - Generate rank-r perturbation (A_i, B_i) │ │ │
│ │ │ - No storage needed (recomputable from seed) │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Batched Forward Pass │ │
│ │ │ │
│ │ For each layer with weight W ∈ R^{m×n}: │ │
│ │ y_shared = x @ W^T (standard matmul, 1x) │ │
│ │ y_perturb = σ · (x @ B_i) @ A_i^T (batched, fast) │ │
│ │ y_i = y_shared + y_perturb (per population member)│ │
│ │ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Fitness Evaluation │ │
│ │ │ │
│ │ For each population member i: │ │
│ │ - Generate output sequence │ │
│ │ - Compute fitness F_i (task-dependent) │ │
│ │ - Return scalar fitness value │ │
│ │ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Parameter Update │ │
│ │ │ │
│ │ Gradient estimate: │ │
│  │    g = (1/Nσ) Σ_i F_i · A_i · B_i^T                    │  │
│ │ │ │
│ │ Fused directly into parameters: │ │
│ │ W ← W + α · g (full-rank update, rank ≤ N·r) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Data Flow for One Training Step
Step 1: Sample input data batch
D = {x_1, x_2, ..., x_B} (B sequences)
Step 2: Distribute across population
Each of N population members gets some subset of D
(or all members share all of D)
Step 3: Forward pass with EGGROLL perturbations
For member i, layer l:
y_i^l = x^l @ W_l^T + σ · (x^l @ B_i^l) @ (A_i^l)^T
Step 4: Generate outputs & compute fitness
F_i = fitness(generate(model_i, input))
Step 5: Compute ES gradient
For each layer l:
g_l = (1/Nσ) Σ_i F_i · A_i^l · (B_i^l)^T
Step 6: Update parameters
W_l ← W_l + α · g_l
Repeat steps 1-6.
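The six steps above can be exercised end-to-end on a toy problem. The following self-contained NumPy sketch maximizes a synthetic black-box fitness (negative distance to a random target matrix); the shapes, population size, and hyperparameters are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 8
W_star = rng.normal(size=(m, n))            # hypothetical target weights
W = np.zeros((m, n))
sigma, lr, N, r = 0.1, 0.1, 512, 1

def fitness(Wp):
    # Toy black-box fitness: closer to W_star is better (no gradients used)
    return -np.sum((Wp - W_star) ** 2)

for step in range(120):
    factors, fits = [], []
    for i in range(N):                      # Steps 2-4: perturb and evaluate
        A = rng.normal(size=(m, r))
        B = rng.normal(size=(n, r))
        fits.append(fitness(W + sigma * A @ B.T))
        factors.append((A, B))
    fits = np.array(fits)
    fits = (fits - fits.mean()) / (fits.std() + 1e-8)   # fitness shaping
    grad = np.zeros_like(W)
    for f, (A, B) in zip(fits, factors):    # Step 5: ES gradient estimate
        grad += f * A @ B.T
    W += lr * grad / (N * sigma)            # Step 6: full-rank update

assert fitness(W) > 0.5 * fitness(np.zeros((m, n)))     # fitness improved markedly
```

Despite every individual perturbation being rank-1, the trained W approaches the full-rank target.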
Parallelism Model
GPU Memory Layout (Single H100, 80GB)
┌──────────────────────────────────────────────┐
│ Model Weights (shared): ~14 GB (7B f16) │
│ ┌────────────────────────────────────────┐ │
│ │ Batch of population members │ │
│ │ (vmap over thread_id dimension) │ │
│ │ │ │
│ │ Thread 0: seed_0 → (A_0, B_0) → F_0 │ │
│ │ Thread 1: seed_1 → (A_1, B_1) → F_1 │ │
│ │ ... │ │
│ │ Thread K: seed_K → (A_K, B_K) → F_K │ │
│ │ │ │
│ │ Each thread: ~0.01 GB overhead │ │
│ └────────────────────────────────────────┘ │
│ │
│ Input data buffer: ~2 GB │
│ Activation memory: ~10 GB │
│ Remaining / fragmentation: ~54 GB │
└──────────────────────────────────────────────┘
Population members fit in the "remaining" budget.
With ~0.01 GB per member overhead:
54 GB / 0.01 GB ≈ 5,400 members per GPU pass
With gradient accumulation: millions of members possible
Multi-GPU Scaling
8x H100 (NVLink interconnect)
GPU 0: Members [0, N/8) ──┐
GPU 1: Members [N/8, 2N/8) ──┤
GPU 2: Members [2N/8, 3N/8) ──┤
GPU 3: Members [3N/8, 4N/8) ──┼──► AllReduce(fitness-weighted
GPU 4: Members [4N/8, 5N/8) ──┤ perturbation sums)
GPU 5: Members [5N/8, 6N/8) ──┤ → single gradient update
GPU 6: Members [6N/8, 7N/8) ──┤
GPU 7: Members [7N/8, N) ──┘
Communication: Only scalar fitness values + gradient updates
(NOT perturbation matrices — these are recomputable from seeds)
10 Component Breakdown
10.1 Random Number Generation (RNG) System
The perturbation generation uses JAX's splittable PRNG system:
import jax

def generate_perturbation(base_key, thread_id, shape, rank=1):
    """Generate a rank-r perturbation from a single base key + thread ID."""
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    params = jax.random.normal(key, (m + n, rank))
    B = params[:n]  # n x r
    A = params[n:]  # m x r
    return A, B
Properties:
- Deterministic: Same (base_key, thread_id) always produces the same perturbation
- No storage: Perturbations are recomputed on-the-fly, not stored
- Parallelizable: fold_in is O(1) and independent across threads
- Communication-free: Each GPU can reconstruct any member's perturbation from the shared base key
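These properties can be demonstrated directly; a small sketch using JAX's PRNG (the shapes and thread IDs are arbitrary):

```python
import jax
import jax.numpy as jnp

base_key = jax.random.PRNGKey(0)

def rank1_perturbation(base_key, thread_id, m, n):
    # fold_in is deterministic: the same (key, thread_id) pair always
    # yields the same perturbation, so nothing needs to be stored
    key = jax.random.fold_in(base_key, thread_id)
    params = jax.random.normal(key, (m + n,))
    return params[n:], params[:n]    # A (m,), B (n,)

A1, B1 = rank1_perturbation(base_key, 7, m=4, n=3)
A2, B2 = rank1_perturbation(base_key, 7, m=4, n=3)   # e.g. rebuilt on another GPU
assert jnp.allclose(A1, A2) and jnp.allclose(B1, B2)

# A different thread_id gives an independent perturbation
A3, _ = rank1_perturbation(base_key, 8, m=4, n=3)
assert not jnp.allclose(A1, A3)
```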
10.2 EGGROLL Forward Pass
The core computation replaces the standard linear layer:
def eggroll_linear(base_key, sigma, W, x, thread_id, rank=1):
"""EGGROLL-perturbed linear layer."""
key = jax.random.fold_in(base_key, thread_id)
m, n = W.shape
perturbation_params = jax.random.normal(key, (m + n, rank))
B = perturbation_params[:n] # n x r
A = perturbation_params[n:] # m x r
# Standard matmul (shared across population, computed once)
y_base = x @ W.T
# Per-member perturbation (batched, very fast)
y_perturb = sigma * (x @ B) @ A.T
return y_base + y_perturb
# Vectorize over population dimension
batched_eggroll = jax.vmap(eggroll_linear, in_axes=(None, None, None, 0, 0))
10.3 Fitness Evaluation Pipeline
Input Sequences → Model Forward Pass → Output Tokens → Fitness Score
│ │ │ │
│ EGGROLL perturbations │ │
│ (per population member) │ │
│ │ │
└── Shared across members ─────────────┘ │
│
┌──────────────┘
│
Task-specific:
• Countdown: exact match
• GSM8K: numerical answer
• LM: cross-entropy loss
• RL: cumulative reward
10.4 Gradient Estimation
The ES gradient estimate in EGGROLL:
import jax.numpy as jnp

def eggroll_gradient(fitnesses, base_key, sigma, W_shape, rank=1):
    """Compute the EGGROLL gradient estimate."""
    N = len(fitnesses)
    m, n = W_shape
    gradient = jnp.zeros((m, n))
    for i in range(N):
        # Perturbations are regenerated from the base key, never stored
        A_i, B_i = generate_perturbation(base_key, i, W_shape, rank)
        gradient += fitnesses[i] * (A_i @ B_i.T)
    return gradient / (N * sigma)
In practice, the gradient computation is fused into the parameter update and vectorized:
# Fused update (schematic of the actual implementation)
def update_params(params, fitnesses, base_key, sigma, lr, rank=1):
    """Fuse gradient computation and parameter update."""
    centered_fitnesses = fitnesses - fitnesses.mean()
    for layer_name, W in params.items():
        # Reconstruct all perturbations and compute their weighted sum
        # (vectorized in practice, not an explicit loop)
        delta_W = vmap_weighted_perturbation_sum(
            centered_fitnesses, base_key, layer_name, W.shape, rank
        )
        params[layer_name] = W + lr * delta_W / (len(fitnesses) * sigma)
    return params
10.5 EGG Architecture (Evolved Generative GRU)
The EGG model is a custom architecture designed to demonstrate EGGROLL's unique capabilities:
EGG Architecture (D=256, L=6)
Input tokens (int8 indices)
│
▼
┌───────────────┐
│ Embedding │ int8 lookup table
│ (256-dim) │
└───────┬───────┘
│ (int8)
┌───────▼───────┐
│ minGRU Block │ No tanh, no sigmoid
│ (modified) │ int8 matmul + int32 accumulate + int8 cast
│ × 6 layers │ Implicit nonlinearity from int8 overflow
│ │
│ MLP Block │ No activation functions
│ (no act fn) │ int8 matmul only
└───────┬───────┘
│ (int8)
┌───────▼───────┐
│ Output head │ int8 → int32 logits
│ + softmax │ Softmax via lookup table (no float)
└───────┬───────┘
│
Loss (bits/byte)
Key design decisions:
- All int8 weights: fastest datatype on H100 Tensor Cores
- No activation functions: the int8 → int32 → int8 cast chain introduces implicit nonlinearity
- No optimizer state: EGGROLL has no momentum, no Adam states — just parameter updates
- Lookup-table softmax: even the loss computation avoids floating point
10.6 RWKV-7 Integration
For LLM fine-tuning experiments, EGGROLL wraps the existing RWKV-7 model:
RWKV-7 "Goose" Architecture
│
▼
┌────────────────────┐
│ Token embedding │
└────────┬───────────┘
│
┌────────▼───────────┐
│ RWKV-7 Block × N │
│ ┌───────────────┐ │
│ │ Time mixing │ │ ← EGGROLL perturbs these weights
│ │ (linear attn) │ │
│ └───────────────┘ │
│ ┌───────────────┐ │
│ │ Channel mix │ │ ← EGGROLL perturbs these weights
│ │ (FFN variant) │ │
│ └───────────────┘ │
└────────┬───────────┘
│
┌────────▼───────────┐
│ Language model head│
└────────┬───────────┘
│
Generate response → Evaluate fitness
11 Core Mechanisms (Detailed)
11.1 Low-Rank Perturbation Theory
The mathematical foundation of EGGROLL rests on replacing the standard ES gradient estimator with a structured variant.
Standard ES gradient:
∇_θ E[F(θ + σε)] = (1/σ) E[F(θ + σε) · ε], ε ~ N(0, I_d)
EGGROLL gradient (for matrix parameter W ∈ R^{m×n}):
∇_W E[F(W + σ·AB^T)] = (1/σ) E[F(W + σ·AB^T) · AB^T], A ∈ R^{m×r}, B ∈ R^{n×r}, entries i.i.d. N(0,1)
Consistency Theorem (informal): As the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate at rate O(1/r), where r is the perturbation rank.
Linearizing Effect: The paper proves that in high dimensions, the objective function locally linearizes around the current parameters, meaning rank-1 perturbations capture the dominant gradient direction. This is analogous to how random projections preserve distances in high dimensions (Johnson-Lindenstrauss lemma).
11.2 Arithmetic Intensity Analysis
The key to EGGROLL's speedup is arithmetic intensity — the ratio of compute (FLOPS) to memory bandwidth:
Operation FLOPS Memory Bytes Arithmetic Intensity
─────────────────────────────────────────────────────────────────────────────────
Standard matmul (m×k × k×n): 2mkn 2(mk + kn + mn) mkn / (mk+kn+mn)
≈ k (for m=n=k)
Batched matmul (N × m×k × k×n): 2Nmkn 2N(mk + kn + mn) Same per-element
But each member multiplies by its own weights → no operand reuse across the batch, low GPU utilization
EGGROLL decomposition:
x @ W^T (shared): 2mkn 2(mk + kn + mn) ≈ k (high)
x @ B (batched, B ∈ R^{n×r}): 2Nkr 2N(kr + nr + kr) ≈ min(k,r) (ok)
(xB) @ A^T (batched): 2Nmr 2N(mr + mr + mr) ≈ r/3 (fast for r=1)
For typical transformer hidden dimensions (k = 4096) and rank r = 1:
- Standard matmul: intensity ≈ 4096 (excellent)
- EGGROLL perturbation: intensity ≈ 1–4096 (architecture-dependent, but much better than batched full-rank)
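The FLOP side of this accounting can be sanity-checked directly. The sketch below counts per-member multiply-add work (the arithmetic-intensity gains come on top of this FLOP reduction); the shapes are illustrative:

```python
def naive_member_flops(b, m, n):
    # Naive ES: every member runs its own (b x n) @ (n x m) perturbed matmul
    return 2 * b * n * m

def eggroll_member_flops(b, m, n, r=1):
    # EGGROLL: per member, only the low-rank correction x @ B then (xB) @ A^T;
    # the shared x @ W^T matmul is amortized once over the whole population
    return 2 * b * n * r + 2 * b * r * m

b, m, n = 32, 4096, 4096
ratio = naive_member_flops(b, m, n) / eggroll_member_flops(b, m, n)
print(ratio)   # 2048.0 = m*n / (m + n) for rank 1
```

For square 4096-dimensional layers, each population member's extra work shrinks by a factor of about 2,000.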
11.3 Population Scaling Dynamics
EGGROLL enables population sizes 3 orders of magnitude larger than prior ES work:
Population Size Regimes
OpenAI ES (2017): N ≈ 1,000 │ Full-rank perturbations
│ Small NNs (MuJoCo)
│
ES-LLM (2025): N ≈ 10 │ Small population, many rollouts
│ per member to reduce variance
│
EGGROLL: N ≈ 1,000,000 │ Rank-1 perturbations
│ Billion-parameter models
│ High throughput
Theoretical implication:
- Gradient estimate variance ∝ 1/N
- At N = 10^6, variance is 1000x lower than N = 10^3
- This enables stable updates with larger learning rates
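The 1/N variance scaling can be observed with a standard ES estimator on a toy linear fitness (the dimension, population sizes, and repetition count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 0.1
g_true = rng.normal(size=d)     # gradient of a toy linear fitness F(θ) = g·θ

def es_grad(N):
    # Standard ES estimator (1/Nσ) Σ_i F(θ+σε_i)·ε_i, with F(θ) subtracted as baseline
    eps = rng.normal(size=(N, d))
    F = (sigma * eps) @ g_true
    return (F[:, None] * eps).sum(axis=0) / (N * sigma)

# Empirical per-coordinate variance of the estimator at two population sizes
var_small = np.var([es_grad(100) for _ in range(300)], axis=0).mean()
var_large = np.var([es_grad(10_000) for _ in range(300)], axis=0).mean()
assert var_small / var_large > 50    # ~100x the population → ~100x lower variance
```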
11.4 Noise-Reuse for Sequence Processing
For language modeling, EGGROLL incorporates Noise-Reuse ES (Vicol et al., 2023):
Standard ES on sequences:
Each token position: new perturbation → O(T) perturbation samples
Memory: O(T × d) per population member
Noise-Reuse ES:
Reuse the same perturbation across multiple token positions
Take multiple parameter updates within a single sequence
Memory: O(d) per population member (independent of T)
Timeline for one sequence (T=100 tokens):
┌──────┬──────┬──────┬──────┬──────┐
│tok 1 │tok 25│tok 50│tok 75│tok100│
│perturb│ │update│ │update│
│ ε_1 │ ε_1 │ ε_2 │ ε_2 │ done │
└──────┴──────┴──────┴──────┴──────┘
Same perturbation reused
Update after every K tokens
11.5 Integer Arithmetic as Nonlinearity
The most conceptually striking mechanism in the paper is the use of integer overflow as a source of nonlinearity:
Float32 behavior: Int8 behavior:
3.0 × 50.0 = 150.0 (linear) 3 × 50 = 150 → 127 (saturated!)
3.0 × 100.0 = 300.0 (linear) 3 × 100 = 300 → 44 (overflow!)
Int8 multiplication chain:
Input (int8) → Matmul → Accumulate (int32) → Cast (int8)
↑
Implicit nonlinearity!
Values > 127 wrap/saturate
Creates sigmoid-like behavior
This is inspired by prior OpenAI work showing that "nonlinear computation in deep linear networks" can emerge from floating-point rounding. EGGROLL takes this further: int8's extreme quantization makes the nonlinearity pronounced enough to train useful models.
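Both cast behaviors shown above are easy to reproduce in NumPy (note that 150 → 127 corresponds to a saturating clip, while 300 → 44 corresponds to modulo-256 wraparound; either breaks linearity):

```python
import numpy as np

prods = np.array([3 * 50, 3 * 100], dtype=np.int32)    # [150, 300] after int32 accumulate

# Saturating cast back to int8: values clip to the [-128, 127] range
saturated = np.clip(prods, -128, 127).astype(np.int8)
assert [int(v) for v in saturated] == [127, 127]

# Wraparound cast (modulo 256): 150 -> -106, 300 -> 44
wrapped = prods.astype(np.int8)
assert [int(v) for v in wrapped] == [-106, 44]
```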
11.6 Update Rule for EGG (Integer)
The EGG model uses a specialized update rule for integer parameters:
import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    """Integer-compatible parameter update."""
    # Threshold: only update if the gradient estimate is large enough
    update = jnp.where(
        jnp.abs(gradient_estimate) > threshold,
        jnp.sign(gradient_estimate),  # Step by ±1 in int8
        0
    )
    return jnp.clip(W_int8 + update, -128, 127).astype(jnp.int8)
Properties:
- No learning rate (step size is always ±1 in int8 space)
- No momentum or optimizer state
- Threshold prevents noise from dominating updates
- clip ensures int8 range
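A quick usage check of this rule (the update function is restated so the snippet runs standalone; the example gradient values are hypothetical):

```python
import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    # Thresholded ±1 integer update, as described above
    update = jnp.where(jnp.abs(gradient_estimate) > threshold,
                       jnp.sign(gradient_estimate), 0)
    return jnp.clip(W_int8 + update, -128, 127).astype(jnp.int8)

W = jnp.array([-128, 0, 126, 127], dtype=jnp.int8)
g = jnp.array([-5.0, 0.5, 2.0, 2.0])   # hypothetical ES gradient estimate

W_new = egg_update(W, g)
# -128 - 1 clips back to -128; |0.5| <= threshold leaves 0 unchanged;
# 126 + 1 = 127; 127 + 1 clips to 127
assert [int(v) for v in W_new] == [-128, 0, 127, 127]
```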
12 Programming Language
Implementation Stack
| Component | Language | Framework | Why |
|---|---|---|---|
| EGGROLL core | Python | JAX | vmap for population parallelism, XLA compilation |
| EGG model | Python | JAX | int8 tensor operations via jnp |
| RWKV-7 model | Python | JAX (jaxrwkv) | Custom RWKV port for JAX |
| Throughput benchmarks | Python | JAX + CUDA | Hardware-level profiling |
| Nano-egg | Python | JAX | Single-file implementation |
Why JAX?
JAX is the enabling technology for EGGROLL. Several JAX features are critical:
jax.vmap: Automatically vectorizes the forward pass over the population dimension. Without vmap, implementing population parallelism would require explicit batching code.
# Without vmap: explicit loop (slow)
for i in range(N):
y_i = forward(params, x[i], perturbation[i])
# With vmap: automatic vectorization (fast)
batched_forward = jax.vmap(forward, in_axes=(None, 0, 0))
y = batched_forward(params, x, perturbation_ids)
jax.random.fold_in: Deterministic PRNG that allows per-member perturbation generation without communication:
# Each member gets a unique but deterministic perturbation
key_i = jax.random.fold_in(base_key, thread_id)
perturbation_i = jax.random.normal(key_i, param.shape)
XLA compilation: JAX's just-in-time compilation fuses the EGGROLL operations into efficient GPU kernels, avoiding Python overhead.
Integer support: JAX supports int8 tensor operations via jnp.int8, enabling the EGG experiments.
Multi-device: JAX's pmap enables multi-GPU parallelism with minimal code changes.
Why Not PyTorch?
The paper's authors chose JAX specifically because:
- PyTorch's vmap (functorch) is less mature than JAX's
- PyTorch's PRNG system doesn't have an equivalent to fold_in for deterministic per-member perturbations
- JAX's XLA compilation provides better kernel fusion for the EGGROLL computation pattern
- PyTorch's int8 support is primarily for inference quantization, not training
Code Organization
HyperscaleES/
├── eggroll/
│ ├── core.py # EGGROLL algorithm implementation
│ ├── perturbation.py # Low-rank perturbation generation
│ ├── update.py # Parameter update rules
│ └── utils.py # Fitness normalization, logging
├── models/
│ ├── egg.py # EGG (int8 GRU) architecture
│ ├── rwkv.py # RWKV-7 wrapper for EGGROLL
│ └── mlp.py # Simple MLP for RL experiments
├── tasks/
│ ├── countdown.py # Countdown reasoning task
│ ├── gsm8k.py # GSM8K evaluation
│ ├── language_model.py # Character-level LM (MiniPile)
│ └── rl_envs.py # RL environment wrappers
├── configs/
│ ├── egg_minipile.yaml # EGG pretraining config
│ ├── rwkv_countdown.yaml # RWKV countdown config
│ └── rwkv_gsm8k.yaml # RWKV GSM8K config
└── scripts/
├── train.py # Main training script
├── benchmark.py # Throughput benchmarking
└── evaluate.py # Model evaluation
13 Memory Management
Perturbation Memory: The Key Innovation
The most important memory optimization in EGGROLL is that perturbations are never stored — they are regenerated from RNG seeds:
Naive ES memory:
N population members × d parameters × 4 bytes (float32)
= N × d × 4 bytes
For N = 10^6, d = 7×10^9 (7B model):
= 10^6 × 7×10^9 × 4 bytes = 2.8 × 10^16 bytes = 28 PB ← Obviously impossible!
EGGROLL memory:
N population members × 1 seed (8 bytes)
+ shared model parameters (d × 2 bytes for float16)
For N = 10^6, d = 7×10^9:
= 8 MB (seeds) + 14 GB (model) = 14.008 GB ← Fits on one GPU!
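Checking the arithmetic behind the two budgets takes only a few lines (population size and parameter count taken directly from the text):

```python
N = 10**6           # population size
d = 7 * 10**9       # parameter count (7B model)

naive_bytes = N * d * 4              # one float32 perturbation per member
eggroll_bytes = N * 8 + d * 2        # 8-byte seed per member + float16 weights

print(f"naive ES: {naive_bytes / 1e15:.0f} PB")
print(f"EGGROLL:  {eggroll_bytes / 1e9:.3f} GB")
```

The six-orders-of-magnitude gap comes entirely from replacing stored noise tensors with regenerable seeds.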
Gradient Accumulation Memory
The gradient is accumulated online without storing all perturbations:
# Memory-efficient gradient accumulation (sketch; `chunks` yields the
# population indices in groups of `chunk_size`)
gradient = jnp.zeros_like(W)
for batch_of_members in chunks(range(N), chunk_size):
    # Regenerate this chunk's perturbations on the fly from their keys
    perturbations = jax.vmap(generate_perturbation)(keys[batch_of_members])
    # Evaluate fitness for each perturbed member
    fitnesses = jax.vmap(evaluate)(perturbations)
    # Accumulate the fitness-weighted sum of perturbations
    gradient = gradient + jnp.einsum('i,ijk->jk', fitnesses, perturbations)
Peak memory: model_size + chunk_size × perturbation_overhead
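A toy, stdlib-only version of this loop (hypothetical `fitness`, tiny dimensions for clarity) makes the peak-memory claim concrete: only one perturbation exists at a time, yet the accumulated gradient matches a single full-population pass.

```python
import random

d, N, chunk_size = 4, 8, 3
seeds = list(range(N))

def perturbation(seed):
    # Regenerated on demand from a seed; never stored between chunks
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def fitness(eps):
    # Hypothetical fitness: reward perturbations with a positive first entry
    return eps[0]

gradient = [0.0] * d
for start in range(0, N, chunk_size):
    for seed in seeds[start:start + chunk_size]:
        eps = perturbation(seed)          # on-the-fly regeneration
        f = fitness(eps)
        gradient = [g + f * e for g, e in zip(gradient, eps)]
```

The real implementation vectorizes the inner loop with jax.vmap, but the memory behavior is identical: peak usage is bounded by the chunk, not the population.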
Memory Budget Breakdown (7B RWKV on 8× H100)
| Component | Per-GPU Memory | Total |
|---|---|---|
| Model weights (float16, replicated) | 14 GB | 14 GB × 8 |
| Input data buffer | 2 GB | 2 GB × 8 |
| Activation memory (per chunk) | 8 GB | 8 GB × 8 |
| Perturbation overhead (per chunk) | 0.5 GB | 0.5 GB × 8 |
| Gradient accumulator | 14 GB | 14 GB × 8 |
| Framework overhead | 2 GB | 2 GB × 8 |
| Total | ~40.5 GB | — |
| Available (H100 80GB) | 80 GB | — |
| Headroom | ~39.5 GB | — |
Memory Comparison: EGGROLL vs. Backprop (Adam)
| Component | Backprop + Adam | EGGROLL |
|---|---|---|
| Model weights | 14 GB (7B × 2 bytes) | 14 GB (same) |
| Gradients | 14 GB | 0 (computed online) |
| Adam momentum (m) | 14 GB | 0 (no optimizer state) |
| Adam variance (v) | 14 GB | 0 (no optimizer state) |
| Activations (for backward) | 20-40 GB | 0 (no backward pass) |
| Perturbation seeds | 0 | 8 MB |
| Gradient accumulator | 0 | 14 GB |
| Total | 76-96 GB | ~28 GB |
EGGROLL uses roughly a third of the memory of Adam-based backprop training for the same model, primarily because it eliminates optimizer state and activation storage for the backward pass.
14 Continued Learning
Within-Run Learning
EGGROLL's learning dynamics within a single training run:
- Population-based gradient estimation: Each step produces a gradient estimate from N fitness evaluations. Unlike backprop's deterministic gradient, the ES gradient estimate has variance ∝ 1/N.
- No momentum (EGG variant): The EGG model uses no optimizer state; each update is independent, so there is no "memory" of past gradients.
- Standard optimizers (RWKV variant): For LLM fine-tuning, EGGROLL can be combined with standard optimizers (SGD with momentum, Adam) applied to the ES gradient estimate.
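As a sketch of the last point, the ES gradient estimate can be fed to any first-order optimizer. This minimal SGD-with-momentum step over flat parameter lists is an illustrative helper, not the paper's implementation:

```python
def momentum_step(params, es_grad, velocity, lr=0.01, beta=0.9):
    # Treat the ES estimate like any other gradient signal
    velocity = [beta * v + g for v, g in zip(velocity, es_grad)]
    # Gradient *ascent* on fitness, so the scaled velocity is added
    params = [p + lr * v for p, v in zip(params, velocity)]
    return params, velocity

params, vel = [0.0, 0.0], [0.0, 0.0]
params, vel = momentum_step(params, [1.0, -1.0], vel)
```

Because the estimate is just a vector, swapping in Adam (or any optax transformation in the JAX codebase) requires no change to the ES machinery itself.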
Cross-Task Transfer
EGGROLL itself is a training algorithm, not a model. Transfer learning occurs at the model level:
- Pre-trained RWKV-7 weights serve as the initialization for fine-tuning
- EGGROLL fine-tuning preserves the base model's capabilities while adapting to new tasks
- The same EGGROLL implementation works across tasks (countdown, GSM8K, RL) without modification
Meta-Learning Potential
EGGROLL opens several meta-learning possibilities not yet explored in the paper:
| Direction | Description | Status |
|---|---|---|
| Learned rank selection | Adapt rank r during training based on gradient quality | Not implemented |
| Adaptive σ | Learn the perturbation scale per-layer | Standard ES technique, applicable |
| Population scheduling | Vary N over training (large early, small late) | Not explored |
| Multi-fidelity ES | Use cheap fitness approximations early, expensive later | Not explored |
| Fitness shaping | Learn the fitness transformation function | Standard ES technique, applicable |
Relationship to LoRA and Continual Fine-Tuning
EGGROLL has an interesting relationship to LoRA-based continual learning:
LoRA fine-tuning:
W' = W + B·A^T (always low-rank, r typically 16-64)
Multiple tasks: merge adapters or keep separate
EGGROLL fine-tuning:
W' = W + Σ(f_i · B_i · A_i^T) (full-rank after aggregation)
Multiple tasks: standard multi-task fitness function
Key difference: EGGROLL's updates are full-rank, so there's no
"adapter" to merge — the model weights are directly modified.
This is more like full fine-tuning in terms of expressiveness,
but achieved through inference-speed forward passes only.
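The "full-rank after aggregation" point is easy to verify: each member contributes a rank-1 outer product B_i·A_i^T, but their fitness-weighted sum is generally full-rank. A minimal 2×2 check (stdlib only; for a 2×2 matrix, nonzero determinant means full rank):

```python
def outer(b, a):
    # Rank-1 outer product b a^T as nested lists
    return [[bi * aj for aj in a] for bi in b]

def det2(M):
    # Determinant of a 2x2 matrix; zero iff the matrix is singular
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

P1 = outer([1.0, 0.0], [1.0, 0.0])   # one member's rank-1 perturbation
P2 = outer([0.0, 1.0], [0.0, 1.0])   # another member's rank-1 perturbation
assert det2(P1) == 0 and det2(P2) == 0   # each alone is singular (rank 1)

update = [[P1[i][j] + P2[i][j] for j in range(2)] for i in range(2)]
assert det2(update) != 0                 # the aggregate is full-rank
```

With N population members the aggregate can reach rank min(N, m, n), which is why EGGROLL's update behaves like full fine-tuning rather than like a LoRA adapter.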
Self-Play and Multi-Agent Extensions
The authors mention ongoing work on multi-agent optimization:
"We think that EGGROLL has strong potential to directly optimize LLMs with multi-agent awareness, breaking the best-of-k curse of RL."
This connects to their prior work on Social Deduction LLMs (Among Us) and suggests future applications in cooperative/competitive multi-agent training where end-to-end optimization through agent interactions is desirable but backprop through the interaction is intractable.
15 Applications
15.1 LLM Post-Training for Reasoning
The most immediately practical application: replacing or supplementing GRPO/RLHF for LLM reasoning training.
| Advantage | Description |
|---|---|
| No reward model needed | Fitness is computed directly from task output |
| No KL penalty tuning | ES naturally stays near initialization (small σ) |
| Non-differentiable rewards | Can optimize for exact match, code execution, tool use |
| Chain-of-thought optimization | Fitness includes reasoning chain quality |
| Multi-turn optimization | Natural extension to dialogue/agent settings |
Target tasks: - Mathematical reasoning (GSM8K, MATH, Olympiad problems) - Code generation (HumanEval, MBPP, SWE-bench) - Tool use optimization - Multi-step agent workflows
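Because fitness is computed directly from task output, a reward as blunt as exact-match is usable with no reward model and no KL term. A minimal, hypothetical example (the `exact_match_fitness` helper is illustrative, not from the paper's code):

```python
def exact_match_fitness(completion: str, answer: str) -> float:
    # Non-differentiable reward used directly as ES fitness:
    # no reward model, no KL penalty, no gradient through the metric
    return 1.0 if completion.strip().endswith(answer) else 0.0

assert exact_match_fitness("The answer is 42", "42") == 1.0
assert exact_match_fitness("The answer is 43", "42") == 0.0
```

The same pattern extends to code execution (fitness = tests passed) or tool use (fitness = task completed), none of which admit a usable gradient.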
15.2 Novel Architecture Exploration
EGGROLL enables training architectures that are impractical with backprop:
| Architecture Type | Why Backprop Fails | EGGROLL Solution |
|---|---|---|
| Pure integer NNs | No float gradients | Forward-pass only, int8 operations |
| Lookup-table layers | Non-differentiable | Black-box optimization |
| Discrete attention | Argmax not differentiable | Fitness-based selection |
| Spiking neural networks | Non-differentiable spikes | Direct fitness optimization |
| Neuromorphic architectures | Hardware-specific ops | Hardware-in-the-loop training |
| State machines + NNs | Discrete transitions | End-to-end fitness |
15.3 Neurosymbolic System Optimization
The paper highlights neurosymbolic optimization as a key future direction:
"We are particularly interested in the end-to-end optimization of neurosymbolic systems, since EGGROLL naturally handles nondifferentiable components within a model."
Neurosymbolic System (optimized end-to-end by EGGROLL):
Input → [Neural Encoder] → [Symbolic Reasoner] → [Neural Decoder] → Output
↑ ↑ ↑
Differentiable Non-differentiable Differentiable
↑ ↑ ↑
├──── EGGROLL optimizes ALL components simultaneously ────┤
Specific targets mentioned: - ROSA architecture for RWKV-8 (discrete memory system) - LLMs with external tool calls (non-differentiable tool invocations) - Code-writing agents (discrete code generation + execution)
15.4 Discrete Diffusion Model Training
The paper mentions discrete diffusion models as a target:
"We would also like to enable us to try EGGROLL on other LLMs, including Discrete Diffusion models for which the standard policy gradient theorem is technically intractable (due to the mask-based sampling procedure)."
Discrete diffusion models (MDLM, SEDD, etc.) generate text by iteratively demasking tokens. The masking/demasking procedure is non-differentiable, making standard policy gradients inapplicable. EGGROLL's black-box nature makes it directly applicable.
15.5 Hardware-Accelerated Integer Training
The EGG experiments suggest a new paradigm for hardware-efficient training:
Current Training Pipeline:
Model (float16/bfloat16) → Forward (float16) → Backward (float16) → Adam (float32)
GPU utilization: ~30-50% of peak FLOPS
EGGROLL Integer Pipeline:
Model (int8) → Forward (int8 matmul, int32 accum) → Fitness → Update (int8)
GPU utilization: ~80-91% of peak FLOPS (inference-speed)
H100 Peak FLOPS by Datatype:
float32: 67 TFLOPS
float16: 989 TFLOPS
bfloat16: 989 TFLOPS
int8: 1,979 TOPS ← 2x more than float16!
EGGROLL with int8 can theoretically achieve 2x the throughput
of float16 training, on top of the ~100x ES speedup.
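The integer pipeline's core operation, an int8 matmul accumulated in a wider integer type, can be sketched in plain Python; the range checks stand in for the hardware int8 datatype, and Python's unbounded ints stand in for the int32 accumulator:

```python
def int8_matmul(A, B):
    # Inputs must fit in int8; products are summed in a wide accumulator,
    # mirroring how tensor cores execute int8 GEMMs with int32 accumulation
    assert all(-128 <= x <= 127 for row in A for x in row)
    assert all(-128 <= x <= 127 for row in B for x in row)
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```

Since ES needs only forward passes, the whole training loop can run in this datatype, which is what lets EGG approach inference-level hardware utilization.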
15.6 Reinforcement Learning
While demonstrated only on standard benchmarks in this paper, EGGROLL is naturally suited to RL:
| RL Setting | EGGROLL Advantage |
|---|---|
| Sparse rewards | No gradient through sparse signal needed |
| Multi-agent | End-to-end team optimization |
| Real-world robotics | No simulator differentiability required |
| Safety-constrained | Fitness includes safety penalties directly |
| Sim-to-real transfer | Optimize directly in the target environment |
15.7 Scientific Discovery
EGGROLL could be applied to optimize scientific models where the evaluation is non-differentiable:
| Domain | Model | Fitness Function |
|---|---|---|
| Drug discovery | Molecular generator NN | Binding affinity (docking score) |
| Materials science | Crystal structure predictor | DFT energy (non-differentiable) |
| Protein design | Structure prediction NN | Experimental stability |
| Climate modeling | Neural weather model | Forecast accuracy |
15.8 Limitations and Boundary Conditions
| Limitation | Impact | Mitigation |
|---|---|---|
| Sample efficiency | ES needs many evaluations | Large population + high throughput |
| Gradient quality at rank-1 | May miss important directions | Increase rank r if needed |
| Exploration vs. exploitation | Fixed σ limits exploration | Adaptive σ scheduling |
| No second-order information | Cannot exploit curvature | Larger population compensates |
| Hardware requirements | H100 for headline results | Scales down to A100, less dramatically |
| JAX ecosystem | Smaller than PyTorch | PyTorch port underway |
| Transformer support | KV-cache memory issue | vLLM/Megatron port in progress |
15.9 Comparison with Related Training Paradigms
| Paradigm | Gradient | Memory | Throughput | Generality |
|---|---|---|---|---|
| Backpropagation | Exact | High (activations) | ~30% of inference | Differentiable only |
| REINFORCE | Noisy | Low | ~50% of inference | Any reward |
| PPO/GRPO | Noisy + baseline | Medium | ~30% of inference | Any reward |
| Standard ES | Noisy | Very high (perturbations) | ~1% of inference | Any fitness |
| EGGROLL | Noisy (low-rank) | Low | ~91% of inference | Any fitness |
| Zeroth-order methods | Finite differences | Low | ~50% of inference | Any function |
15.10 The EGGROLL Vision: Inference IS Training
The paper's most provocative claim is that EGGROLL fundamentally changes the relationship between inference and training:
Traditional view:
Training ≠ Inference
Training: forward + backward + optimizer = expensive
Inference: forward only = cheap
Cost ratio: Training / Inference ≈ 3-5x
EGGROLL view:
Training ≈ Inference
Training: forward + perturbation + fitness = almost as fast as inference
Inference: forward only
Cost ratio: Training / Inference ≈ 1.1x
Implication: Any system that can run batched inference can also TRAIN,
with only 10% overhead. This means:
- Edge devices can self-improve
- Inference servers can simultaneously fine-tune
- Any fitness function becomes a training signal
- The distinction between deployment and training dissolves
This vision, if realized at larger scale and across more architectures, would represent a fundamental shift in how ML systems are deployed and improved.
This analysis is based on the arXiv paper (2511.16652v2), the project website (eshyperscale.github.io), the open-source code repositories (HyperscaleES, nano-egg, jaxrwkv), and supplementary information from the AlphaXiv discussion page. EGGROLL represents a significant advance in making evolution strategies practical for billion-parameter model training, with implications spanning LLM post-training, novel architecture exploration, and the fundamental relationship between inference and learning.