EGGROLL: Hyperscale Evolution Strategies
Part P06: Evolutionary Scaling & Efficiency
Evolution strategies (ES) have long occupied an elegant but peripheral niche in the optimization landscape: theoretically appealing for their gradient-free generality, yet computationally impractical at the scale demanded by modern deep learning. EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning — dismantles this computational barrier. Published in November 2025 by a collaboration spanning the University of Oxford, MILA, and NVIDIA (arXiv:2511.16652), EGGROLL demonstrates that structuring perturbations as low-rank matrices transforms evolution strategies from a curiosity into a billion-parameter training method that operates at 91% of pure inference throughput on H100 hardware. This chapter provides a detailed technical analysis of the algorithm, its theoretical foundations, the novel EGG integer architecture it enables, and its implications for the relationship between inference and learning.
Key Contribution
EGGROLL replaces the unstructured Gaussian perturbations of standard evolution strategies with rank-$r$ structured perturbations, achieving a reported 100× speedup over naive ES for billion-parameter models on H100 GPUs. Crucially, while individual perturbations are low-rank, the aggregated population update recovers full-rank expressiveness — making EGGROLL strictly more expressive than LoRA-based ES approaches. This enables, for the first time, population sizes of $\sim\!10^6$ for billion-parameter LLM training at near-inference cost, and demonstrates pure integer (int8) neural network pretraining without any floating-point arithmetic or explicit activation functions.
29.1 Overview and Motivation
29.1.1 The Computational Barrier of Classical ES
Standard evolution strategies perturb each parameter of a model independently. For a weight matrix $W \in \mathbb{R}^{m \times n}$, a perturbation $\varepsilon \sim \mathcal{N}(0, I_{mn})$ requires storing $m \times n$ random values per population member and computing a batched matrix multiplication $x(W + \sigma\varepsilon)^\top$ for each member. At the scale of modern language models (billions of parameters, evaluated over millions of population members) this becomes prohibitively expensive. The batched multiplications exhibit low arithmetic intensity on GPU hardware: many random memory accesses, poor cache utilization, and throughput far below peak FLOP/s. For a 7B-parameter model with a population of $10^6$, naive ES would have to materialize roughly $7 \times 10^{15}$ perturbation values, which is about 7 petabytes at one byte per value and 14 petabytes in float16. Either figure is infeasible on any existing hardware.
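The storage estimate is simple enough to check by hand. The sketch below does the arithmetic in plain Python; the helper name `naive_es_values` and the two storage widths (one byte per value and float16) are assumptions chosen for illustration, not figures taken from the paper.

```python
def naive_es_values(num_params: int, population: int) -> int:
    """Random values needed if every member stores a full, unstructured perturbation."""
    return num_params * population


num_params = 7_000_000_000  # 7B-parameter model
population = 1_000_000      # 10^6 population members

values = naive_es_values(num_params, population)

# Assumed storage widths: 1 byte per value (e.g. int8) and 2 bytes per value (float16).
petabytes_1byte = values * 1 / 1e15
petabytes_fp16 = values * 2 / 1e15

print(f"{values:.1e} values")
print(f"{petabytes_1byte:.0f} PB at 1 byte/value, {petabytes_fp16:.0f} PB in float16")
```

Regenerating each perturbation from a seed, as EGGROLL does, avoids materializing any of these values.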
29.1.2 Why Evolution Strategies Still Matter
Despite this computational disadvantage, evolution strategies possess properties that backpropagation-based training fundamentally cannot provide. ES requires only forward passes and scalar fitness evaluations — no differentiable loss function, no activation storage for backward passes, no optimizer state. This generality enables optimization of non-differentiable objectives, discrete architectures, integer-only computation, black-box fitness functions, and end-to-end optimization through non-differentiable components such as tool calls, code execution, or hardware-in-the-loop evaluation. The question EGGROLL addresses is not whether ES is theoretically valuable, but whether it can be made computationally viable at the scales that matter.
29.1.3 Research Context and Lineage
EGGROLL builds on three lines of prior work. First, OpenAI's Evolution Strategies (Salimans et al., 2017) demonstrated ES as a scalable alternative to reinforcement learning for small neural networks but could not extend to billion-parameter models. Second, Noise-Reuse ES (Vicol et al., 2023) introduced the technique of reusing perturbations across multiple token positions in sequence processing, reducing memory from $O(T \times d)$ to $O(d)$ per population member, where $T$ is sequence length and $d$ is parameter count. Third, LoRA (Hu et al., 2022) demonstrated that low-rank matrix decompositions are effective for parameter-efficient adaptation — EGGROLL borrows the structural insight but applies it to perturbations rather than adapters, a critical distinction explored in Section 29.3. The work was developed concurrently with ES-LLM (arXiv:2509.24372), which applies ES to LLM training with small populations ($\sim$10 members); EGGROLL takes the opposite approach, enabling massive populations ($\sim$10$^6$) through computational efficiency.
The work originates from the Foundations of Reinforcement Learning (FORL) group at Oxford, led by Jakob Foerster, with co-leads Bidipta Sarkar, Mattie Fellows, and Juan Agustin Duque. The collaboration with MILA (Aaron Courville's group) and NVIDIA brings scalability expertise. Sarkar's prior work on Social Deduction LLMs using RWKV for multi-agent game-playing directly motivated the choice of RWKV as the primary LLM architecture for experiments (arXiv:2511.16652, Section 1).
29.2 Architecture
29.2.1 System Overview
EGGROLL's architecture comprises four stages executed in sequence for each training step: (1) a population manager that generates deterministic rank-$r$ perturbations from RNG seeds, (2) a batched forward pass that decomposes the perturbed computation into a shared standard matrix multiplication plus fast per-member rank-$r$ operations, (3) fitness evaluation of each population member's output, and (4) a fused gradient estimation and parameter update that aggregates low-rank perturbations weighted by centered fitness into a full-rank update. The entire system is implemented in JAX, leveraging jax.vmap for population parallelism and XLA compilation for kernel fusion.
29.2.2 Parallelism Model
EGGROLL's memory model is the key to its scalability. Model weights are stored once in GPU memory (shared across all population members). Each population member is identified by a single integer thread_id; its perturbation is regenerated on-the-fly from a shared base RNG key via jax.random.fold_in(base_key, thread_id), costing $O(1)$ storage per member rather than $O(mn)$ for naive ES. On a single H100 GPU with 80 GB memory, a 7B model in float16 occupies approximately 14 GB, leaving ample headroom for thousands of population members per pass with gradient accumulation enabling millions.
For multi-GPU training, EGGROLL partitions the population across devices. Each GPU evaluates its share of members independently — no inter-GPU communication is needed during forward passes because perturbations are deterministically recomputable from the shared seed. Only scalar fitness values and accumulated gradient contributions are communicated via AllReduce. This communication pattern is dramatically lighter than the activation/gradient exchange required by data-parallel backpropagation (arXiv:2511.16652, Section 3).
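A minimal sketch of this communication pattern, simulated on a single device, is given here. It assumes rank-1 perturbations and uses hypothetical helper names (`member_contribution`, `partial_sum`); the plain addition at the end stands in for the AllReduce and is not the repository's multi-GPU code.

```python
import jax
import jax.numpy as jnp


def member_contribution(base_key, thread_id, fitness, shape):
    """Regenerate one member's rank-1 perturbation from its seed and weight it by fitness."""
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    factors = jax.random.normal(key, (m + n, 1))
    A, B = factors[:m], factors[m:]
    return fitness * (A @ B.T)


def partial_sum(base_key, thread_ids, fitnesses, shape):
    """Accumulated contribution of the members owned by one (simulated) device."""
    contribs = jax.vmap(
        lambda t, f: member_contribution(base_key, t, f, shape)
    )(thread_ids, fitnesses)
    return contribs.sum(axis=0)


base_key = jax.random.PRNGKey(0)
shape = (4, 3)
thread_ids = jnp.arange(8)
fitnesses = jnp.linspace(-1.0, 1.0, 8)  # stand-in for centered fitness values

# Reference: every member evaluated on one device.
full = partial_sum(base_key, thread_ids, fitnesses, shape)

# Two simulated devices, each owning half of the thread IDs. Only the
# accumulated (m, n) contributions are combined; no perturbations are exchanged.
dev0 = partial_sum(base_key, thread_ids[:4], fitnesses[:4], shape)
dev1 = partial_sum(base_key, thread_ids[4:], fitnesses[4:], shape)
all_reduced = dev0 + dev1

print(bool(jnp.allclose(full, all_reduced, atol=1e-5)))
```

Because each device derives its members' keys from the same `base_key`, the partitioned result matches the single-device result up to floating-point summation order.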
29.2.3 Implementation Stack
The implementation is organized across three public repositories. The main library, ESHyperscale/HyperscaleES, contains the JAX-based EGGROLL algorithm, model wrappers, task definitions, and training scripts. The single-file ESHyperscale/nano-egg repository provides a minimal implementation of the EGG integer training experiment. The RWKV-7 model port to JAX is at bsarkar321/jaxrwkv, with the earliest EGGROLL prototype committed on August 13, 2025. All three repositories are open-source.
| Component | Technology | Rationale |
|---|---|---|
| Core algorithm | Python / JAX | jax.vmap for automatic population vectorization; XLA compilation for kernel fusion |
| RNG system | JAX splittable PRNG | fold_in enables deterministic, communication-free perturbation generation |
| LLM architecture | RWKV-7 (JAX port) | Fixed-size state (no growing KV-cache); predictable memory per population member |
| Integer operations | JAX jnp.int8 | Native int8 tensor operations for EGG experiments |
| Multi-GPU | jax.pmap | Data-parallel population partitioning with minimal communication |
The choice of JAX over PyTorch is deliberate: JAX's vmap is more mature than PyTorch's functorch equivalent, its PRNG system provides fold_in for deterministic per-member key derivation (no PyTorch equivalent), and XLA compilation fuses the EGGROLL computation pattern more effectively. The authors note that a PyTorch port is underway, along with Transformer support via vLLM/Megatron integration to address the KV-cache memory challenge (arXiv:2511.16652, Section 7).
Repository Structure and Implementation Status
The following table summarizes the approximate organization of the HyperscaleES repository as reported in the project documentation and inferred from repository browsing. Exact module names and file paths may vary across commits; readers should consult the repository directly for authoritative structure.
| Directory / Module | Purpose | Key Primitives Used |
|---|---|---|
| Core algorithm module | EGGROLL perturbation generation and gradient estimation | jax.random.fold_in, jax.random.normal |
| Model wrappers | RWKV-7, EGG (int8 GRU), and MLP model definitions | jax.vmap, jnp.int8 |
| Task definitions | Countdown, GSM8K, character-level LM, RL environments | Task-specific fitness functions |
| Config files | YAML configurations for EGG/MiniPile, RWKV/Countdown, RWKV/GSM8K | — |
| Training scripts | Main training entry point, throughput benchmarking, evaluation | jax.pmap for multi-GPU |
The jaxrwkv repository provides the JAX port of RWKV-7, wrapping the model so that EGGROLL can perturb the time-mixing and channel-mixing weight matrices in each RWKV block. The EGGROLL perturbation logic applies jax.vmap over the population dimension (thread IDs), while multi-GPU scaling uses jax.pmap to partition population members across devices.
| Feature | Status | Source |
|---|---|---|
| EGGROLL core algorithm (JAX) | Implemented and released | HyperscaleES repo |
| EGG int8 pretraining | Implemented and released | nano-egg repo |
| RWKV-7 JAX port + EGGROLL integration | Implemented and released | jaxrwkv repo |
| Multi-GPU via jax.pmap | Implemented | Paper Section 3 |
| Noise-Reuse ES for sequences | Implemented | Paper Section 3.3 |
| PyTorch port | Listed as future work | Paper Section 7 |
| Transformer support (vLLM/Megatron) | Listed as future work | Paper Section 7 |
| Discrete diffusion model training | Listed as future direction | Paper Section 7 |
29.3 Core Algorithms
Notation and Conventions
The following conventions are used consistently throughout this section. All vectors are row vectors unless otherwise stated.
| Symbol | Definition | Dimensions |
|---|---|---|
| $W$ | Weight matrix for a single layer | $\mathbb{R}^{m \times n}$ ($m$ = output dim, $n$ = input dim) |
| $x$ | Input activation (row vector) | $\mathbb{R}^{1 \times n}$ |
| $y$ | Output activation | $\mathbb{R}^{1 \times m}$ |
| $A_i$ | Per-member perturbation factor (output side) | $\mathbb{R}^{m \times r}$ |
| $B_i$ | Per-member perturbation factor (input side) | $\mathbb{R}^{n \times r}$ |
| $r$ | Perturbation rank (typically $r = 1$) | Scalar |
| $\sigma$ | Perturbation scale (noise standard deviation) | Scalar $> 0$ |
| $N$ | Population size | Scalar |
| $F$ | Fitness function | $\mathbb{R}^d \to \mathbb{R}$ |
| $F_i$ | Raw fitness of population member $i$: $F(\theta_i)$ | Scalar |
| $\bar{F}$ | Mean fitness: $(1/N)\sum_{i=1}^{N} F_i$ | Scalar |
| $\tilde{F}_i$ | Centered fitness: $F_i - \bar{F}$ | Scalar |
The forward pass convention is $y = xW^\top$ (row-vector input, transposed weight), consistent with standard deep learning frameworks where nn.Linear stores weights as $(\text{out\_features}, \text{in\_features})$.
29.3.1 Standard ES Gradient Estimation
In standard evolution strategies, the gradient of a smoothed objective is estimated by perturbing parameters with isotropic Gaussian noise. For parameters $\theta \in \mathbb{R}^d$ and fitness function $F$, the ES gradient estimator is:
$$\nabla_\theta\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\!\left[F(\theta + \sigma\varepsilon)\right] = \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\!\left[F(\theta + \sigma\varepsilon)\,\varepsilon\right]$$
where $\theta$ is the current parameter vector, $\sigma > 0$ is the perturbation scale (noise standard deviation), $\varepsilon \sim \mathcal{N}(0, I_d)$ is an isotropic Gaussian perturbation vector of the same dimensionality as $\theta$, and $F : \mathbb{R}^d \to \mathbb{R}$ is the scalar fitness function. This identity follows from the log-derivative trick applied to the Gaussian density. In practice, the expectation is approximated by a finite population of $N$ members, and antithetic sampling (mirrored perturbations $\pm\varepsilon$) is typically used to reduce variance.
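As a sanity check on the estimator, the following sketch compares an antithetic Monte Carlo estimate with the analytic gradient of a toy quadratic fitness. The fitness function, the dimension, and the sample count are illustrative assumptions; none of them come from the paper.

```python
import jax
import jax.numpy as jnp


def fitness(theta, target):
    """Toy fitness: negative squared distance to a target (higher is better)."""
    return -jnp.sum((theta - target) ** 2)


def es_gradient(key, theta, target, sigma, num_pairs):
    """Antithetic ES estimate of the gradient of the smoothed fitness at theta."""
    eps = jax.random.normal(key, (num_pairs, theta.shape[0]))
    f_plus = jax.vmap(lambda e: fitness(theta + sigma * e, target))(eps)
    f_minus = jax.vmap(lambda e: fitness(theta - sigma * e, target))(eps)
    # Mirrored pairs: (F(theta + sigma*eps) - F(theta - sigma*eps)) / (2*sigma) * eps
    weights = (f_plus - f_minus) / (2.0 * sigma)
    return (weights[:, None] * eps).mean(axis=0)


theta = jnp.zeros(10)
target = jnp.arange(10, dtype=jnp.float32)
estimate = es_gradient(jax.random.PRNGKey(0), theta, target, sigma=0.1, num_pairs=4096)

# Analytic gradient of the toy fitness, used only to validate the estimate.
true_grad = -2.0 * (theta - target)
cosine = jnp.dot(estimate, true_grad) / (
    jnp.linalg.norm(estimate) * jnp.linalg.norm(true_grad)
)
print(float(cosine))
```

With a few thousand mirrored pairs in ten dimensions, the estimate points almost exactly along the true gradient. Only forward evaluations of `fitness` are used.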
For a weight matrix $W \in \mathbb{R}^{m \times n}$, this requires sampling a full $m \times n$ perturbation matrix per population member and computing the batched multiplication $x(W + \sigma\varepsilon)^\top$ — the operation that becomes computationally prohibitive at scale.
29.3.2 EGGROLL: Low-Rank Structured Perturbations
EGGROLL replaces the unstructured perturbation $\varepsilon \in \mathbb{R}^{m \times n}$ with a rank-$r$ structured perturbation. For weight matrix $W \in \mathbb{R}^{m \times n}$, the perturbed weight is:
$$W_i = W + \sigma A_i B_i^\top, \qquad A_i \in \mathbb{R}^{m \times r},\; B_i \in \mathbb{R}^{n \times r} \tag{29.1}$$
where $r$ is the perturbation rank (typically $r = 1$), and $A_i$, $B_i$ are independently sampled Gaussian matrices for the $i$-th population member. The product $A_i B_i^\top \in \mathbb{R}^{m \times n}$ is a rank-$r$ matrix with the same dimensions as $W$, requiring only $O((m + n)r)$ random samples per layer per population member instead of $O(mn)$.
The computational advantage emerges from decomposing the perturbed forward pass. Starting from the perturbed weight $W_i = W + \sigma A_i B_i^\top$ (Eq. 29.1), the output for population member $i$ with input $x \in \mathbb{R}^{1 \times n}$ is derived as follows:
$$y_i = x W_i^\top = x\,(W + \sigma A_i B_i^\top)^\top \tag{29.2}$$
Distributing the transpose and expanding:
$$y_i = xW^\top + \sigma\, x\,(A_i B_i^\top)^\top = xW^\top + \sigma\, x B_i A_i^\top \tag{29.3}$$
where we used the identity $(A_i B_i^\top)^\top = B_i A_i^\top$. This yields the EGGROLL forward-pass decomposition:
$$y_i = \underbrace{xW^\top}_{\text{shared}} + \sigma\, \underbrace{(xB_i)A_i^\top}_{\text{per-member, rank } r} \tag{29.4}$$
This decomposition is the core of EGGROLL's speedup. The term $xW^\top$ uses the same weight matrix for every population member, so it is evaluated as one standard (non-batched) matrix multiplication. The per-member computation consists of two steps: first, $xB_i \in \mathbb{R}^{1 \times r}$ projects the input into a low-dimensional space (a scalar when $r = 1$); second, $(xB_i)A_i^\top \in \mathbb{R}^{1 \times m}$ maps back to the output dimension. When $r = 1$, this reduces to a scalar-vector multiplication (an outer product when a batch of inputs is processed), which costs only $O(m + n)$ operations per input row. The bottleneck operation changes from a batched full-rank matrix multiplication (low arithmetic intensity, poor GPU utilization) to one shared standard matmul (high arithmetic intensity, near-peak utilization) plus cheap batched vector operations.
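The equivalence between the direct and decomposed forms can be verified numerically. The sketch below uses small, arbitrary dimensions chosen for illustration and compares the two computations for a single population member.

```python
import jax
import jax.numpy as jnp

key_w, key_x, key_a, key_b = jax.random.split(jax.random.PRNGKey(0), 4)
m, n, r, sigma = 5, 7, 2, 0.1

W = jax.random.normal(key_w, (m, n))  # weight, stored as (out, in)
x = jax.random.normal(key_x, (1, n))  # row-vector input
A = jax.random.normal(key_a, (m, r))  # output-side factor
B = jax.random.normal(key_b, (n, r))  # input-side factor

# Direct form: materialize the perturbed (m, n) weight, then multiply.
y_direct = x @ (W + sigma * (A @ B.T)).T

# Decomposed form: shared matmul plus two thin per-member products.
y_decomposed = x @ W.T + sigma * (x @ B) @ A.T

print(y_direct.shape, bool(jnp.allclose(y_direct, y_decomposed, atol=1e-5)))
```

The decomposed form never builds the $m \times n$ perturbation, which is what removes the per-member full-rank matmul.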
29.3.3 Full-Rank Recovery via Population Aggregation
A critical theoretical concern is whether rank-$r$ perturbations limit the expressiveness of the parameter update. EGGROLL's key theoretical result demonstrates that this is not the case. Applying the ES gradient identity from Section 29.3.1 to the structured perturbation $A_i B_i^\top$, the aggregated update across the population is:
$$\Delta W = \frac{\eta}{N\sigma} \sum_{i=1}^{N} \tilde{F}_i\, A_i B_i^\top \tag{29.5}$$
where $\eta$ is the learning rate and $\tilde{F}_i = F_i - \bar{F}$ is the centered fitness of the $i$-th population member (centering by the population mean $\bar{F}$ is a standard variance-reduction technique in ES that does not bias the gradient estimate). Each term $A_i B_i^\top \in \mathbb{R}^{m \times n}$ has rank at most $r$, but the sum of $N$ such rank-$r$ matrices has rank at most $\min(Nr, \min(m, n))$. When $Nr \geq \min(m, n)$, the update $\Delta W$ can be full-rank. For $r = 1$ and a population of $N = 10^6$ with typical hidden dimensions of $m, n \leq 10^4$, this condition is satisfied by a wide margin, so the update has access to the full $\min(m,n)$-dimensional gradient space.
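The rank argument can be illustrated on a small matrix. In the sketch below, the helper names, the $8 \times 6$ shape, and the relative tolerance used to count singular values are assumptions made for this example.

```python
import jax
import jax.numpy as jnp


def aggregated_update(key, weights, m, n):
    """Sum of rank-1 terms w_i * a_i b_i^T, one term per population member."""
    num_members = weights.shape[0]
    key_a, key_b = jax.random.split(key)
    A = jax.random.normal(key_a, (num_members, m))  # a_i stored as rows
    B = jax.random.normal(key_b, (num_members, n))  # b_i stored as rows
    return jnp.einsum("i,im,in->mn", weights, A, B)


def numerical_rank(M, rel_tol=1e-4):
    """Count singular values above rel_tol times the largest one."""
    s = jnp.linalg.svd(M, compute_uv=False)
    return int((s > rel_tol * s[0]).sum())


m, n = 8, 6

# Three members with centered weights: rank is limited by the population size.
small = aggregated_update(jax.random.PRNGKey(0), jnp.array([1.0, -2.0, 1.0]), m, n)

# Twenty members: N * r exceeds min(m, n), so the sum reaches full rank.
fitness = jax.random.normal(jax.random.PRNGKey(1), (20,))
large = aggregated_update(jax.random.PRNGKey(2), fitness - fitness.mean(), m, n)

print(numerical_rank(small), numerical_rank(large))
```

With three members the aggregated matrix has rank 3. With twenty it reaches the full rank of 6, even though every individual term is rank-1.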
The paper proves a consistency theorem (arXiv:2511.16652, Section 4 and Appendix): under regularity conditions on the fitness function $F$ (the precise assumptions, including smoothness requirements and moment bounds, are detailed in the paper's appendix), as the parameter dimension $d \to \infty$ with fixed perturbation rank $r$, the EGGROLL gradient estimate converges to the standard ES gradient estimate. The approximation error between the two estimators decreases as $O(1/r)$. Informally, this result relies on the observation that in high-dimensional parameter spaces, the fitness function is well-approximated by a linear function in a neighborhood of the current parameters, so even rank-1 perturbations capture the dominant gradient direction.
One informal way to understand this convergence is by loose analogy to random projection methods: just as random low-dimensional projections can approximately preserve geometric structure in high dimensions (cf. the Johnson-Lindenstrauss lemma), random low-rank perturbations can approximately capture gradient information. However, this analogy is illustrative rather than formal — the consistency theorem rests on the specific structure of the ES gradient estimator and the properties of Gaussian random matrices, not on distance-preservation arguments. The reader is referred to the paper's appendix for the complete proof.
29.3.4 Critical Distinction: EGGROLL vs. LoRA + ES
It is important to distinguish EGGROLL from the simpler approach of applying ES to optimize LoRA adapters. In LoRA-based ES, the learnable parameters are the low-rank factors themselves, so the cumulative parameter update is permanently constrained to low rank:
$$W_{\text{LoRA}} = W_0 + U V^\top, \qquad U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r} \;\;\Rightarrow\;\; \operatorname{rank}(W_{\text{LoRA}} - W_0) \leq r$$
Here $W_0$ denotes the frozen base weight and $U$, $V$ the adapter factors that ES would optimize (symbols chosen to avoid a clash with the perturbation factors $A_i$, $B_i$). LoRA + ES restricts the model to a low-rank subspace of weight updates. EGGROLL uses low-rank perturbations as a computational device but accumulates them into full-rank parameter updates. This distinction is critical for pretraining, where full-rank updates are necessary to traverse the loss landscape effectively, versus fine-tuning, where low-rank adaptation may suffice. In EGGROLL, $A_i$ and $B_i$ are freshly sampled each step and never stored; they are the measurement instrument, not the learned quantity.
29.3.5 Arithmetic Intensity Analysis
The speedup is grounded in hardware arithmetic intensity: the ratio of floating-point operations (FLOPs) performed to bytes of memory transferred. For a standard matrix multiplication of an $m \times k$ matrix by a $k \times n$ matrix, arithmetic intensity grows in proportion to the matrix dimension (on the order of $k$ for square matrices), which is high and well-suited to GPUs. Batched full-rank perturbation multiplications ($N$ independent $m \times k$ by $k \times n$ multiplies) perform the same arithmetic per member but cannot share the weight operand across members: each multiply must read its own perturbed matrix, which requires $N$ separate kernel launches or a single large batched kernel with poor memory locality.
EGGROLL's decomposition (Eq. 29.4) produces three operations with distinct intensity profiles: (1) one shared standard matmul ($x W^\top$, intensity $\approx k$), (2) one batched matmul with $B_i \in \mathbb{R}^{n \times r}$ ($x B_i$, intensity $\approx \min(k, r)$), and (3) one batched outer product ($(x B_i) A_i^\top$, intensity $\approx r/3$ for $r = 1$). The shared matmul dominates runtime and achieves peak GPU utilization. The per-member operations are fast and vectorizable via jax.vmap. The result: EGGROLL achieves 91% of pure batch inference throughput on H100 hardware, versus approximately 1% for naive ES at large population sizes (arXiv:2511.16652, Figure 2). These throughput figures are specific to the H100 GPU and the authors' JAX implementation; different hardware or frameworks would yield different absolute numbers, though the relative advantage of EGGROLL's decomposition is architecture-general.
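A rough roofline-style count makes the contrast concrete. The sketch below is an idealized model that assumes each operand is read or written exactly once in float16; the function name `matmul_intensity` and the hidden size of 4096 are assumptions for illustration. Its constants depend on this counting convention, so they need not match the approximate figures quoted above.

```python
def matmul_intensity(rows: int, inner: int, cols: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte for a (rows x inner) @ (inner x cols) matmul, each operand touched once."""
    flops = 2 * rows * inner * cols
    values_moved = rows * inner + inner * cols + rows * cols
    return flops / (values_moved * bytes_per_value)


hidden = 4096

# Shared matmul: a batch of 4096 input rows against one common weight matrix.
shared = matmul_intensity(4096, hidden, hidden)

# Naive ES, per member: one input row against that member's own full weight matrix.
per_member_full = matmul_intensity(1, hidden, hidden)

# EGGROLL rank-1, per member: x @ B_i, then the scalar result times A_i^T.
project = matmul_intensity(1, hidden, 1)
expand = matmul_intensity(1, 1, hidden)

print(f"shared matmul:        {shared:8.1f} FLOPs/byte")
print(f"naive ES per member:  {per_member_full:8.3f} FLOPs/byte")
print(f"rank-1 project/expand:{project:8.3f} / {expand:.3f} FLOPs/byte")
```

Under this model the per-member rank-1 operations are also memory-bound, but they move only $O(m + n)$ values per member rather than $O(mn)$. Nearly all of the work therefore lands in the high-intensity shared matmul.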
29.3.6 Algorithm Pseudocode
The following pseudocode is author-written to illustrate the algorithm described in arXiv:2511.16652. It uses JAX primitives that are central to the actual implementation, but the function names, signatures, and exact code organization are simplified for pedagogical clarity. The actual repository (ESHyperscale/HyperscaleES) should be consulted for production-level implementation details.
```python
# EGGROLL core algorithm: author-written pseudocode illustrating
# the approach described in arXiv:2511.16652.
# Actual repository code may differ in naming, structure, and optimization.
import jax
import jax.numpy as jnp


def generate_perturbation(base_key, thread_id, shape, rank=1):
    """Generate rank-r perturbation factors from a single RNG seed.

    No storage required: the perturbation is recomputable from (base_key, thread_id).
    Uses jax.random.fold_in for deterministic, communication-free key derivation.

    Returns:
        A: (m, r) output-side perturbation factor
        B: (n, r) input-side perturbation factor
        such that A @ B.T is an (m, n) rank-r perturbation matrix.
    """
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    # Sample (m + n) * r Gaussian values and split into A and B
    params = jax.random.normal(key, (m + n, rank))
    A = params[:m]  # m x r (output-side factor)
    B = params[m:]  # n x r (input-side factor)
    return A, B


def eggroll_linear(base_key, sigma, W, x, thread_id, rank=1):
    """EGGROLL-perturbed linear layer: shared matmul + per-member rank-r perturbation.

    Implements Eq. (29.4): y = x W^T + sigma * (x B) A^T
    where A B^T is the rank-r perturbation added to W.
    """
    A, B = generate_perturbation(base_key, thread_id, W.shape, rank)
    # Shared term: W is common to all members, so under vmap this becomes
    # one standard matmul over the whole population batch
    y_base = x @ W.T  # (1, n) @ (n, m) -> (1, m)
    # Per-member perturbation: two thin products instead of a full (m, n) matmul
    y_perturb = sigma * (x @ B) @ A.T  # (1, n)@(n, r) -> (1, r); (1, r)@(r, m) -> (1, m)
    return y_base + y_perturb


# Vectorize over the population dimension using jax.vmap.
# in_axes: base_key shared (None), sigma shared (None), W shared (None),
#          x batched (0), thread_id batched (0)
batched_eggroll = jax.vmap(eggroll_linear, in_axes=(None, None, None, 0, 0))
```
```python
# EGGROLL gradient estimation and parameter update.
# Author-written pseudocode illustrating the fused update pattern
# described in arXiv:2511.16652. See HyperscaleES repo for actual implementation.
# Continues the listing above: relies on generate_perturbation.
import jax
import jax.numpy as jnp


def eggroll_update(params, fitnesses, base_key, sigma, lr, rank=1):
    """Fuse gradient computation and parameter update.

    Perturbations are regenerated on-the-fly from seeds and never stored.
    The accumulated gradient is full-rank when N*r >= min(m, n).
    Uses centered fitnesses (Eq. 29.5) to reduce variance without
    biasing the gradient estimate.
    """
    # Center fitnesses to reduce variance (standard ES technique)
    centered_fitnesses = fitnesses - fitnesses.mean()
    N = fitnesses.shape[0]

    updated_params = {}
    # Simplification: every layer reuses base_key, so layers with identical
    # shapes receive identical noise; a per-layer fold_in would decorrelate them.
    for layer_name, W in params.items():

        def accumulate(i, gradient):
            # Regenerate member i's factors from its seed (same call as the forward pass)
            A_i, B_i = generate_perturbation(base_key, i, W.shape, rank)
            # Rank-r contribution weighted by centered fitness
            # A_i @ B_i.T has shape (m, r) @ (r, n) = (m, n), matching W
            return gradient + centered_fitnesses[i] * (A_i @ B_i.T)

        # Accumulate online in float32; lax.fori_loop avoids N Python-level iterations
        gradient = jax.lax.fori_loop(0, N, accumulate, jnp.zeros(W.shape, jnp.float32))
        # Full-rank update: rank(gradient) <= min(N*r, min(m, n))
        updated_params[layer_name] = (W + lr * gradient / (N * sigma)).astype(W.dtype)
    return updated_params
```
29.3.7 Noise-Reuse for Sequence Processing
For language modeling tasks involving long sequences, EGGROLL incorporates Noise-Reuse ES (Vicol et al., 2023). Standard ES on sequences would require a new perturbation at each token position, incurring $O(T \times d)$ memory per population member where $T$ is the sequence length. Noise-Reuse reuses the same perturbation across multiple token positions and takes intermediate parameter updates within a single sequence. This reduces per-member memory to $O(d)$, independent of sequence length. The paper reports using this technique for all language modeling experiments (arXiv:2511.16652, Section 3.3).
29.4 The EGG Architecture: Integer Neural Network Training
29.4.1 Architecture Design
EGGROLL enables a previously impossible experiment: training a nonlinear recurrent neural network entirely in int8 arithmetic — no floating-point operations anywhere in the forward pass. The architecture, called EGG (Evolved Generative GRU), is a modified minGRU with the following properties: all weights are int8, all computations use int8 matrix multiplication with int32 accumulation followed by cast back to int8, and no explicit activation functions are used (arXiv:2511.16652, Section 5). The model has dimension $D = 256$ and $L = 6$ layers.
29.4.2 Integer Overflow as Nonlinearity
The most conceptually striking mechanism in the paper is the use of integer arithmetic overflow as a source of nonlinearity. In standard float32 arithmetic, multiplication is linear: $3.0 \times 100.0 = 300.0$. In int8 arithmetic, values exceeding the range $[-128, 127]$ either saturate (clamp to the boundary) or wrap (modular arithmetic). This means that the chain int8 matmul → int32 accumulate → cast to int8 introduces an implicit nonlinear transformation without any explicit activation function. Under saturating semantics the cast is:
$$\operatorname{cast}_{\text{int8}}(x) = \operatorname{clip}(x, -128, 127) = \max\!\big(-128, \min(127, x)\big)$$
where $x$ is the int32 accumulated result. This creates a piecewise-linear saturation behavior at the int8 boundaries, providing representational capacity that explicit activation functions normally supply. The paper cites prior OpenAI work showing that floating-point rounding can induce nonlinear computation in deep linear networks; EGGROLL extends this idea to the much more pronounced nonlinearity of int8 quantization (arXiv:2511.16652, Section 5.1). Whether the JAX implementation uses saturation semantics or wrap-around (modular) arithmetic depends on the specific int8 cast operation; both introduce nonlinearity, but with different functional forms.
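The two overflow behaviors can be compared directly. The sketch below spells out both casts explicitly instead of relying on the default behavior of a narrowing `astype`; the function names `cast_saturate` and `cast_wrap` are hypothetical and are not taken from the nano-egg repository.

```python
import jax.numpy as jnp


def cast_saturate(acc_int32):
    """Clamp an int32 accumulator to the int8 range before casting."""
    return jnp.clip(acc_int32, -128, 127).astype(jnp.int8)


def cast_wrap(acc_int32):
    """Reduce an int32 accumulator modulo 256 into the int8 range before casting."""
    return (((acc_int32 + 128) % 256) - 128).astype(jnp.int8)


# One in-range value and two values that overflow int8.
acc = jnp.array([50, 300, -200], dtype=jnp.int32)

print(cast_saturate(acc).tolist())  # [50, 127, -128]
print(cast_wrap(acc).tolist())      # [50, 44, 56]
```

Both casts are the identity inside $[-128, 127]$ and nonlinear outside it. Saturation is monotone, while wrap-around is periodic.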
29.4.3 Integer Parameter Update Rule
Because EGG operates entirely in int8, the parameter update rule is adapted accordingly. The following pseudocode illustrates the integer update logic described in the paper; the actual implementation in the nano-egg repository may differ in detail.
```python
# EGG integer update rule: author-written pseudocode illustrating
# the approach described in arXiv:2511.16652, Section 5.
# See ESHyperscale/nano-egg repository for the actual implementation.
import jax.numpy as jnp


def egg_update(W_int8, gradient_estimate, threshold=1):
    """Integer-compatible parameter update for EGG.

    No learning rate: step size is always ±1 in int8 space.
    No momentum or optimizer state.
    Threshold prevents noise-dominated updates.
    """
    update = jnp.where(
        jnp.abs(gradient_estimate) > threshold,
        jnp.sign(gradient_estimate),  # Step by ±1 in int8
        0,
    ).astype(jnp.int32)
    # Add in int32 so that 127 + 1 cannot wrap around before clipping
    return jnp.clip(W_int8.astype(jnp.int32) + update, -128, 127).astype(jnp.int8)
```
This update has three notable properties: there is no learning rate (the step size is always $\pm 1$ in int8 space), there is no optimizer state (no momentum, no Adam buffers), and the threshold prevents noise from dominating updates. This extreme simplicity is possible precisely because EGGROLL's large population ($N = 2^{20} \approx 10^6$) provides a high-quality gradient estimate, and the int8 parameter space is discrete with only 256 possible values per weight.
29.4.4 Hardware Implications
The H100 GPU achieves 1,979 TOPS for int8 operations, twice the 989 TFLOPS available for float16/bfloat16. Because EGGROLL with int8 requires no backward pass, no float-precision optimizer state, and no activation storage, the EGG configuration achieves 10 million tokens per second on a single H100 with a population of $2^{20}$ members, as reported by the authors (arXiv:2511.16652, Section 5). The paper frames this as a demonstration that "inference IS training": any hardware capable of int8 inference can, with EGGROLL, also train models at near-identical throughput. These throughput numbers are specific to H100 Tensor Cores and the authors' optimized JAX kernels.
29.5 Key Results
Experimental Caveats
All results in this section are reported by the paper's authors (arXiv:2511.16652) under specific hardware and software configurations. Key caveats for interpreting these results:
- Throughput numbers (Section 29.5.1) are measured on H100 GPUs with the authors' JAX implementation and XLA compilation. Absolute throughput will vary on different hardware, frameworks, and model architectures.
- Reasoning task comparisons (Section 29.5.3) compare EGGROLL with RWKV-7 against GRPO with different base models (LLaMA, Qwen). These are cross-architecture, cross-model comparisons — not controlled experiments with matched base models. The RWKV and Transformer architectures differ in capacity, pretraining data, and tokenization.
- Cost estimates (Section 29.6.2) are author-reported and based on standard cloud rental prices at the time of publication. Hardware generations differ between EGGROLL (H100) and some baselines (A100).
- None of these results have been independently reproduced at the time of writing, though the open-source code enables verification.
29.5.1 Throughput: Reported 100× Speedup
The headline result is throughput parity with inference. On an H100 GPU, the authors report that EGGROLL achieves 91% of pure batch inference throughput for billion-parameter models, while naive ES achieves approximately 1% at large population sizes. The following table summarizes the throughput comparison reported in the paper:
| Configuration | Tokens/Second | % of Inference | GPU Utilization |
|---|---|---|---|
| Pure batch inference | ~10M tok/s | 100% | ~95% |
| EGGROLL (rank-1) | ~9.1M tok/s | 91% | ~87% |
| Backprop training | ~3.3M tok/s | 33% | ~90% |
| Naive ES | ~0.1M tok/s | 1% | ~5% |
The reported 100× speedup over naive ES comes specifically from the replacement of batched full-rank matrix multiplications with the EGGROLL decomposition (Eq. 29.4). The 9% gap between EGGROLL and pure inference is due to the per-member rank-$r$ vector operations, which while fast, are not zero-cost. The throughput advantage depends on population size: at very small populations, the overhead of EGGROLL's decomposition offers less benefit; the 91% figure applies at the large population sizes ($N \sim 10^6$) that are EGGROLL's target regime.
29.5.2 EGG: Integer Language Model Pretraining
The EGG model achieves a test loss of 3.40 bits/byte on MiniPile (character-level), trained entirely in int8 with a population of $2^{20}$ members and 16 sequences shared across the population (arXiv:2511.16652, Section 5, Table 2). While this loss is not competitive with float-precision language models of comparable size, it demonstrates a previously impossible capability: training a nonlinear RNN without any floating-point arithmetic or explicit activation functions. The result is significant as an existence proof rather than a state-of-the-art language modeling result.
29.5.3 LLM Reasoning Tasks
On the Countdown task (constructing arithmetic expressions to reach a target number), the paper compares EGGROLL with RWKV-7 against the GRPO results reported in the concurrent ES-LLM paper. EGGROLL is ahead at the 7B scale, while the picture at smaller scales is mixed:
| Model | Architecture | Method | Accuracy |
|---|---|---|---|
| LLaMA-3.2 1B | Transformer | GRPO (ES-LLM paper) | ~60% |
| RWKV 1.5B | Linear RNN | EGGROLL | ~65% |
| Qwen 2.5 1.5B | Transformer | GRPO (ES-LLM paper) | ~70% |
| RWKV 7B | Linear RNN | EGGROLL | ~80% |
| All 7B models | Transformer | GRPO (ES-LLM paper) | ~70% |
At the 7B scale, EGGROLL with RWKV-7 (80%) outperforms the best reported 7B Transformer results from the ES-LLM paper (~70%). However, these comparisons cross both the training method (EGGROLL vs. GRPO) and the base model (RWKV vs. LLaMA/Qwen Transformers). The higher accuracy could be attributable to EGGROLL's training method, RWKV-7's architecture, the specific pretraining data of the RWKV-7 checkpoint, or some combination. At the 1.5B scale, EGGROLL with RWKV (65%) exceeds LLaMA-3.2 1B with GRPO (60%) but falls below Qwen 2.5 1.5B with GRPO (70%), further illustrating the difficulty of separating method effects from model effects.
On GSM8K (grade school math), the paper reports that EGGROLL with RWKV 1.5B outperforms GRPO with the same RWKV 1.5B model (arXiv:2511.16652, Section 6). This same-model comparison is more controlled and provides stronger evidence for EGGROLL's effectiveness as a training method, though the exact accuracy figures and training budgets should be consulted in the paper for precise comparison. In tabula rasa RL settings (standard benchmarks), EGGROLL matches naive ES performance without the speed penalty.
29.5.4 Data Efficiency
An interesting finding concerns data sharing across the population. The paper compares two strategies: 512 population members sharing each sequence versus only 2 members sharing each sequence (paired). At large population sizes ($2^{20}$), both strategies achieve similar performance, suggesting that EGGROLL can extract useful gradient information even when many population members evaluate the same data. This is significant because it means the data throughput requirement does not scale linearly with population size (arXiv:2511.16652, Section 6.2).
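The effect of the sharing factor on data throughput can be made concrete with a small sketch. The contiguous-block assignment and the function names below are assumptions chosen for illustration; the paper's implementation may map members to sequences differently.

```python
def assign_sequences(population_size: int, members_per_sequence: int) -> list[int]:
    """Map each population member to the index of the sequence it evaluates.

    Assumption: members are grouped in contiguous blocks, so members
    0..k-1 share sequence 0, members k..2k-1 share sequence 1, and so on.
    """
    if population_size % members_per_sequence != 0:
        raise ValueError("population_size must be divisible by members_per_sequence")
    return [member // members_per_sequence for member in range(population_size)]


def sequences_per_step(population_size: int, members_per_sequence: int) -> int:
    """Number of distinct sequences consumed by one population evaluation."""
    return population_size // members_per_sequence


POPULATION = 2**20
for group_size in (512, 2):
    needed = sequences_per_step(POPULATION, group_size)
    print(f"{group_size:3d} members per sequence -> {needed} sequences per step")
```

With a population of $2^{20}$, sharing each sequence among 512 members needs 2,048 sequences per step, whereas sharing among 2 members needs 524,288, a 256-fold difference in data consumed for what the paper reports as similar final performance.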
29.6 Implementation Details: Cost, Compute, and Reproducibility
29.6.1 Memory Economics
EGGROLL's memory advantage over backprop-based training is substantial. The following comparison is for a 7B-parameter model, based on the analysis in arXiv:2511.16652, Section 3:
| Component | Backprop + Adam | EGGROLL |
|---|---|---|
| Model weights (float16) | 14 GB | 14 GB |
| Gradients | 14 GB | 0 (computed online) |
| Adam momentum ($m$) | 14 GB | 0 |
| Adam variance ($v$) | 14 GB | 0 |
| Activations (backward pass) | 20–40 GB | 0 |
| Perturbation seeds | 0 | ~8 MB |
| Gradient accumulator | 0 | 14 GB |
| Total | 76–96 GB | ~28 GB |
EGGROLL uses less than half the memory of Adam-based training, primarily by eliminating optimizer states (momentum and variance buffers) and activation storage for the backward pass. The per-member perturbation overhead during computation is negligible and transient: approximately 0.01 GB per member for a 7B model (storing only the rank-$r$ factors $A_i, B_i$ during the forward pass, then discarding them and regenerating from the seed during the gradient accumulation pass).
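The component rows of the table can be added up directly. The sketch below does that arithmetic under two stated assumptions that are not from the paper: decimal gigabytes (1 GB = $10^9$ bytes) and one 64-bit seed per population member, which is one way to arrive at the "~8 MB" seed figure for a population of $2^{20}$.

```python
def fp16_gb(n_values: float) -> float:
    """Size of n_values float16 numbers in decimal gigabytes (2 bytes each)."""
    return n_values * 2 / 1e9


N_PARAMS = 7e9
weights = fp16_gb(N_PARAMS)  # one full-size float16 buffer

# Backprop + Adam: weights, gradients, momentum, variance are all model-sized.
backprop_fixed = 4 * weights
activations_range = (20.0, 40.0)  # range taken from the table above
backprop_total = tuple(backprop_fixed + a for a in activations_range)

# EGGROLL: weights plus a model-sized accumulator, plus per-member seeds.
POPULATION = 2**20
SEED_BYTES = 8  # assumption: one 64-bit integer seed per member
seeds_gb = POPULATION * SEED_BYTES / 1e9
eggroll_total = 2 * weights + seeds_gb

print(f"one fp16 buffer: {weights:.1f} GB")
print(f"backprop + Adam: {backprop_total[0]:.0f} to {backprop_total[1]:.0f} GB")
print(f"seeds:           {seeds_gb * 1e3:.1f} MB")
print(f"EGGROLL:         {eggroll_total:.2f} GB")
```

The seed storage is three orders of magnitude smaller than any model-sized buffer, which is why it barely registers in the EGGROLL total.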
29.6.2 Training Cost Estimates
By the authors' estimates, EGGROLL makes ES cost-competitive with backprop-based training at billion-parameter scale for the first time. The following figures are reported in or derived from the paper:
| Method | Hardware | Time | Est. Cloud Cost |
|---|---|---|---|
| GRPO (1.5B) | 4× A100 | ~4 hours | ~$50 |
| Naive ES (1.5B, pop=1K) | 4× A100 | ~400 hours | ~$5,000 |
| EGGROLL (1.5B, pop=1M) | 4× H100 | ~4 hours | ~$80 |
| GRPO (7B) | 8× A100 | ~12 hours | ~$300 |
| EGGROLL (7B, pop=1M) | 8× H100 | ~8 hours | ~$320 |
Provenance and caveats: The cloud cost estimates are author-reported and based on standard H100/A100 rental prices at the time of publication. The hardware configurations and wall-clock times are from the paper's experimental section. These are not independently verified benchmark numbers; they represent the authors' reported experience and cost modeling. Because EGGROLL runs on H100 GPUs (higher per-hour cost) while the baselines in the table run on A100 GPUs (lower per-hour cost), the higher hourly price partially offsets EGGROLL's wall-clock advantage in the dollar-cost comparison. A fully controlled comparison would require running all methods on the same hardware generation with matched training budgets.
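To see how much of the dollar figure comes from the hardware price rather than the method, the table rows can be converted to GPU-hours and an implied hourly rate. The rates printed below are back-calculated from the table, not prices stated in the paper, and the `Run` class is a hypothetical helper introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    gpu: str
    n_gpus: int
    hours: float
    cost_usd: float

    @property
    def gpu_hours(self) -> float:
        return self.n_gpus * self.hours

    @property
    def usd_per_gpu_hour(self) -> float:
        return self.cost_usd / self.gpu_hours


# Rows copied from the cost table above (all values are approximate).
RUNS = [
    Run("GRPO (1.5B)", "A100", 4, 4, 50),
    Run("Naive ES (1.5B, pop=1K)", "A100", 4, 400, 5000),
    Run("EGGROLL (1.5B, pop=1M)", "H100", 4, 4, 80),
    Run("GRPO (7B)", "A100", 8, 12, 300),
    Run("EGGROLL (7B, pop=1M)", "H100", 8, 8, 320),
]

for run in RUNS:
    print(f"{run.name:24s} {run.gpu}  {run.gpu_hours:6.0f} GPU-h  "
          f"${run.usd_per_gpu_hour:.3f}/GPU-h")
```

The table is internally consistent with about $3.13 per A100 GPU-hour and $5.00 per H100 GPU-hour. At 7B, EGGROLL uses 64 GPU-hours against 96 for GRPO, yet the estimated cost is slightly higher because of the hourly price difference.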
29.6.3 Reproducibility Assessment
EGGROLL scores well on reproducibility. All core components are open-source: the EGGROLL algorithm (JAX) in HyperscaleES, the single-file EGG training in nano-egg, and the RWKV-7 JAX port in jaxrwkv. Pre-trained RWKV-7 weights are available on HuggingFace under Apache 2.0. The MiniPile dataset, Countdown task, and GSM8K are all publicly available. The algorithm is described in full mathematical detail with proofs in the paper's appendix.
The primary barrier to reproduction is hardware. Full paper reproduction requires access to H100 GPUs, because the headline throughput numbers depend specifically on H100 Tensor Core performance. Minimum hardware requirements range from 1× A100 80GB for nano-egg training to 8× H100 for RWKV 7B fine-tuning and up to 64× H100 for full paper reproduction (arXiv:2511.16652, Section 7). The nano-egg repository explicitly encourages community contributions in the spirit of the NanoGPT speedrun.
29.7 Comparative Analysis
29.7.1 EGGROLL in the ES Landscape
| System | Year | Perturbation | Population | Model Scale | Throughput |
|---|---|---|---|---|---|
| OpenAI ES (Salimans et al.) | 2017 | Full-rank | ~1,000 | Small NNs (MuJoCo) | Baseline |
| Uber ES (novelty + ES) | 2018 | Full-rank | ~1,000 | Small NNs (Atari) | ~1× baseline |
| ES-LLM | 2025 | Full-rank | ~10 | 1–7B LLMs | ~1× (avoids batched matmul) |
| LoRA + ES | 2025 | Low-rank adapters | ~100 | 1–7B LLMs | Moderate |
| EGGROLL | 2025 | Low-rank perturbations | ~1,000,000 | 1–7B LLMs | ~100× over naive ES (reported) |
The critical difference between EGGROLL and all prior ES work is the population scale. OpenAI ES and its successors were limited to populations of $\sim$1,000 members for small neural networks. ES-LLM compensates for small populations ($\sim$10) by using many rollouts per member to reduce variance. EGGROLL enables populations three orders of magnitude larger than the $\sim$1,000 used by OpenAI ES. Because the variance of the gradient estimate scales as $O(1/N)$ for $N$ i.i.d. perturbations, this reduces variance by a factor of roughly 1,000 relative to those populations, enabling stable updates with larger learning rates.
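The $O(1/N)$ scaling is easy to check empirically on a toy problem. The sketch below uses a vanilla full-rank ES estimator on a four-dimensional linear fitness function; the estimator form, the fitness function, and all constants are assumptions chosen to keep the example small, and nothing here uses EGGROLL's low-rank structure.

```python
import random
import statistics


def linear_fitness(theta):
    """Toy fitness: the sum of the parameters (true gradient is all ones)."""
    return sum(theta)


def es_gradient_estimate(fitness, theta, sigma, population, rng):
    """Vanilla ES estimate: the mean of F(theta + sigma * eps) * eps / sigma."""
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = fitness([t + sigma * e for t, e in zip(theta, eps)])
        for j in range(dim):
            grad[j] += score * eps[j] / (sigma * population)
    return grad


def first_coordinate_variance(population, trials=1000, seed=0):
    """Empirical variance of the first gradient coordinate over many trials."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0, 0.0]
    samples = [
        es_gradient_estimate(linear_fitness, theta, 0.1, population, rng)[0]
        for _ in range(trials)
    ]
    return statistics.variance(samples)


var_small = first_coordinate_variance(population=10)
var_large = first_coordinate_variance(population=100)
print(f"N=10:  variance {var_small:.4f}")
print(f"N=100: variance {var_large:.4f}")
print(f"ratio: {var_small / var_large:.1f} (O(1/N) predicts about 10)")
```

For this toy setup the per-sample variance works out to 5, so the two printed variances should land near 0.5 and 0.05; the same tenfold-per-decade behaviour is what makes a jump from $10^3$ to $10^6$ members worth a factor of about 1,000.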
29.7.2 EGGROLL vs. Backprop-Based LLM Training
| Property | Backprop + Adam | GRPO | Standard ES | EGGROLL |
|---|---|---|---|---|
| Gradient type | Exact (autodiff) | Policy gradient | ES estimate (noisy) | ES estimate (low-rank, noisy) |
| Requires differentiable loss | Yes | Yes (for KL penalty) | No | No |
| Requires backward pass | Yes | Yes | No | No |
| Memory overhead | High (activations + optimizer) | Medium | Very high (perturbations) | Low |
| Throughput (% of inference) | ~33% | ~30% | ~1% | ~91% (reported, H100) |
| Population size | 1 (microbatch) | 16–64 | ~1,000 | ~1,000,000 |
| Integer-only training | No | No | Theoretically yes | Demonstrated (EGG) |
| Non-differentiable components | No | Limited | Yes | Yes |
29.7.3 Relationship to LLM-Powered Evolutionary Systems
It is important to distinguish EGGROLL from the LLM-powered evolutionary systems surveyed elsewhere in this book (AlphaEvolve, FunSearch, OpenEvolve, etc.). Those systems use LLMs as mutation operators: the LLM proposes code modifications that are then evaluated. EGGROLL is fundamentally different: it trains LLMs (or any neural network) via evolution strategies. The LLM is the object being optimized, not the optimizer.
This positions EGGROLL as complementary to rather than competitive with program synthesis systems. In principle, EGGROLL could be used to train the LLMs that serve as mutation operators in AlphaEvolve-style systems, particularly for non-differentiable reward signals such as code execution correctness.
29.8 RWKV-7 Integration and LLM Choice
EGGROLL's primary LLM experiments use RWKV-7 ("Goose"), a linear-attention recurrent model, rather than a Transformer architecture. This is a deliberate engineering choice, not an algorithmic limitation. RWKV-7 has constant memory per token during generation: its recurrent state is fixed-size regardless of sequence length. In contrast, Transformer models accumulate a growing KV-cache during autoregressive generation, creating memory management challenges when running thousands of parallel population members simultaneously. With EGGROLL's population sizes of $\sim\!10^6$, predictable per-member memory is critical for fitting within GPU memory budgets.
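The difference in memory growth can be illustrated with back-of-the-envelope arithmetic. The layer count, width, and head size below are hypothetical round numbers, not the configuration of any RWKV-7 or Transformer checkpoint, and the formulas ignore refinements such as grouped-query attention or auxiliary token-shift state.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2):
    """Keys plus values for every layer and every token generated so far."""
    return 2 * n_layers * d_model * bytes_per_value * seq_len


def recurrent_state_bytes(n_layers, d_model, head_dim, bytes_per_value=2):
    """One head_dim x head_dim state matrix per head; independent of seq_len."""
    n_heads = d_model // head_dim
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


MIB = 2**20
N_LAYERS, D_MODEL, HEAD_DIM = 32, 4096, 64  # hypothetical model shape

state = recurrent_state_bytes(N_LAYERS, D_MODEL, HEAD_DIM)
for seq_len in (128, 1024, 8192):
    cache = kv_cache_bytes(seq_len, N_LAYERS, D_MODEL)
    print(f"seq_len={seq_len:5d}  KV-cache {cache / MIB:7.0f} MiB  "
          f"recurrent state {state / MIB:3.0f} MiB")
```

Under these assumptions the recurrent state stays at 16 MiB per sequence, while the KV-cache grows linearly and reaches 512 MiB at 1,024 tokens. Multiplied across a large number of concurrently evaluated members, only the fixed-size state gives a memory budget that can be planned in advance.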
The RWKV-7 models used are pre-trained weights from HuggingFace (BlinkDL), available under Apache 2.0. The JAX port at bsarkar321/jaxrwkv wraps the model such that EGGROLL perturbs the time-mixing (linear attention) and channel-mixing (FFN variant) weights in each RWKV block. The authors note that Transformer support via vLLM/Megatron integration is in progress to remove this architectural restriction (arXiv:2511.16652, Section 7).
The following pseudocode illustrates how EGGROLL perturbations are applied to the RWKV-7 architecture. It was written for this chapter, based on the algorithm description in the paper and the general structure of the jaxrwkv repository; it is not taken from the repository. The actual implementation's function names, parameter organization, and control flow may differ.
# RWKV-7 integration pattern for EGGROLL: illustrative pseudocode
# showing how EGGROLL perturbations are applied to RWKV-7 blocks.
# Actual implementation in bsarkar321/jaxrwkv may differ in structure.
# embed, initial_state, eggroll_time_mixing and eggroll_channel_mixing
# are placeholder helpers that are not defined here.
import jax

def eggroll_rwkv_forward(params, x, thread_id, base_key, sigma, rank=1):
    """Forward pass through RWKV-7 with EGGROLL perturbations.

    Shared base computation + per-member rank-r perturbations
    applied to time_mixing and channel_mixing weight matrices.
    RWKV-7's fixed-size recurrent state is critical: unlike Transformers,
    memory per population member does not grow with sequence length.
    """
    hidden = embed(params['embedding'], x)  # Token embedding (shared)
    state = initial_state(params)           # Fixed-size recurrent state
    for block_idx in range(params['n_layers']):
        block_params = params[f'block_{block_idx}']
        # Time mixing (linear attention), EGGROLL-perturbed.
        # Each block gets a unique sub-key for perturbation generation.
        block_key = jax.random.fold_in(base_key, block_idx)
        hidden, state = eggroll_time_mixing(
            block_key, sigma, block_params['time_mix'],
            hidden, state, thread_id, rank
        )
        # Channel mixing (FFN variant), EGGROLL-perturbed.
        # A second fold_in keeps its noise distinct from time mixing.
        channel_key = jax.random.fold_in(block_key, 1000000)
        hidden = eggroll_channel_mixing(
            channel_key, sigma, block_params['channel_mix'],
            hidden, thread_id, rank
        )
    logits = hidden @ params['head'].T  # Language model head
    return logits, state
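The key-folding pattern above exists so that a member's perturbation never has to be stored: it can be regenerated from a few integers whenever it is needed. The sketch below shows the same idea with NumPy instead of JAX. Seeding a generator from the triple (base seed, layer index, member id) is an assumption made for illustration, as are the function name and the unscaled standard-normal factors; the paper's key derivation and factor scaling may differ.

```python
import numpy as np


def low_rank_factors(base_seed, layer_idx, member_id, m, n, rank):
    """Regenerate the (A, B) factors of one member and layer from integers only.

    Assumption: the generator is seeded from the integer triple, playing the
    role that fold_in plays for JAX keys in the pseudocode above.
    """
    rng = np.random.default_rng([base_seed, layer_idx, member_id])
    a = rng.standard_normal((m, rank))
    b = rng.standard_normal((n, rank))
    return a, b


# Forward pass: build the factors, use them, then discard them.
a_fwd, b_fwd = low_rank_factors(base_seed=0, layer_idx=3, member_id=42, m=8, n=6, rank=2)

# Update pass: the same integers reproduce exactly the same factors.
a_upd, b_upd = low_rank_factors(base_seed=0, layer_idx=3, member_id=42, m=8, n=6, rank=2)
print("regenerated identically:", np.array_equal(a_fwd, a_upd) and np.array_equal(b_fwd, b_upd))

# A different member (or layer) gets different noise.
a_other, _ = low_rank_factors(base_seed=0, layer_idx=3, member_id=43, m=8, n=6, rank=2)
print("differs across members:", not np.array_equal(a_fwd, a_other))
```

Because only the integers need to persist between the forward pass and the update, the per-member storage is a seed rather than a pair of matrices.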
29.9 Applications and Future Directions
29.9.1 Immediate Applications
LLM post-training for reasoning. EGGROLL's most immediately practical application is replacing or supplementing GRPO/RLHF for reasoning tasks. Because it requires only a scalar fitness signal, EGGROLL can optimize for non-differentiable rewards such as exact-match correctness, code execution outcomes, or multi-step tool-use success without reward model training or KL penalty tuning. The Countdown and GSM8K results demonstrate competitive performance at 1.5B and 7B scales, though with the cross-model caveats noted in Section 29.5.3 (arXiv:2511.16652, Section 6).
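As an illustration of a non-differentiable scalar fitness, here is a minimal exact-match scorer for a Countdown-style task. The rules encoded here (integer literals only, the four basic operators, each provided number usable at most once) and the function names are assumptions for this sketch, not the evaluation protocol used in the paper.

```python
import ast
import operator
from collections import Counter

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a restricted arithmetic AST, recording every literal used."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_fitness(expression, numbers, target):
    """Return 1.0 if the expression is valid and hits the target, else 0.0."""
    used = []
    try:
        tree = ast.parse(expression, mode="eval")
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0
    if Counter(used) - Counter(numbers):  # a literal was not in the pool
        return 0.0
    return 1.0 if abs(value - target) < 1e-9 else 0.0


pool = [25, 5, 3, 4]
print(countdown_fitness("(25 - 5) * 3 + 4", pool, 64))  # valid and correct
print(countdown_fitness("25 * 3", pool, 64))            # valid but wrong value
print(countdown_fitness("64", pool, 64))                # uses a number not in the pool
```

The score is a step function of the model's output, so it has no useful gradient; an ES method only needs the returned number for each population member.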
Novel architecture exploration. EGGROLL enables training architectures that are fundamentally impractical with backpropagation: pure integer neural networks (demonstrated with EGG), lookup-table layers, discrete attention mechanisms, spiking neural networks, and neuromorphic hardware-in-the-loop systems. Any architecture that can perform a forward pass and produce an evaluable output can be trained with EGGROLL.
29.9.2 Research Directions Identified by the Authors
The paper identifies several unexplored research directions. Neurosymbolic optimization is highlighted as a key target: EGGROLL can optimize end-to-end through systems combining differentiable neural components with non-differentiable symbolic reasoners, discrete tool calls, or code execution. The authors specifically mention the ROSA architecture for RWKV-8 (which includes a discrete memory system) and LLMs with external tool calls as targets (arXiv:2511.16652, Section 7).
Discrete diffusion models are another target: these models generate text by iteratively demasking tokens, and the masking/demasking procedure is non-differentiable, making standard policy gradients technically intractable. EGGROLL's black-box fitness evaluation makes it directly applicable. Multi-agent optimization connects to the group's prior work on social deduction games, with the authors suggesting that EGGROLL could "directly optimize LLMs with multi-agent awareness, breaking the best-of-$k$ curse of RL" (arXiv:2511.16652, Section 7).
29.9.3 The "Inference IS Training" Vision
The paper's most provocative claim is that EGGROLL collapses the distinction between inference and training. Under traditional backpropagation, training costs 3–5× more than inference due to backward passes and optimizer states. With EGGROLL, the overhead is approximately 10% over pure inference on H100 hardware. This implies that any system capable of batched inference can simultaneously train with minimal additional cost: edge devices could self-improve, inference servers could continuously fine-tune, and deployment and training could become a single unified operation. While this vision is not yet realized at Transformer scale or for supervised pretraining, the EGG and RWKV results provide the first evidence that it may be achievable for specific architectures and tasks. Whether the 91% throughput ratio holds across different hardware platforms, model architectures, and population sizes remains an open empirical question.
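The relationship between "percent of inference throughput" and "overhead over inference" is a simple reciprocal, sketched below. The throughput fractions are the approximate values from the comparison table in Section 29.7.2; the function name is introduced here for illustration.

```python
def cost_multiplier(throughput_fraction: float) -> float:
    """Wall-clock cost relative to pure inference for the same amount of work."""
    if not 0.0 < throughput_fraction <= 1.0:
        raise ValueError("throughput_fraction must be in (0, 1]")
    return 1.0 / throughput_fraction


# Approximate throughput as a fraction of pure inference (Section 29.7.2).
METHODS = [
    ("EGGROLL (reported, H100)", 0.91),
    ("Backprop + Adam", 0.33),
    ("GRPO", 0.30),
    ("Standard ES", 0.01),
]

for name, fraction in METHODS:
    multiplier = cost_multiplier(fraction)
    overhead_pct = (multiplier - 1.0) * 100.0
    print(f"{name:26s} {multiplier:7.2f}x inference cost  (+{overhead_pct:.1f}%)")
```

A throughput of 91% corresponds to roughly 9.9% extra cost over inference, which is the "approximately 10%" figure quoted above, while 33% corresponds to about 3×, the lower end of the 3–5× range for backpropagation.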
29.10 Limitations and Open Questions
29.10.1 Known Limitations
| Limitation | Impact | Current Status / Mitigation |
|---|---|---|
| Sample efficiency | ES requires many fitness evaluations per parameter update | Compensated by high throughput (~91% of inference speed on H100) |
| Gradient quality at rank-1 | May miss important gradient directions in low dimensions | Increase rank $r$; consistency theorem guarantees improvement at rate $O(1/r)$ |
| No second-order information | Cannot exploit loss curvature | Larger population partially compensates; natural gradient variants possible |
| Transformer KV-cache | Growing KV-cache memory prevents massive populations | vLLM/Megatron port listed as future work; RWKV used as workaround |
| Hardware requirements | Headline results require H100 GPUs | Scales down to A100, but with less dramatic speedups |
| JAX ecosystem | Smaller user community than PyTorch | PyTorch port listed as underway |
| Supervised pretraining | Not demonstrated competitive with backprop for standard supervised learning | Not a target — EGGROLL is designed for tasks where backprop is insufficient |
| Cross-model evaluation | Reasoning results compare different base architectures (RWKV vs. Transformers) | Same-model comparison (RWKV GRPO vs. RWKV EGGROLL) shown only for GSM8K 1.5B |
29.10.2 Open Questions
Several meta-learning directions remain unexplored: adaptive rank selection (varying $r$ during training based on gradient quality), learned perturbation scale $\sigma$ per layer, population scheduling (large $N$ early, smaller late), and multi-fidelity fitness evaluation (cheap approximations early, expensive evaluation later). The paper provides a consistency theorem that is asymptotic ($d \to \infty$); it does not provide finite-sample convergence rates with dependence on problem-specific constants such as the Lipschitz constant of $F$ or the conditioning of the loss landscape. Empirical investigation of how EGGROLL's gradient quality degrades in moderate-dimensional settings (e.g., small layers with $\min(m,n) < 100$) would be valuable.
A fundamental open question is whether EGGROLL's approach extends to large-scale supervised pretraining. The current results focus on reasoning fine-tuning (where non-differentiable rewards justify the ES approach) and integer architecture pretraining (where backprop is unavailable). Whether EGGROLL can be competitive with backprop for standard differentiable objectives at pretraining scale remains undemonstrated and — as the authors acknowledge — is not the intended use case.
29.11 Summary
Key Takeaway
EGGROLL transforms evolution strategies from a computationally prohibitive theoretical curiosity into a practical training method for billion-parameter models by structuring perturbations as low-rank matrices. The resulting algorithm achieves a reported 91% of pure inference throughput on H100 hardware (a reported 100× speedup over naive ES) while preserving full-rank parameter updates through population aggregation.
Main Contribution to the Field
EGGROLL introduces rank-$r$ structured perturbations with a consistency theorem guaranteeing convergence to the standard ES gradient at rate $O(1/r)$ as parameter dimension grows. This enables population sizes of $\sim\!10^6$ for billion-parameter models, making ES cost-competitive with backprop-based methods for LLM reasoning training under the authors' reported experimental conditions. The EGG experiment further demonstrates pure int8 neural network pretraining, a capability fundamentally unavailable to gradient-based methods.
What a Researcher Should Know
EGGROLL is not a general replacement for backpropagation. Its value is in domains where backprop is impossible or insufficient: non-differentiable fitness functions, integer-only architectures, end-to-end optimization through discrete components, and training at inference speed. The key equations are: (1) the perturbation $W_{\text{perturbed}} = W + \sigma \, A_i B_i^\top$ with $A_i \in \mathbb{R}^{m \times r}$, $B_i \in \mathbb{R}^{n \times r}$, and (2) the forward-pass decomposition $y_i = xW^\top + \sigma(xB_i)A_i^\top$, which converts the ES population evaluation from batched full-rank matrix multiplications (GPU-unfriendly) into a shared standard matmul plus batched vector operations (GPU-friendly). The theoretical guarantee that low-rank perturbations aggregate into full-rank updates (Eq. 29.5) distinguishes EGGROLL from LoRA-based ES approaches and is the foundation of its expressiveness.
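Both key equations can be checked numerically on a small example. The sketch below verifies that the decomposed forward pass matches the naive per-member computation, and that a fitness-weighted sum of rank-1 perturbations has full rank. The shapes, the random stand-in fitness scores, and the plain average used for the aggregate are assumptions for illustration; the exact normalization of the update in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 1            # weight matrix is m x n, perturbations have rank r
batch, population = 3, 64
sigma = 0.1

W = rng.standard_normal((m, n))
x = rng.standard_normal((batch, n))
A = rng.standard_normal((population, m, r))  # A_i in R^{m x r}
B = rng.standard_normal((population, n, r))  # B_i in R^{n x r}

# Naive evaluation: materialize W + sigma * A_i B_i^T for every member.
naive = np.stack([x @ (W + sigma * A[i] @ B[i].T).T for i in range(population)])

# Decomposed evaluation: one shared matmul plus cheap rank-r corrections.
shared = x @ W.T                                  # (batch, m), computed once
xB = np.einsum("bn,pnr->pbr", x, B)               # x B_i for every member
correction = np.einsum("pbr,pmr->pbm", xB, A)     # (x B_i) A_i^T
decomposed = shared[None, :, :] + sigma * correction

print("forward passes match:", np.allclose(naive, decomposed))

# Aggregation: a fitness-weighted average of the rank-r perturbations.
fitness = rng.standard_normal(population)         # stand-in scores
update = np.einsum("p,pmr,pnr->mn", fitness, A, B) / population

rank_single = np.linalg.matrix_rank(A[0] @ B[0].T)
rank_update = np.linalg.matrix_rank(update)
print("rank of one perturbation:", rank_single)
print("rank of aggregated update:", rank_update, "out of", min(m, n))
```

The naive path builds one $m \times n$ matrix per member, whereas the decomposed path never forms any perturbed weight matrix; the aggregated update nonetheless spans all $\min(m, n)$ directions once the population exceeds that number.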