Introduced: 2025-11


Chapter 29

EGGROLL: Hyperscale Evolution Strategies

Part P06: Evolutionary Scaling & Efficiency

Evolution strategies (ES) have long occupied an elegant but peripheral niche in the optimization landscape: theoretically appealing for their gradient-free generality, yet computationally impractical at the scale demanded by modern deep learning. EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning — dismantles this computational barrier. Published in November 2025 by a collaboration spanning the University of Oxford, MILA, and NVIDIA (arXiv:2511.16652), EGGROLL demonstrates that structuring perturbations as low-rank matrices transforms evolution strategies from a curiosity into a billion-parameter training method that operates at 91% of pure inference throughput on H100 hardware. This chapter provides a detailed technical analysis of the algorithm, its theoretical foundations, the novel EGG integer architecture it enables, and its implications for the relationship between inference and learning.

Key Contribution

EGGROLL replaces the unstructured Gaussian perturbations of standard evolution strategies with rank-$r$ structured perturbations, achieving a reported 100× speedup over naive ES for billion-parameter models on H100 GPUs. Crucially, while individual perturbations are low-rank, the aggregated population update recovers full-rank expressiveness — making EGGROLL strictly more expressive than LoRA-based ES approaches. This enables, for the first time, population sizes of $\sim\!10^6$ for billion-parameter LLM training at near-inference cost, and demonstrates pure integer (int8) neural network pretraining without any floating-point arithmetic or explicit activation functions.

29.1 Overview and Motivation

29.1.1 The Computational Barrier of Classical ES

Standard evolution strategies perturb each parameter of a model independently. For a weight matrix $W \in \mathbb{R}^{m \times n}$, a perturbation $\varepsilon \sim \mathcal{N}(0, I_{mn})$ requires storing $m \times n$ random values per population member and computing a batched matrix multiplication $x(W + \sigma\varepsilon)^\top$ for each member. At the scale of modern language models (billions of parameters, evaluated over millions of population members) this becomes prohibitively expensive. The batched multiplications exhibit low arithmetic intensity on GPU hardware: each member's perturbed weights are read from memory and reused only a few times, so cache utilization is poor and throughput stays far below peak FLOPS. For a 7B-parameter model with a population of $10^6$, naive ES would need to hold $7 \times 10^{15}$ perturbation values, which is about 7 petabytes at one byte per value and 14 petabytes in float16, far beyond the memory of any GPU node.
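
The storage estimate can be checked with simple arithmetic. The sketch below assumes one stored noise value per parameter per population member; the three byte widths are illustrative choices, not figures taken from the paper.

```python
# Back-of-the-envelope perturbation storage for naive ES.
# Assumption: one stored noise value per parameter per population member.
PARAMS = 7_000_000_000      # 7B-parameter model
POPULATION = 1_000_000      # 10^6 population members
PETABYTE = 10**15           # decimal petabyte


def naive_es_storage_pb(bytes_per_value: int) -> float:
    """Petabytes needed to hold every member's full perturbation."""
    return PARAMS * POPULATION * bytes_per_value / PETABYTE


storage = {
    name: naive_es_storage_pb(width)
    for name, width in [("int8", 1), ("float16", 2), ("float32", 4)]
}
for name, petabytes in storage.items():
    print(f"{name}: {petabytes:.0f} PB")
```

Whichever width is assumed, the total is several orders of magnitude larger than the 80 GB of a single H100.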

29.1.2 Why Evolution Strategies Still Matter

Despite this computational disadvantage, evolution strategies possess properties that backpropagation-based training fundamentally cannot provide. ES requires only forward passes and scalar fitness evaluations — no differentiable loss function, no activation storage for backward passes, no optimizer state. This generality enables optimization of non-differentiable objectives, discrete architectures, integer-only computation, black-box fitness functions, and end-to-end optimization through non-differentiable components such as tool calls, code execution, or hardware-in-the-loop evaluation. The question EGGROLL addresses is not whether ES is theoretically valuable, but whether it can be made computationally viable at the scales that matter.

29.1.3 Research Context and Lineage

EGGROLL builds on three lines of prior work. First, OpenAI's Evolution Strategies (Salimans et al., 2017) demonstrated ES as a scalable alternative to reinforcement learning for small neural networks but could not extend to billion-parameter models. Second, Noise-Reuse ES (Vicol et al., 2023) introduced the technique of reusing perturbations across multiple token positions in sequence processing, reducing memory from $O(T \times d)$ to $O(d)$ per population member, where $T$ is sequence length and $d$ is parameter count. Third, LoRA (Hu et al., 2022) demonstrated that low-rank matrix decompositions are effective for parameter-efficient adaptation — EGGROLL borrows the structural insight but applies it to perturbations rather than adapters, a critical distinction explored in Section 29.3. The work was developed concurrently with ES-LLM (arXiv:2509.24372), which applies ES to LLM training with small populations ($\sim$10 members); EGGROLL takes the opposite approach, enabling massive populations ($\sim$10$^6$) through computational efficiency.

The work originates from the Foundations of Reinforcement Learning (FORL) group at Oxford, led by Jakob Foerster, with co-leads Bidipta Sarkar, Mattie Fellows, and Juan Agustin Duque. The collaboration with MILA (Aaron Courville's group) and NVIDIA brings scalability expertise. Sarkar's prior work on Social Deduction LLMs using RWKV for multi-agent game-playing directly motivated the choice of RWKV as the primary LLM architecture for experiments (arXiv:2511.16652, Section 1).

29.2 Architecture

29.2.1 System Overview

EGGROLL's architecture comprises four stages executed in sequence for each training step: (1) a population manager that generates deterministic rank-$r$ perturbations from RNG seeds, (2) a batched forward pass that decomposes the perturbed computation into a shared standard matrix multiplication plus fast per-member rank-$r$ operations, (3) fitness evaluation of each population member's output, and (4) a fused gradient estimation and parameter update that aggregates low-rank perturbations weighted by centered fitness into a full-rank update. The entire system is implemented in JAX, leveraging jax.vmap for population parallelism and XLA compilation for kernel fusion.

29.2.2 Parallelism Model

EGGROLL's memory model is the key to its scalability. Model weights are stored once in GPU memory (shared across all population members). Each population member is identified by a single integer thread_id; its perturbation is regenerated on-the-fly from a shared base RNG key via jax.random.fold_in(base_key, thread_id), costing $O(1)$ storage per member rather than $O(mn)$ for naive ES. On a single H100 GPU with 80 GB memory, a 7B model in float16 occupies approximately 14 GB, leaving headroom to evaluate thousands of population members per pass; accumulating the update over many passes extends the effective population into the millions.
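
The seed-based regeneration described above can be illustrated with a small JAX sketch. The helper name `member_noise` and the toy shapes are hypothetical; only `jax.random.fold_in`, `jax.random.split`, and `jax.random.normal` are real JAX calls, and the repository may derive its keys differently.

```python
import jax
import jax.numpy as jnp


def member_noise(base_key, thread_id, m, n, rank=1):
    """Regenerate one member's rank-r factors from (base_key, thread_id)."""
    key = jax.random.fold_in(base_key, thread_id)
    key_a, key_b = jax.random.split(key)
    A = jax.random.normal(key_a, (m, rank))   # output-side factor
    B = jax.random.normal(key_b, (n, rank))   # input-side factor
    return A, B


base_key = jax.random.PRNGKey(0)
A1, B1 = member_noise(base_key, 42, m=8, n=4)
A2, B2 = member_noise(base_key, 42, m=8, n=4)   # same id: identical noise
A3, _ = member_noise(base_key, 43, m=8, n=4)    # different id: different noise

same_a = bool(jnp.array_equal(A1, A2))
same_b = bool(jnp.array_equal(B1, B2))
differs = not bool(jnp.array_equal(A1, A3))
print(same_a, same_b, differs)
```

Because the factors are a pure function of the key and the integer id, nothing beyond the id itself has to be kept per member.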

For multi-GPU training, EGGROLL partitions the population across devices. Each GPU evaluates its share of members independently — no inter-GPU communication is needed during forward passes because perturbations are deterministically recomputable from the shared seed. Only scalar fitness values and accumulated gradient contributions are communicated via AllReduce. This communication pattern is dramatically lighter than the activation/gradient exchange required by data-parallel backpropagation (arXiv:2511.16652, Section 3).
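
A single-process sketch can mimic this partitioning without any multi-device setup. Everything here is a toy assumption: the sizes, the random stand-in fitness values, and the use of a plain Python `sum` in place of AllReduce. It only shows that per-shard partial sums rebuilt from seeds add up to the single-device result.

```python
import jax
import jax.numpy as jnp

M, N_IN, POP, DEVICES = 6, 5, 32, 4   # toy sizes (hypothetical)
base_key = jax.random.PRNGKey(0)
# Stand-in fitness values; a real run would obtain these from forward passes.
fitness = jax.random.normal(jax.random.PRNGKey(1), (POP,))
centered = fitness - fitness.mean()


def contribution(thread_id):
    """Fitness-weighted rank-1 term for one member, rebuilt from its seed."""
    key = jax.random.fold_in(base_key, thread_id)
    noise = jax.random.normal(key, (M + N_IN, 1))
    A, B = noise[:M], noise[M:]
    return centered[thread_id] * (A @ B.T)


def partial_sum(thread_ids):
    """What one device accumulates locally for its share of the population."""
    return jax.vmap(contribution)(thread_ids).sum(axis=0)


all_ids = jnp.arange(POP)
# Each simulated device owns a contiguous slice of thread ids.
shards = all_ids.reshape(DEVICES, POP // DEVICES)
# Adding the per-device partial sums plays the role of AllReduce.
reduced = sum(partial_sum(shard) for shard in shards)
single_device = partial_sum(all_ids)
max_diff = float(jnp.max(jnp.abs(reduced - single_device)))
print(max_diff)
```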

29.2.3 Implementation Stack

The implementation is organized across three public repositories. The main library, ESHyperscale/HyperscaleES, contains the JAX-based EGGROLL algorithm, model wrappers, task definitions, and training scripts. The single-file ESHyperscale/nano-egg repository provides a minimal implementation of the EGG integer training experiment. The RWKV-7 model port to JAX is at bsarkar321/jaxrwkv, with the earliest EGGROLL prototype committed on August 13, 2025. All three repositories are open-source.

Table 29.1: Implementation technology choices and rationale (arXiv:2511.16652, Section 3)
Component	Technology	Rationale
Core algorithm	Python / JAX	`jax.vmap` for automatic population vectorization; XLA compilation for kernel fusion
RNG system	JAX splittable PRNG	`fold_in` enables deterministic, communication-free perturbation generation
LLM architecture	RWKV-7 (JAX port)	Fixed-size state (no growing KV-cache); predictable memory per population member
Integer operations	JAX `jnp.int8`	Native int8 tensor operations for EGG experiments
Multi-GPU	`jax.pmap`	Data-parallel population partitioning with minimal communication

The choice of JAX over PyTorch is deliberate: JAX's vmap is more mature than PyTorch's counterpart (torch.func, formerly functorch), its PRNG system provides fold_in for deterministic per-member key derivation (which has no direct PyTorch equivalent), and XLA compilation fuses the EGGROLL computation pattern more effectively. The authors note that a PyTorch port is underway, along with Transformer support via vLLM/Megatron integration to address the KV-cache memory challenge (arXiv:2511.16652, Section 7).

Repository Structure and Implementation Status

The following table summarizes the approximate organization of the HyperscaleES repository as reported in the project documentation and inferred from repository browsing. Exact module names and file paths may vary across commits; readers should consult the repository directly for authoritative structure.

Table 29.1b: HyperscaleES repository organization (approximate, based on public repository inspection)
Directory / Module	Purpose	Key Primitives Used
Core algorithm module	EGGROLL perturbation generation and gradient estimation	`jax.random.fold_in`, `jax.random.normal`
Model wrappers	RWKV-7, EGG (int8 GRU), and MLP model definitions	`jax.vmap`, `jnp.int8`
Task definitions	Countdown, GSM8K, character-level LM, RL environments	Task-specific fitness functions
Config files	YAML configurations for EGG/MiniPile, RWKV/Countdown, RWKV/GSM8K	—
Training scripts	Main training entry point, throughput benchmarking, evaluation	`jax.pmap` for multi-GPU

The jaxrwkv repository provides the JAX port of RWKV-7, wrapping the model so that EGGROLL can perturb the time-mixing and channel-mixing weight matrices in each RWKV block. The EGGROLL perturbation logic applies jax.vmap over the population dimension (thread IDs), while multi-GPU scaling uses jax.pmap to partition population members across devices.

Table 29.1c: Feature implementation status (as described in arXiv:2511.16652, Sections 3 and 7)
Feature	Status	Source
EGGROLL core algorithm (JAX)	Implemented and released	HyperscaleES repo
EGG int8 pretraining	Implemented and released	nano-egg repo
RWKV-7 JAX port + EGGROLL integration	Implemented and released	jaxrwkv repo
Multi-GPU via `jax.pmap`	Implemented	Paper Section 3
Noise-Reuse ES for sequences	Implemented	Paper Section 3.3
PyTorch port	Listed as future work	Paper Section 7
Transformer support (vLLM/Megatron)	Listed as future work	Paper Section 7
Discrete diffusion model training	Listed as future direction	Paper Section 7

29.3 Core Algorithms

Notation and Conventions

The following conventions are used consistently throughout this section. All vectors are row vectors unless otherwise stated.

Symbol	Definition	Dimensions
$W$	Weight matrix for a single layer	$\mathbb{R}^{m \times n}$ ($m$ = output dim, $n$ = input dim)
$x$	Input activation (row vector)	$\mathbb{R}^{1 \times n}$
$y$	Output activation	$\mathbb{R}^{1 \times m}$
$A_i$	Per-member perturbation factor (output side)	$\mathbb{R}^{m \times r}$
$B_i$	Per-member perturbation factor (input side)	$\mathbb{R}^{n \times r}$
$r$	Perturbation rank (typically $r = 1$)	Scalar
$\sigma$	Perturbation scale (noise standard deviation)	Scalar $> 0$
$N$	Population size	Scalar
$F$	Fitness function	$\mathbb{R}^d \to \mathbb{R}$
$F_i$	Raw fitness of population member $i$: $F(\theta_i)$	Scalar
$\bar{F}$	Mean fitness: $(1/N)\sum_{i=1}^{N} F_i$	Scalar
$\tilde{F}_i$	Centered fitness: $F_i - \bar{F}$	Scalar

The forward pass convention is $y = xW^\top$ (row-vector input, transposed weight), consistent with standard deep learning frameworks where nn.Linear stores weights as $(\text{out\_features}, \text{in\_features})$.

29.3.1 Standard ES Gradient Estimation

In standard evolution strategies, the gradient of a smoothed objective is estimated by perturbing parameters with isotropic Gaussian noise. For parameters $\theta \in \mathbb{R}^d$ and fitness function $F$, the ES gradient estimator is:

$$\nabla_\theta \, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I_d)}\big[F(\theta + \sigma\varepsilon)\big] = \frac{1}{\sigma}\,\mathbb{E}\big[F(\theta + \sigma\varepsilon) \cdot \varepsilon\big]$$

where $\theta$ is the current parameter vector, $\sigma > 0$ is the perturbation scale (noise standard deviation), $\varepsilon \sim \mathcal{N}(0, I_d)$ is an isotropic Gaussian perturbation vector of the same dimensionality as $\theta$, and $F : \mathbb{R}^d \to \mathbb{R}$ is the scalar fitness function. This identity follows from the log-derivative trick applied to the Gaussian density. In practice, the expectation is approximated by a finite population of $N$ members, and antithetic sampling (mirrored perturbations $\pm\varepsilon$) is typically used to reduce variance.
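
The estimator can be checked numerically on a problem whose gradient is known. The sketch below uses a toy quadratic fitness, an assumption made purely for illustration, together with antithetic pairs; the function names are hypothetical.

```python
import jax
import jax.numpy as jnp


def fitness(theta, target):
    """Toy fitness (an illustrative assumption): negative squared distance."""
    return -jnp.sum((theta - target) ** 2)


def es_gradient(key, theta, target, sigma=0.1, num_pairs=20_000):
    """Antithetic Monte Carlo estimate of the smoothed-objective gradient."""
    eps = jax.random.normal(key, (num_pairs, theta.shape[0]))
    f_plus = jax.vmap(lambda e: fitness(theta + sigma * e, target))(eps)
    f_minus = jax.vmap(lambda e: fitness(theta - sigma * e, target))(eps)
    # Each mirrored pair contributes (F(+) - F(-)) / (2 * sigma) * eps.
    return ((f_plus - f_minus)[:, None] * eps).mean(axis=0) / (2 * sigma)


theta = jnp.arange(10, dtype=jnp.float32)
target = jnp.ones(10)
estimate = es_gradient(jax.random.PRNGKey(0), theta, target)
exact = -2.0 * (theta - target)   # analytic gradient of the toy fitness
cosine = float(
    estimate @ exact / (jnp.linalg.norm(estimate) * jnp.linalg.norm(exact))
)
print(f"cosine similarity to exact gradient: {cosine:.4f}")
```

For this quadratic the smoothed objective differs from $F$ only by a constant, so the exact gradient is the right reference; for general $F$ the estimator targets the gradient of the smoothed objective instead.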

For a weight matrix $W \in \mathbb{R}^{m \times n}$, this requires sampling a full $m \times n$ perturbation matrix per population member and computing the batched multiplication $x(W + \sigma\varepsilon)^\top$ — the operation that becomes computationally prohibitive at scale.

29.3.2 EGGROLL: Low-Rank Structured Perturbations

EGGROLL replaces the unstructured perturbation $\varepsilon \in \mathbb{R}^{m \times n}$ with a rank-$r$ structured perturbation. For weight matrix $W \in \mathbb{R}^{m \times n}$, the perturbed weight is:

$$W_{\text{perturbed}} = W + \sigma \cdot A_i \, B_i^\top, \quad A_i \in \mathbb{R}^{m \times r}, \; B_i \in \mathbb{R}^{n \times r}, \; (A_i)_{jk}, (B_i)_{jk} \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1) \tag{29.1}$$

where $r$ is the perturbation rank (typically $r = 1$), and $A_i$, $B_i$ are independently sampled Gaussian matrices for the $i$-th population member. The product $A_i B_i^\top \in \mathbb{R}^{m \times n}$ is a rank-$r$ matrix with the same dimensions as $W$, requiring only $O((m + n)r)$ random samples per layer per population member instead of $O(mn)$.

The computational advantage emerges from decomposing the perturbed forward pass. Starting from Eq. (29.1), the output for population member $i$ with input $x \in \mathbb{R}^{1 \times n}$ is derived as follows:

$$y_i = x \, W_{\text{perturbed}}^\top = x\bigl(W + \sigma \, A_i B_i^\top\bigr)^\top \tag{29.2}$$

Distributing the transpose and expanding:

$$y_i = x \, W^\top + \sigma \, x \, (A_i B_i^\top)^\top = x \, W^\top + \sigma \, x \, B_i \, A_i^\top \tag{29.3}$$

where we used the identity $(A_i B_i^\top)^\top = B_i A_i^\top$. This yields the EGGROLL forward-pass decomposition:

$$\boxed{y_i = \underbrace{x \, W^\top}_{\text{shared (compute once)}} + \;\sigma \cdot \underbrace{(x \, B_i)}_{\in\, \mathbb{R}^{1 \times r}} \underbrace{A_i^\top}_{\in\, \mathbb{R}^{r \times m}}} \tag{29.4}$$

This decomposition is the core of EGGROLL's speedup. The term $xW^\top$ is a standard (non-batched) matrix multiplication, identical for all population members and computed once. The per-member computation consists of two steps: first, $xB_i \in \mathbb{R}^{1 \times r}$ projects the input into a low-dimensional space (a scalar when $r = 1$); second, $(xB_i)A_i^\top \in \mathbb{R}^{1 \times m}$ maps back to the output dimension. When $r = 1$, the second step is a scalar-vector multiplication, so the per-member work is $O((m + n)r)$ rather than the $O(mn)$ of a full perturbed matmul. The bottleneck operation therefore changes from a batched full-rank matrix multiplication (low arithmetic intensity, poor GPU utilization) to one standard matmul (high arithmetic intensity, near-peak utilization) plus cheap batched vector operations.
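
The boxed decomposition is an exact algebraic identity, which a short numerical check makes concrete. The shapes, seed, and $\sigma$ below are arbitrary toy values chosen for illustration.

```python
import jax
import jax.numpy as jnp

m, n, r, sigma = 16, 12, 2, 0.05   # toy sizes (hypothetical)
k_w, k_x, k_a, k_b = jax.random.split(jax.random.PRNGKey(0), 4)
W = jax.random.normal(k_w, (m, n))
x = jax.random.normal(k_x, (1, n))
A = jax.random.normal(k_a, (m, r))
B = jax.random.normal(k_b, (n, r))

# Naive path: materialize the full m x n perturbed weight matrix.
y_naive = x @ (W + sigma * (A @ B.T)).T
# Decomposed path: shared matmul plus two thin products,
# never forming an m x n perturbation.
y_decomposed = x @ W.T + sigma * ((x @ B) @ A.T)

max_abs_diff = float(jnp.max(jnp.abs(y_naive - y_decomposed)))
print(y_decomposed.shape, max_abs_diff)
```

The two paths agree up to floating-point rounding, while the decomposed path touches only $(m + n)r$ perturbation values per member.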

29.3.3 Full-Rank Recovery via Population Aggregation

A critical theoretical concern is whether rank-$r$ perturbations limit the expressiveness of the parameter update. EGGROLL's key theoretical result demonstrates that this is not the case. Applying the ES gradient identity from Section 29.3.1 to the structured perturbation $A_i B_i^\top$, the aggregated update across the population is:

$$\Delta W = \frac{1}{N\sigma} \sum_{i=1}^{N} \tilde{F}_i \cdot A_i \, B_i^\top \tag{29.5}$$

where $\tilde{F}_i = F_i - \bar{F}$ is the centered fitness of the $i$-th population member (centering by the population mean $\bar{F}$ is a standard variance-reduction technique in ES that does not bias the gradient estimate). Each term $A_i B_i^\top \in \mathbb{R}^{m \times n}$ has rank at most $r$, but the sum of $N$ such rank-$r$ matrices has rank at most $\min(Nr, \min(m, n))$. When $Nr \geq \min(m, n)$, the update $\Delta W$ can be full-rank. For $r = 1$ and a population of $N = 10^6$ with typical hidden dimensions of $m, n \leq 10^4$, this condition is satisfied by a wide margin — the update has access to the full $\min(m,n)$-dimensional gradient space.
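
The rank bound can be observed directly on a small example. The sketch below sums fitness-weighted rank-1 terms for several population sizes and counts the significant singular values; the layer shape, the random stand-in weights, and the relative threshold of $10^{-5}$ are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

m, n = 12, 8   # toy layer shape (hypothetical); min(m, n) = 8


def aggregated_rank(num_members, key=jax.random.PRNGKey(0)):
    """Numerical rank of a weighted sum of `num_members` rank-1 perturbations."""
    k_a, k_b, k_f = jax.random.split(key, 3)
    A = jax.random.normal(k_a, (num_members, m, 1))
    B = jax.random.normal(k_b, (num_members, n, 1))
    # Stand-in for centered fitness values.
    weights = jax.random.normal(k_f, (num_members,))
    update = jnp.einsum("i,imr,inr->mn", weights, A, B)
    singular_values = jnp.linalg.svd(update, compute_uv=False)
    # Count singular values that are non-negligible relative to the largest.
    return int((singular_values > 1e-5 * singular_values[0]).sum())


ranks = {N: aggregated_rank(N) for N in (1, 3, 8, 100)}
print(ranks)
```

The rank grows with the number of members until it saturates at $\min(m, n)$, matching the bound $\min(Nr, \min(m, n))$ for $r = 1$.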

The paper proves a consistency theorem (arXiv:2511.16652, Section 4 and Appendix): under regularity conditions on the fitness function $F$ (the precise assumptions, including smoothness requirements and moment bounds, are detailed in the paper's appendix), as the parameter dimension $d \to \infty$ with fixed perturbation rank $r$, the EGGROLL gradient estimate converges to the standard ES gradient estimate. The approximation error between the two estimators decreases as $O(1/r)$. Informally, this result relies on the observation that in high-dimensional parameter spaces, the fitness function is well-approximated by a linear function in a neighborhood of the current parameters, so even rank-1 perturbations capture the dominant gradient direction.

One informal way to understand this convergence is by loose analogy to random projection methods: just as random low-dimensional projections can approximately preserve geometric structure in high dimensions (cf. the Johnson-Lindenstrauss lemma), random low-rank perturbations can approximately capture gradient information. However, this analogy is illustrative rather than formal — the consistency theorem rests on the specific structure of the ES gradient estimator and the properties of Gaussian random matrices, not on distance-preservation arguments. The reader is referred to the paper's appendix for the complete proof.

29.3.4 Critical Distinction: EGGROLL vs. LoRA + ES

It is important to distinguish EGGROLL from the simpler approach of applying ES to optimize LoRA adapters. In LoRA-based ES, the learnable parameters are the low-rank factors themselves, so the cumulative parameter update is permanently constrained to low rank:

$$\text{LoRA + ES:} \quad W' = W + A \, B^\top \quad (\text{always rank-}r) \tag{29.6}$$

$$\text{EGGROLL:} \quad W' = W + \frac{1}{N\sigma}\sum_{i=1}^{N} \tilde{F}_i \cdot A_i \, B_i^\top \quad (\text{rank up to } \min(Nr, \min(m,n))) \tag{29.7}$$

LoRA + ES restricts the model to a low-rank subspace of weight updates. EGGROLL uses low-rank perturbations as a computational device but accumulates them into full-rank parameter updates. This distinction is critical for pretraining, where full-rank updates are necessary to traverse the loss landscape effectively, versus fine-tuning, where low-rank adaptation may suffice. In EGGROLL, $A_i$ and $B_i$ are freshly sampled each step and never stored — they are the measurement instrument, not the learned quantity.

29.3.5 Arithmetic Intensity Analysis

The speedup is grounded in hardware arithmetic intensity: the ratio of floating-point operations (FLOPs) performed to bytes moved to and from memory. For a standard matrix multiplication of an $m \times k$ matrix by a $k \times n$ matrix, arithmetic intensity grows in proportion to the matrix dimension (on the order of $k$ for square matrices), which is high and well suited to GPUs. Naive ES cannot exploit this: each of the $N$ population members multiplies its own small batch of activations by its own perturbed weight matrix, so every weight is loaded from memory and used only a few times. The result is either $N$ separate kernel launches or a single large batched kernel with poor memory locality and low intensity.

EGGROLL's decomposition (Eq. 29.4) produces three operations with distinct intensity profiles: (1) one shared standard matmul ($x W^\top$, intensity on the order of the hidden dimension), (2) one batched thin matmul with $B_i \in \mathbb{R}^{n \times r}$ ($x B_i$, intensity on the order of $r$), and (3) one batched outer product ($(x B_i) A_i^\top$, intensity $\approx r/3$, i.e. about $1/3$ for $r = 1$). The shared matmul dominates runtime and achieves near-peak GPU utilization. The per-member operations have low intensity but involve so little work that they add only a small overhead, and they vectorize cleanly via jax.vmap. The result: EGGROLL achieves 91% of pure batch inference throughput on H100 hardware, versus approximately 1% for naive ES at large population sizes (arXiv:2511.16652, Figure 2). These throughput figures are specific to the H100 GPU and the authors' JAX implementation; different hardware or frameworks would yield different absolute numbers, though the relative advantage of the decomposition is expected to carry over to other accelerators.
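
A rough roofline-style calculation shows where the gap comes from. The sketch below uses an idealized counting model (an assumption: every operand is read once, the result is written once, and values are 2 bytes wide) with a hypothetical hidden size of 4096. The exact constants depend on how bytes are counted, so they need not match the approximate figures quoted above.

```python
def matmul_intensity(rows, inner, cols, bytes_per_value=2):
    """FLOPs per byte for a (rows x inner) @ (inner x cols) product.

    Idealized model (an assumption): each operand is read once and the
    result is written once, with no reuse across separate products.
    """
    flops = 2 * rows * inner * cols
    bytes_moved = bytes_per_value * (rows * inner + inner * cols + rows * cols)
    return flops / bytes_moved


d, batch = 4096, 4096   # hypothetical hidden size and total batch size

shared = matmul_intensity(batch, d, d)      # x W^T, shared by all members
naive_member = matmul_intensity(1, d, d)    # one member with its own full matrix
low_rank_in = matmul_intensity(1, d, 1)     # x B_i with r = 1
low_rank_out = matmul_intensity(1, 1, d)    # (x B_i) A_i^T with r = 1

# Per-member work: full perturbed matmul versus the two thin products.
naive_flops = 2 * d * d
low_rank_flops = 2 * d + 2 * d

print(f"shared matmul      : {shared:8.1f} FLOPs/byte")
print(f"naive ES per member: {naive_member:8.2f} FLOPs/byte")
print(f"low-rank products  : {low_rank_in:8.2f} / {low_rank_out:.2f} FLOPs/byte")
print(f"per-member work saved: {naive_flops // low_rank_flops}x")
```

Under this model the thin products are just as memory-bound as naive ES, but they perform thousands of times less work per member, which is why the high-intensity shared matmul ends up dominating the runtime.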

29.3.6 Algorithm Pseudocode

The following pseudocode is author-written to illustrate the algorithm described in arXiv:2511.16652. It uses JAX primitives that are central to the actual implementation, but the function names, signatures, and exact code organization are simplified for pedagogical clarity. The actual repository (ESHyperscale/HyperscaleES) should be consulted for production-level implementation details.

# EGGROLL core algorithm — author-written pseudocode illustrating
# the approach described in arXiv:2511.16652.
# Actual repository code may differ in naming, structure, and optimization.
import jax
import jax.numpy as jnp

def generate_perturbation(base_key, thread_id, shape, rank=1):
    """Generate rank-r perturbation factors from a single RNG seed.
    
    No storage required — perturbation is recomputable from (base_key, thread_id).
    Uses jax.random.fold_in for deterministic, communication-free key derivation.
    
    Returns:
        A: (m, r) output-side perturbation factor
        B: (n, r) input-side perturbation factor
    such that A @ B.T is an (m, n) rank-r perturbation matrix.
    """
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    # Sample (m + n) * r Gaussian values and split into A and B
    params = jax.random.normal(key, (m + n, rank))
    A = params[:m]   # m x r (output-side factor)
    B = params[m:]   # n x r (input-side factor)
    return A, B

def eggroll_linear(base_key, sigma, W, x, thread_id, rank=1):
    """EGGROLL-perturbed linear layer: shared matmul + per-member rank-r perturbation.
    
    Implements Eq. (29.4): y = x W^T + sigma * (x B) A^T
    where A B^T is the rank-r perturbation added to W.
    """
    A, B = generate_perturbation(base_key, thread_id, W.shape, rank)
    
    # Shared computation (computed once, same for all population members)
    y_base = x @ W.T                   # (1, n) @ (n, m) → (1, m)
    
    # Per-member perturbation (batched; only O((m + n) * r) work per member)
    y_perturb = sigma * (x @ B) @ A.T  # (1, n)@(n, r) → (1, r); (1, r)@(r, m) → (1, m)
    
    return y_base + y_perturb

# Vectorize over population dimension using jax.vmap
# in_axes: base_key shared (None), sigma shared (None), W shared (None),
#          x batched (0), thread_id batched (0)
batched_eggroll = jax.vmap(
    eggroll_linear, 
    in_axes=(None, None, None, 0, 0)
)

# EGGROLL gradient estimation and parameter update
# Author-written pseudocode illustrating the fused update pattern
# described in arXiv:2511.16652. See HyperscaleES repo for actual implementation.

def eggroll_update(params, fitnesses, base_key, sigma, lr, rank=1):
    """Fuse gradient computation and parameter update.
    
    Perturbations are regenerated on-the-fly from seeds — never stored.
    The accumulated gradient is full-rank when N*r >= min(m, n).
    
    Uses centered fitnesses (Eq. 29.5) to reduce variance without
    biasing the gradient estimate.
    """
    # Center fitnesses to reduce variance (standard ES technique)
    centered_fitnesses = fitnesses - fitnesses.mean()
    N = len(fitnesses)
    
    updated_params = {}
    for layer_name, W in params.items():
        # Accumulate gradient online, regenerating perturbations from seeds
        gradient = jnp.zeros_like(W)         # (m, n) accumulator
        for i in range(N):
            A_i, B_i = generate_perturbation(base_key, i, W.shape, rank)
            # Rank-r contribution weighted by centered fitness
            # A_i @ B_i.T has shape (m, r) @ (r, n) = (m, n), matching W
            gradient += centered_fitnesses[i] * (A_i @ B_i.T)
        
        # Full-rank update: rank(gradient) <= min(N*r, min(m, n))
        updated_params[layer_name] = W + lr * gradient / (N * sigma)
    
    return updated_params
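
To see the forward decomposition and the aggregated update working together, the following self-contained sketch trains a single linear layer on a toy regression problem. It is not the repository's training loop: the task, the hyperparameters, and the helper names are assumptions chosen so the example runs in seconds, and the update is vectorized with `jax.vmap` and `jnp.einsum` rather than accumulated online.

```python
import jax
import jax.numpy as jnp

# Toy settings (assumptions chosen for a quick demonstration).
M, N_IN, POP, SIGMA, LR, STEPS = 4, 6, 256, 0.05, 0.5, 100


def factors(step_key, thread_id):
    """Rank-1 factors for one member, regenerated from its seed."""
    key = jax.random.fold_in(step_key, thread_id)
    noise = jax.random.normal(key, (M + N_IN, 1))
    return noise[:M], noise[M:]


def member_fitness(W, X, Y, step_key, thread_id):
    """Negative MSE of one perturbed member, using the decomposed forward pass."""
    A, B = factors(step_key, thread_id)
    pred = X @ W.T + SIGMA * ((X @ B) @ A.T)
    return -jnp.mean((pred - Y) ** 2)


@jax.jit
def train_step(W, X, Y, step_key):
    ids = jnp.arange(POP)
    fit = jax.vmap(member_fitness, in_axes=(None, None, None, None, 0))(
        W, X, Y, step_key, ids
    )
    centered = fit - fit.mean()
    # Regenerate the same factors from the same seeds for the update.
    A, B = jax.vmap(factors, in_axes=(None, 0))(step_key, ids)
    grad = jnp.einsum("i,imr,inr->mn", centered, A, B) / (POP * SIGMA)
    return W + LR * grad   # ascent on fitness


k_data, k_true, k_train = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(k_data, (32, N_IN))
W_true = jax.random.normal(k_true, (M, N_IN))
Y = X @ W_true.T


def loss(W):
    return float(jnp.mean((X @ W.T - Y) ** 2))


W = jnp.zeros((M, N_IN))
initial_loss = loss(W)
for step in range(STEPS):
    W = train_step(W, X, Y, jax.random.fold_in(k_train, step))
final_loss = loss(W)
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

A fresh key is folded in at every step so that perturbations are resampled, and the same key is used for the forward pass and the update so that the regenerated factors match.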

29.3.7 Noise-Reuse for Sequence Processing

For language modeling tasks involving long sequences, EGGROLL incorporates Noise-Reuse ES (Vicol et al., 2023). Standard ES on sequences would require a new perturbation at each token position, incurring $O(T \times d)$ memory per population member where $T$ is the sequence length. Noise-Reuse ES reuses the same perturbation across multiple token positions and applies intermediate parameter updates within a single sequence. This reduces per-member memory to $O(d)$, independent of sequence length. The paper reports using this technique for all language modeling experiments (arXiv:2511.16652, Section 3.3).

29.4 The EGG Architecture: Integer Neural Network Training

29.4.1 Architecture Design

EGGROLL enables a previously impossible experiment: training a nonlinear recurrent neural network entirely in integer arithmetic, with no floating-point operations anywhere in the forward pass. The architecture, called EGG (Evolved Generative GRU), is a modified minGRU with the following properties: all weights are int8, all computations use int8 matrix multiplication with int32 accumulation followed by a cast back to int8, and no explicit activation functions are used (arXiv:2511.16652, Section 5). The model has dimension $D = 256$ and $L = 6$ layers.

29.4.2 Integer Overflow as Nonlinearity

The most conceptually striking mechanism in the paper is the use of integer arithmetic overflow as a source of nonlinearity. In standard float32 arithmetic, multiplication is linear: $3.0 \times 100.0 = 300.0$. In int8 arithmetic, values exceeding the range $[-128, 127]$ either saturate (clamp to the boundary) or wrap (modular arithmetic). This means that the chain int8 matmul → int32 accumulate → cast to int8 introduces an implicit nonlinear transformation without any explicit activation function:

$$\text{int8\_cast}(x) = \text{clip}(x, -128, 127) \quad \text{(saturation mode)} \tag{29.8}$$

where $x$ is the int32 accumulated result. This creates a piecewise-linear saturation behavior at the int8 boundaries, providing representational capacity that explicit activation functions normally supply. The paper cites prior OpenAI work showing that floating-point rounding can induce nonlinear computation in deep linear networks; EGGROLL extends this idea to the much more pronounced nonlinearity of int8 quantization (arXiv:2511.16652, Section 5.1). Whether the JAX implementation uses saturation semantics or wrap-around (modular) arithmetic depends on the specific int8 cast operation; both introduce nonlinearity, but with different functional forms.
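
The two narrowing behaviours can be compared on a handful of accumulator values. In the sketch below both are written out explicitly in int32 arithmetic so that the result does not depend on how a particular backend narrows integers; the function names are hypothetical and nothing here is taken from the nano-egg code.

```python
import jax.numpy as jnp


def saturate_to_int8(acc_int32):
    """Clamp an int32 accumulator to [-128, 127], then narrow to int8."""
    return jnp.clip(acc_int32, -128, 127).astype(jnp.int8)


def wrap_to_int8(acc_int32):
    """Two's-complement wrap-around, computed explicitly in int32."""
    return (((acc_int32 + 128) % 256) - 128).astype(jnp.int8)


acc = jnp.array([100, 127, 128, 300, -129, -300], dtype=jnp.int32)
saturated = saturate_to_int8(acc).tolist()   # [100, 127, 127, 127, -128, -128]
wrapped = wrap_to_int8(acc).tolist()         # [100, 127, -128, 44, 127, -44]

# Linearity fails once the range is exceeded: f(2 * 100) != 2 * f(100).
doubled_then_cast = int(saturate_to_int8(jnp.int32(200)))        # 127
cast_then_doubled = 2 * int(saturate_to_int8(jnp.int32(100)))    # 200
print(saturated, wrapped, doubled_then_cast, cast_then_doubled)
```

Saturation gives a monotone clamp similar in shape to a hard tanh, whereas wrap-around is periodic; either way the map from accumulator to output is no longer linear.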

29.4.3 Integer Parameter Update Rule

Because EGG operates entirely in int8, the parameter update rule is adapted accordingly. The following pseudocode illustrates the integer update logic described in the paper; the actual implementation in the nano-egg repository may differ in detail.

# EGG integer update rule: author-written pseudocode illustrating
# the approach described in arXiv:2511.16652, Section 5.
# See ESHyperscale/nano-egg repository for the actual implementation.
import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    """Integer-compatible parameter update for EGG.
    
    No learning rate: step size is always ±1 in int8 space.
    No momentum or optimizer state.
    Threshold prevents noise-dominated updates.
    """
    # Step by ±1 only where the estimate clears the noise threshold
    step = jnp.where(
        jnp.abs(gradient_estimate) > threshold,
        jnp.sign(gradient_estimate),
        0
    ).astype(jnp.int32)
    # Add in int32 so that 127 + 1 saturates instead of wrapping around in int8
    updated = W_int8.astype(jnp.int32) + step
    return jnp.clip(updated, -128, 127).astype(jnp.int8)

This update has three notable properties: there is no learning rate (the step size is always $\pm 1$ in int8 space), there is no optimizer state (no momentum, no Adam buffers), and the threshold prevents noise from dominating updates. This extreme simplicity is possible precisely because EGGROLL's large population ($N = 2^{20} \approx 10^6$) provides a high-quality gradient estimate, and the int8 parameter space is discrete with only 256 possible values per weight.

29.4.4 Hardware Implications

The H100 GPU delivers a peak of roughly 1,979 TOPS for dense int8 operations, about twice the 989 TFLOPS available for dense float16/bfloat16. Because EGGROLL with int8 requires no backward pass, no float-precision optimizer state, and no activation storage, the EGG configuration achieves 10 million tokens per second on a single H100 with a population of $2^{20}$ members, as reported by the authors (arXiv:2511.16652, Section 5). The paper frames this as a demonstration that "inference IS training": any hardware capable of int8 inference can, with EGGROLL, also train models at near-identical throughput. These throughput numbers are specific to H100 Tensor Cores and the authors' optimized JAX kernels.

29.5 Key Results

Experimental Caveats

All results in this section are reported by the paper's authors (arXiv:2511.16652) under specific hardware and software configurations. Key caveats for interpreting these results:

Throughput numbers (Section 29.5.1) are measured on H100 GPUs with the authors' JAX implementation and XLA compilation. Absolute throughput will vary on different hardware, frameworks, and model architectures.
Reasoning task comparisons (Section 29.5.3) compare EGGROLL with RWKV-7 against GRPO with different base models (LLaMA, Qwen). These are cross-architecture, cross-model comparisons — not controlled experiments with matched base models. The RWKV and Transformer architectures differ in capacity, pretraining data, and tokenization.
Cost estimates (Section 29.6.2) are author-reported and based on standard cloud rental prices at the time of publication. Hardware generations differ between EGGROLL (H100) and some baselines (A100).
None of these results have been independently reproduced at the time of writing, though the open-source code enables verification.

29.5.1 Throughput: Reported 100× Speedup

The headline result is near-parity with inference throughput. On an H100 GPU, the authors report that EGGROLL achieves 91% of pure batch inference throughput for billion-parameter models, while naive ES achieves approximately 1% at large population sizes. The following table summarizes the throughput comparison reported in the paper:

Table 29.2: Author-reported throughput comparison on H100 GPU (arXiv:2511.16652, Figure 2). All figures are from the paper's experiments with RWKV-7 models using the authors' JAX implementation.
Configuration	Tokens/Second	% of Inference	GPU Utilization
Pure batch inference	~10M tok/s	100%	~95%
EGGROLL (rank-1)	~9.1M tok/s	91%	~87%
Backprop training	~3.3M tok/s	33%	~90%
Naive ES	~0.1M tok/s	1%	~5%

The reported 100× speedup over naive ES (about 91× using the rounded figures in Table 29.2) comes specifically from the replacement of batched full-rank matrix multiplications with the EGGROLL decomposition (Eq. 29.4). The 9% gap between EGGROLL and pure inference is due to the per-member rank-$r$ vector operations, which are fast but not free. The throughput advantage depends on population size: at very small populations the decomposition offers less benefit, and the 91% figure applies at the large population sizes ($N \sim 10^6$) that are EGGROLL's target regime.

29.5.2 EGG: Integer Language Model Pretraining

The EGG model achieves a test loss of 3.40 bits/byte on MiniPile (character-level), trained entirely in int8 with a population of $2^{20}$ members and 16 sequences shared across the population (arXiv:2511.16652, Section 5, Table 2). While this loss is not competitive with float-precision language models of comparable size, it demonstrates a previously impossible capability: training a nonlinear RNN without any floating-point arithmetic or explicit activation functions. The result is significant as an existence proof rather than a state-of-the-art language modeling result.

29.5.3 LLM Reasoning Tasks

On the Countdown task (constructing arithmetic expressions to reach a target number), EGGROLL with RWKV-7 achieves higher accuracy than the GRPO results reported in the concurrent ES-LLM paper:

Table 29.3: Countdown task accuracy (arXiv:2511.16652, Table 3). Cross-model comparison: EGGROLL uses RWKV-7 while GRPO baselines use Transformer models (LLaMA, Qwen). Different architectures, pretraining data, and tokenizers make direct comparison imprecise.
Model	Architecture	Method	Accuracy
LLaMA-3.2 1B	Transformer	GRPO (ES-LLM paper)	~60%
RWKV 1.5B	Linear RNN	EGGROLL	~65%
Qwen 2.5 1.5B	Transformer	GRPO (ES-LLM paper)	~70%
RWKV 7B	Linear RNN	EGGROLL	~80%
All 7B models	Transformer	GRPO (ES-LLM paper)	~70%

At the 7B scale, EGGROLL with RWKV-7 (80%) outperforms the best reported 7B Transformer results from the ES-LLM paper (~70%). However, these comparisons cross both the training method (EGGROLL vs. GRPO) and the base model (RWKV vs. LLaMA/Qwen Transformers). The higher accuracy could be attributable to EGGROLL's training method, RWKV-7's architecture, the specific pretraining data of the RWKV-7 checkpoint, or some combination. At the 1.5B scale, EGGROLL with RWKV (65%) exceeds LLaMA-3.2 1B with GRPO (60%) but falls below Qwen 2.5 1.5B with GRPO (70%), further illustrating the difficulty of separating method effects from model effects.

On GSM8K (grade school math), the paper reports that EGGROLL with RWKV 1.5B outperforms GRPO with the same RWKV 1.5B model (arXiv:2511.16652, Section 6). This same-model comparison is more controlled and provides stronger evidence for EGGROLL's effectiveness as a training method, though the exact accuracy figures and training budgets should be consulted in the paper for precise comparison. In tabula rasa RL settings (standard benchmarks), EGGROLL matches naive ES performance without the speed penalty.

29.5.4 Data Efficiency

An interesting finding concerns data sharing across the population. The paper compares two strategies: 512 population members sharing each sequence versus only 2 members sharing each sequence (paired). At large population sizes ($2^{20}$), both strategies achieve similar performance, suggesting that EGGROLL can extract useful gradient information even when many population members evaluate the same data. This is significant because it means the data throughput requirement does not scale linearly with population size (arXiv:2511.16652, Section 6.2).
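The data requirement implied by each sharing strategy is simple arithmetic. The sketch below assumes members are assigned to sequences in contiguous groups, which is an illustrative choice rather than something the paper is known to specify; `sequences_needed` and `sequence_index` are hypothetical helper names.

```python
def sequences_needed(pop_size: int, members_per_sequence: int) -> int:
    """Number of distinct sequences consumed per update step."""
    if pop_size % members_per_sequence != 0:
        raise ValueError("population must divide evenly into sharing groups")
    return pop_size // members_per_sequence


def sequence_index(member_id: int, members_per_sequence: int) -> int:
    """Sequence evaluated by a member under contiguous grouping (an assumption)."""
    return member_id // members_per_sequence


pop_size = 2 ** 20
shared_512 = sequences_needed(pop_size, 512)   # 2048 sequences per step
paired_2 = sequences_needed(pop_size, 2)       # 524288 sequences per step
print(shared_512, paired_2, paired_2 // shared_512)
```

Under these assumptions the 512-way sharing strategy consumes 256 times fewer sequences per step than the paired strategy for the same population size.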

29.6 Implementation Details: Cost, Compute, and Reproducibility

29.6.1 Memory Economics

EGGROLL's memory advantage over backprop-based training is substantial. The following comparison is for a 7B-parameter model, based on the analysis in arXiv:2511.16652, Section 3:

Table 29.4: Memory comparison — EGGROLL vs. Adam-based backprop for 7B model (arXiv:2511.16652, Section 3). EGGROLL gradient accumulator size assumes float16 precision matching the model weights.
Component	Backprop + Adam	EGGROLL
Model weights (float16)	14 GB	14 GB
Gradients	14 GB	0 (computed online)
Adam momentum ($m$)	14 GB	0
Adam variance ($v$)	14 GB	0
Activations (backward pass)	20–40 GB	0
Perturbation seeds	0	~8 MB
Gradient accumulator	0	14 GB
Total	76–96 GB	~28 GB

EGGROLL uses roughly a third of the memory of Adam-based training (about 28 GB against 76–96 GB), primarily by eliminating optimizer states (momentum and variance buffers) and activation storage for the backward pass. The per-member perturbation overhead during computation is negligible and transient: approximately 0.01 GB per member for a 7B model (storing only the rank-$r$ factors $A_i, B_i$ during the forward pass, then discarding them and regenerating from the seed during the gradient accumulation pass).
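The totals follow from summing the component rows of Table 29.4. The sketch below reproduces that arithmetic; the 8-byte seed size and the population of $10^6$ are assumptions chosen to match the ~8 MB seed figure, and decimal gigabytes are used throughout.

```python
GB = 1e9
n_params = 7e9
bytes_per_fp16 = 2

weights_gb = n_params * bytes_per_fp16 / GB          # 14 GB of float16 weights

# Backprop + Adam: weights, gradients, and two Adam buffers, as listed in the table.
backprop_fixed_gb = 4 * weights_gb
activations_gb = (20.0, 40.0)                        # range given in the table
backprop_total_gb = tuple(backprop_fixed_gb + a for a in activations_gb)

# EGGROLL: weights, a gradient accumulator, and one seed per population member.
pop_size = 1_000_000                                 # assumed population
seed_bytes = 8                                       # assumed 64-bit seed
seeds_gb = pop_size * seed_bytes / GB
eggroll_total_gb = 2 * weights_gb + seeds_gb

# Fraction of backprop memory used by EGGROLL at each end of the activation range.
ratio_range = (eggroll_total_gb / backprop_total_gb[1],
               eggroll_total_gb / backprop_total_gb[0])
print(backprop_total_gb, eggroll_total_gb, ratio_range)
```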

29.6.2 Training Cost Estimates

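According to the authors' estimates, EGGROLL brings ES to within a small factor of backprop-based training cost at billion-parameter scale, where naive ES is roughly two orders of magnitude more expensive. The following estimates are reported or derived from the paper: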

Table 29.5: Cost comparison for LLM reasoning training (arXiv:2511.16652, Section 7). Cloud cost estimates are author-reported based on standard rental prices at time of publication. Note that EGGROLL uses H100 GPUs while some baselines use A100 GPUs, making direct cost comparison approximate.
Method	Hardware	Time	Est. Cloud Cost
GRPO (1.5B)	4× A100	~4 hours	~$50
Naive ES (1.5B, pop=1K)	4× A100	~400 hours	~$5,000
EGGROLL (1.5B, pop=1M)	4× H100	~4 hours	~$80
GRPO (7B)	8× A100	~12 hours	~$300
EGGROLL (7B, pop=1M)	8× H100	~8 hours	~$320

Provenance and caveats: The cloud cost estimates are author-reported and based on standard H100/A100 rental prices at the time of publication. The hardware configurations and wall-clock times are from the paper's experimental section. These are not independently verified benchmark numbers; they represent the authors' reported experience and cost modeling. Because EGGROLL runs on H100 GPUs (higher per-hour cost) while some baselines use A100 GPUs (lower per-hour cost), the higher hourly price partially offsets any wall-clock advantage in dollar terms: at 7B, EGGROLL finishes sooner (~8 vs. ~12 hours) yet its estimated cost is slightly higher (~$320 vs. ~$300). A fully controlled comparison would require running all methods on the same hardware generation with matched training budgets.
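The hourly rates implied by Table 29.5 can be recovered by dividing each cost by its GPU-hours. The sketch below only rearranges the table's own figures; the helper name `implied_rate` is hypothetical, and the resulting rates are derived values rather than prices quoted by the paper.

```python
# (method, gpu_type, n_gpus, hours, estimated_cost_usd), copied from Table 29.5
rows = [
    ("GRPO (1.5B)", "A100", 4, 4, 50),
    ("Naive ES (1.5B, pop=1K)", "A100", 4, 400, 5000),
    ("EGGROLL (1.5B, pop=1M)", "H100", 4, 4, 80),
    ("GRPO (7B)", "A100", 8, 12, 300),
    ("EGGROLL (7B, pop=1M)", "H100", 8, 8, 320),
]


def implied_rate(n_gpus: int, hours: float, cost_usd: float) -> float:
    """Dollars per GPU-hour implied by a table row."""
    return cost_usd / (n_gpus * hours)


rates = {name: implied_rate(n, h, c) for name, _, n, h, c in rows}
gpu_hours = {name: n * h for name, _, n, h, _ in rows}

for name, gpu_type, *_ in rows:
    print(f"{name:26s} {gpu_type}  {gpu_hours[name]:5d} GPU-h  ${rates[name]:.3f}/GPU-h")
```

The A100 rows all imply the same rate and the H100 rows imply a higher one, which makes the table internally consistent and shows that the 7B cost gap comes from the hourly price rather than from GPU-hours.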

29.6.3 Reproducibility Assessment

EGGROLL scores well on reproducibility. All core components are open-source: the EGGROLL algorithm (JAX) in HyperscaleES, the single-file EGG training in nano-egg, and the RWKV-7 JAX port in jaxrwkv. Pre-trained RWKV-7 weights are available on HuggingFace under Apache 2.0. The MiniPile dataset, Countdown task, and GSM8K are all publicly available. The algorithm is described in full mathematical detail with proofs in the paper's appendix.

The primary barrier to reproduction is hardware. Full paper reproduction requires access to H100 GPUs — the headline throughput numbers specifically depend on H100 Tensor Core performance. Minimum hardware requirements range from 1× A100 80GB for nano-egg training to 8× H100 for RWKV 7B fine-tuning and up to 64× H100 for full paper reproduction (arXiv:2511.16652, Section 7). The nano-egg repository explicitly encourages community contributions in the spirit of the nanogpt speedrun.

29.7 Comparative Analysis

29.7.1 EGGROLL in the ES Landscape

Table 29.6: Evolution strategies at scale — historical comparison (arXiv:2511.16652, Table 1). All throughput figures are relative to each system's own baseline hardware.
System	Year	Perturbation	Population	Model Scale	Throughput
OpenAI ES (Salimans et al.)	2017	Full-rank	~1,000	Small NNs (MuJoCo)	Baseline
Uber ES (novelty + ES)	2018	Full-rank	~1,000	Small NNs (Atari)	~1× baseline
ES-LLM	2025	Full-rank	~10	1–7B LLMs	~1× (avoids batched matmul)
LoRA + ES	2025	Low-rank adapters	~100	1–7B LLMs	Moderate
EGGROLL	2025	Low-rank perturbations	~1,000,000	1–7B LLMs	~100× over naive ES (reported)

The critical difference between EGGROLL and all prior ES work is the population scale. OpenAI ES and its successors were limited to populations of $\sim$1,000 members for small neural networks. ES-LLM compensates for small populations ($\sim$10) by using many rollouts per member to reduce variance. EGGROLL enables populations three orders of magnitude larger than the $\sim$1,000 of OpenAI-style ES (and five orders larger than ES-LLM's $\sim$10). Because the variance of the gradient estimate scales as $O(1/N)$ for $N$ i.i.d. perturbations, going from $10^3$ to $10^6$ members reduces it by a factor of 1,000, enabling stable updates with larger learning rates.
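The $O(1/N)$ scaling can be checked empirically on a toy problem. The sketch below assumes a linear fitness $F(\theta) = c^\top \theta$ evaluated at $\theta = 0$ and uses full-rank Gaussian perturbations rather than EGGROLL's low-rank ones, purely to isolate the effect of population size; the function names and all sizes are illustrative.

```python
import numpy as np


def es_gradient(c, n_pop, sigma, rng):
    """Vanilla ES gradient estimate for F(theta) = c . theta at theta = 0."""
    eps = rng.standard_normal((n_pop, c.size))
    fitness = (sigma * eps) @ c                      # F(theta + sigma * eps_i)
    return (fitness[:, None] * eps).mean(axis=0) / sigma


def mean_squared_error(c, n_pop, trials, rng, sigma=0.1):
    """Average squared distance between the estimate and the true gradient c."""
    errors = [np.sum((es_gradient(c, n_pop, sigma, rng) - c) ** 2)
              for _ in range(trials)]
    return float(np.mean(errors))


rng = np.random.default_rng(0)
c = rng.standard_normal(20)                          # true gradient of F

mse_small = mean_squared_error(c, n_pop=10, trials=200, rng=rng)
mse_large = mean_squared_error(c, n_pop=1000, trials=200, rng=rng)
variance_ratio = mse_small / mse_large               # expected to be near 100
print(mse_small, mse_large, variance_ratio)
```

Increasing the population by a factor of 100 should shrink the squared error by roughly the same factor, up to sampling noise in the 200 trials.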

29.7.2 EGGROLL vs. Backprop-Based LLM Training

Table 29.7: Training paradigm comparison (arXiv:2511.16652, Section 2). Throughput percentages are author-reported on H100 hardware.
Property	Backprop + Adam	GRPO	Standard ES	EGGROLL
Gradient type	Exact (autodiff)	Policy gradient	ES estimate (noisy)	ES estimate (low-rank, noisy)
Requires differentiable loss	Yes	Yes (for KL penalty)	No	No
Requires backward pass	Yes	Yes	No	No
Memory overhead	High (activations + optimizer)	Medium	Very high (perturbations)	Low
Throughput (% of inference)	~33%	~30%	~1%	~91% (reported, H100)
Population size	1 (microbatch)	16–64	~1,000	~1,000,000
Integer-only training	No	No	Theoretically yes	Demonstrated (EGG)
Non-differentiable components	No	Limited	Yes	Yes

29.7.3 Relationship to LLM-Powered Evolutionary Systems

It is important to distinguish EGGROLL from the LLM-powered evolutionary systems surveyed elsewhere in this book (AlphaEvolve, FunSearch, OpenEvolve, etc.). Those systems use LLMs as mutation operators — the LLM proposes code modifications that are then evaluated. EGGROLL is fundamentally different: it trains LLMs (or any neural network) via evolution strategies. The LLM is the object being optimized, not the optimizer:

$$\text{AlphaEvolve:} \quad \text{LLM} \xrightarrow{\text{generates}} \text{code mutations} \xrightarrow{\text{evaluates}} \text{fitness}$$

$$\text{EGGROLL:} \quad \text{random} \xrightarrow{\text{perturbs}} \text{LLM weights} \xrightarrow{\text{evaluates}} \text{fitness} \xrightarrow{\text{updates}} \text{LLM weights}$$

This positions EGGROLL as complementary to rather than competitive with program synthesis systems. In principle, EGGROLL could be used to train the LLMs that serve as mutation operators in AlphaEvolve-style systems, particularly for non-differentiable reward signals such as code execution correctness.
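The perturb, evaluate, update loop in the EGGROLL diagram can be made concrete on a toy problem. This is a minimal sketch under stated assumptions, not the authors' code: it uses NumPy, rank-1 perturbations, z-scored fitness (a common ES convention assumed here), and a small quadratic fitness that is treated as a black box. All names and hyperparameters are illustrative.

```python
import numpy as np


def fitness(W, target):
    """Black-box score (higher is better); only its value is used, never a gradient."""
    return -float(np.sum((W - target) ** 2))


def eggroll_step(W, target, rng, pop=256, sigma=0.05, lr=0.2):
    m, n = W.shape
    A = rng.standard_normal((pop, m))        # rank-1 factors, one pair per member
    B = rng.standard_normal((pop, n))
    # Evaluate each member on its perturbed weights W + sigma * a_i b_i^T.
    scores = np.array([fitness(W + sigma * np.outer(A[i], B[i]), target)
                       for i in range(pop)])
    z = (scores - scores.mean()) / (scores.std() + 1e-8)   # normalized fitness
    # Aggregate update: (1/pop) * sum_i z_i a_i b_i^T, without storing each outer product.
    update = (A * z[:, None]).T @ B / pop
    return W + lr * update


rng = np.random.default_rng(0)
target = rng.standard_normal((8, 8))
W = np.zeros((8, 8))

initial_loss = -fitness(W, target)
for _ in range(200):
    W = eggroll_step(W, target, rng)
final_loss = -fitness(W, target)
print(initial_loss, final_loss)
```

No LLM proposes anything in this loop: randomness generates the candidates, and the model weights are the object being improved.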

29.8 RWKV-7 Integration and LLM Choice

EGGROLL's primary LLM experiments use RWKV-7 ("Goose"), a linear-attention recurrent model, rather than a Transformer architecture. This is a deliberate engineering choice, not an algorithmic limitation. RWKV-7 uses constant memory during generation: its recurrent state is fixed-size regardless of sequence length. In contrast, Transformer models accumulate a growing KV-cache during autoregressive generation, creating memory management challenges when running thousands of parallel population members simultaneously. With EGGROLL's population sizes of $\sim 10^6$, predictable per-member memory is critical for fitting within GPU memory budgets.
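The difference in per-member memory can be illustrated with back-of-the-envelope arithmetic. The sketch below uses hypothetical model dimensions (24 layers, 32 heads, head size 64, float16) that are not taken from the paper, and it simplifies the recurrent state to one head-size by head-size matrix per head per layer, ignoring smaller auxiliary state.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Transformer KV-cache: keys and values for every cached token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def recurrent_state_bytes(n_layers, n_heads, head_dim, bytes_per_value=2):
    """Simplified linear-RNN state: one (head_dim x head_dim) matrix per head per layer."""
    return n_layers * n_heads * head_dim * head_dim * bytes_per_value


# Hypothetical dimensions, for illustration only.
layers, heads, head_dim = 24, 32, 64

kv_1k = kv_cache_bytes(layers, heads, head_dim, seq_len=1024)
kv_2k = kv_cache_bytes(layers, heads, head_dim, seq_len=2048)
state = recurrent_state_bytes(layers, heads, head_dim)

print(f"KV-cache at 1024 tokens: {kv_1k / 1e6:.1f} MB")
print(f"KV-cache at 2048 tokens: {kv_2k / 1e6:.1f} MB")
print(f"Recurrent state (any length): {state / 1e6:.1f} MB")
```

Under these assumptions the KV-cache doubles when the sequence length doubles, while the recurrent state stays fixed, so the per-member budget can be planned before generation starts.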

The RWKV-7 models used are pre-trained weights from HuggingFace (BlinkDL), available under Apache 2.0. The JAX port at bsarkar321/jaxrwkv wraps the model such that EGGROLL perturbs the time-mixing (linear attention) and channel-mixing (FFN variant) weights in each RWKV block. The authors note that Transformer support via vLLM/Megatron integration is in progress to remove this architectural restriction (arXiv:2511.16652, Section 7).

The following pseudocode illustrates how EGGROLL perturbations are applied to the RWKV-7 architecture. It was written for this chapter from the algorithm description in the paper and the general structure of the jaxrwkv repository. Helper functions such as embed, initial_state, eggroll_time_mixing, and eggroll_channel_mixing are placeholders, and the actual implementation's function names, parameter organization, and control flow may differ.

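# RWKV-7 integration pattern for EGGROLL: illustrative pseudocode written
# for this chapter, showing how EGGROLL perturbations are applied to RWKV-7 blocks.
# Actual implementation in bsarkar321/jaxrwkv may differ in structure.
# embed, initial_state, eggroll_time_mixing and eggroll_channel_mixing are
# placeholder helpers that are not defined here.

import jax
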
def eggroll_rwkv_forward(params, x, thread_id, base_key, sigma, rank=1):
    """Forward pass through RWKV-7 with EGGROLL perturbations.
    
    Shared base computation + per-member rank-r perturbations
    applied to time_mixing and channel_mixing weight matrices.
    
    RWKV-7's fixed-size recurrent state is critical: unlike Transformers,
    memory per population member does not grow with sequence length.
    """
    hidden = embed(params['embedding'], x)  # Token embedding (shared)
    
    state = initial_state(params)  # Fixed-size recurrent state
    
    for block_idx in range(params['n_layers']):
        block_params = params[f'block_{block_idx}']
        
        # Each block gets a unique sub-key, and each sub-module gets its own
        # child key. Deriving both children with small fixed indices avoids
        # reusing block_key directly, which could collide with a key derived
        # from block_key and a member index such as thread_id = 1000000.
        block_key = jax.random.fold_in(base_key, block_idx)
        time_key = jax.random.fold_in(block_key, 0)
        channel_key = jax.random.fold_in(block_key, 1)

        # Time mixing (linear attention), EGGROLL-perturbed
        hidden, state = eggroll_time_mixing(
            time_key, sigma, block_params['time_mix'],
            hidden, state, thread_id, rank
        )

        # Channel mixing (FFN variant), EGGROLL-perturbed
        hidden = eggroll_channel_mixing(
            channel_key, sigma, block_params['channel_mix'],
            hidden, thread_id, rank
        )
    
    logits = hidden @ params['head'].T  # Language model head
    return logits, state

29.9 Applications and Future Directions

29.9.1 Immediate Applications

LLM post-training for reasoning. EGGROLL's most immediately practical application is replacing or supplementing GRPO/RLHF for reasoning tasks. Because it requires only a scalar fitness signal, EGGROLL can optimize for non-differentiable rewards such as exact-match correctness, code execution outcomes, or multi-step tool-use success without reward model training or KL penalty tuning. The Countdown and GSM8K results demonstrate competitive performance at 1.5B and 7B scales, though with the cross-model caveats noted in Section 29.5.3 (arXiv:2511.16652, Section 6).

Novel architecture exploration. EGGROLL enables training architectures that are fundamentally impractical with backpropagation: pure integer neural networks (demonstrated with EGG), lookup-table layers, discrete attention mechanisms, spiking neural networks, and neuromorphic hardware-in-the-loop systems. Any architecture that can perform a forward pass and produce an evaluable output can be trained with EGGROLL.

29.9.2 Research Directions Identified by the Authors

The paper identifies several unexplored research directions. Neurosymbolic optimization is highlighted as a key target: EGGROLL can optimize end-to-end through systems combining differentiable neural components with non-differentiable symbolic reasoners, discrete tool calls, or code execution. The authors specifically mention the ROSA architecture for RWKV-8 (which includes a discrete memory system) and LLMs with external tool calls as targets (arXiv:2511.16652, Section 7).

Discrete diffusion models are another target: these models generate text by iteratively demasking tokens, and the masking/demasking procedure is non-differentiable, making standard policy gradients technically intractable. EGGROLL's black-box fitness evaluation makes it directly applicable. Multi-agent optimization connects to the group's prior work on social deduction games, with the authors suggesting that EGGROLL could "directly optimize LLMs with multi-agent awareness, breaking the best-of-$k$ curse of RL" (arXiv:2511.16652, Section 7).

29.9.3 The "Inference IS Training" Vision

The paper's most provocative claim is that EGGROLL collapses the distinction between inference and training. Under traditional backpropagation, training costs 3–5× more than inference due to backward passes and optimizer states. With EGGROLL, the overhead is approximately 10% over pure inference on H100 hardware. This implies that any system capable of batched inference can simultaneously train with minimal additional cost: edge devices could self-improve, inference servers could continuously fine-tune, and deployment and training could become a single unified operation. While this vision is not yet realized at Transformer scale or for supervised pretraining, the EGG and RWKV results provide the first evidence that it may be achievable for specific architectures and tasks. Whether the 91% throughput ratio holds across different hardware platforms, model architectures, and population sizes remains an open empirical question.
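The relationship between the throughput percentages and the cost multiples quoted above is a reciprocal. The short sketch below makes that conversion explicit using the author-reported 91% and 33% figures from Table 29.2; `relative_cost` is a hypothetical helper name.

```python
def relative_cost(throughput_fraction: float) -> float:
    """Cost per token relative to pure inference, given throughput as a fraction of it."""
    if not 0.0 < throughput_fraction <= 1.0:
        raise ValueError("throughput fraction must be in (0, 1]")
    return 1.0 / throughput_fraction


eggroll_cost = relative_cost(0.91)                   # about 1.10x inference
backprop_cost = relative_cost(0.33)                  # about 3.03x inference
eggroll_overhead_pct = (eggroll_cost - 1.0) * 100.0  # about 9.9 percent

print(f"EGGROLL: {eggroll_cost:.2f}x inference ({eggroll_overhead_pct:.1f}% overhead)")
print(f"Backprop: {backprop_cost:.2f}x inference")
```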

29.10 Limitations and Open Questions

29.10.1 Known Limitations

Table 29.8: EGGROLL limitations and current status (arXiv:2511.16652, Section 7)
Limitation	Impact	Current Status / Mitigation
Sample efficiency	ES requires many fitness evaluations per parameter update	Compensated by high throughput (~91% of inference speed on H100)
Gradient quality at rank-1	May miss important gradient directions in low dimensions	Increase rank $r$; consistency theorem guarantees improvement at rate $O(1/r)$
No second-order information	Cannot exploit loss curvature	Larger population partially compensates; natural gradient variants possible
Transformer KV-cache	Growing KV-cache memory prevents massive populations	vLLM/Megatron port listed as future work; RWKV used as workaround
Hardware requirements	Headline results require H100 GPUs	Scales down to A100, but with less dramatic speedups
JAX ecosystem	Smaller user community than PyTorch	PyTorch port listed as underway
Supervised pretraining	Not demonstrated competitive with backprop for standard supervised learning	Not a target — EGGROLL is designed for tasks where backprop is insufficient
Cross-model evaluation	Reasoning results compare different base architectures (RWKV vs. Transformers)	Same-model comparison (RWKV GRPO vs. RWKV EGGROLL) shown only for GSM8K 1.5B

29.10.2 Open Questions

Several meta-learning directions remain unexplored: adaptive rank selection (varying $r$ during training based on gradient quality), learned perturbation scale $\sigma$ per layer, population scheduling (large $N$ early, smaller late), and multi-fidelity fitness evaluation (cheap approximations early, expensive evaluation later). The paper provides a consistency theorem that is asymptotic ($d \to \infty$); it does not provide finite-sample convergence rates with dependence on problem-specific constants such as the Lipschitz constant of $F$ or the conditioning of the loss landscape. Empirical investigation of how EGGROLL's gradient quality degrades in moderate-dimensional settings (e.g., small layers with $\min(m,n) < 100$) would be valuable.

A fundamental open question is whether EGGROLL's approach extends to large-scale supervised pretraining. The current results focus on reasoning fine-tuning (where non-differentiable rewards justify the ES approach) and integer architecture pretraining (where backprop is unavailable). Whether EGGROLL can be competitive with backprop for standard differentiable objectives at pretraining scale remains undemonstrated and — as the authors acknowledge — is not the intended use case.

29.11 Summary

Key Takeaway

EGGROLL transforms evolution strategies from a computationally prohibitive theoretical curiosity into a practical training method for billion-parameter models by structuring perturbations as low-rank matrices. The resulting algorithm achieves a reported 91% of pure inference throughput on H100 hardware — a 100× speedup over naive ES — while preserving full-rank parameter updates through population aggregation.

Main Contribution to the Field

EGGROLL introduces rank-$r$ structured perturbations with a consistency theorem guaranteeing convergence to the standard ES gradient at rate $O(1/r)$ as parameter dimension grows. This enables population sizes of $\sim 10^6$ for billion-parameter models, making ES cost-competitive with backprop-based methods for LLM reasoning training under the authors' reported experimental conditions. The EGG experiment further demonstrates pure int8 neural network pretraining, a capability fundamentally unavailable to gradient-based methods.

What a Researcher Should Know

EGGROLL is not a general replacement for backpropagation. Its value is in domains where backprop is impossible or insufficient: non-differentiable fitness functions, integer-only architectures, end-to-end optimization through discrete components, and training at inference speed. The key equations are: (1) the perturbation $W_{\text{perturbed}} = W + \sigma \, A_i B_i^\top$ with $A_i \in \mathbb{R}^{m \times r}$, $B_i \in \mathbb{R}^{n \times r}$, and (2) the forward-pass decomposition $y_i = xW^\top + \sigma(xB_i)A_i^\top$, which converts the ES population evaluation from batched full-rank matrix multiplications (GPU-unfriendly) into a shared standard matmul plus batched vector operations (GPU-friendly). The theoretical guarantee that low-rank perturbations aggregate into full-rank updates (Eq. 29.5) distinguishes EGGROLL from LoRA-based ES approaches and is the foundation of its expressiveness.
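The claim that low-rank perturbations aggregate into a full-rank update can be verified directly. The sketch below is illustrative only: it uses NumPy, rank-1 factors, and random weights as a stand-in for normalized fitness scores, then compares the rank of a single perturbation with the rank of the weighted sum over the population.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, pop = 16, 12, 64                      # illustrative sizes only

A = rng.standard_normal((pop, m))           # rank-1 factors a_i
B = rng.standard_normal((pop, n))           # rank-1 factors b_i
weights = rng.standard_normal(pop)          # stand-in for normalized fitness

single = np.outer(A[0], B[0])               # one member's perturbation a_0 b_0^T
aggregate = (A * weights[:, None]).T @ B / pop   # (1/pop) * sum_i w_i a_i b_i^T

rank_single = int(np.linalg.matrix_rank(single))
rank_aggregate = int(np.linalg.matrix_rank(aggregate))
print(rank_single, rank_aggregate, min(m, n))
```

Each member contributes a rank-1 direction, but because every member draws fresh factors, the sum spans up to $\min(m, n)$ directions once the population is at least that large. An approach that perturbs only a fixed set of low-rank adapters would keep its update inside that adapter subspace.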

@misc{kinas2026evosurvey,
  author = {Kinas, Remek},
  title  = {Evolutionary AI Survey},
  year   = {2026},
  url    = {https://evo.si5.pl}
}