
EGGROLL — Evolution Strategies at the Hyperscale

Low-Rank Evolution Strategies Achieving 100x Speedup for Billion-Parameter Model Training

Organization: University of Oxford, MILA, NVIDIA
Published: November 20, 2025
Type: Paper (arXiv:2511.16652) + Blog + Code
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Evolution Strategies at the Hyperscale

Algorithm Name: EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning

ArXiv: 2511.16652 (cs.LG / cs.AI)

Project Page: eshyperscale.github.io

Code Repositories: - Main library: ESHyperscale/HyperscaleES (JAX-based) - Single-file EGG training: ESHyperscale/nano-egg - RWKV-7 in JAX: bsarkar321/jaxrwkv

AlphaXiv Discussion: alphaxiv.org/abs/2511.16652

Submission History: - v1: November 20, 2025 - v2: February 16, 2026 (revised with extended experiments)

First Public Commit: August 13, 2025 — jaxrwkv commit 6d92566 (early EGGROLL prototype)

Lineage: Builds on OpenAI's Evolution Strategies (Salimans et al., 2017), Noise-Reuse ES (Vicol et al., 2023), and structural insights from LoRA (Hu et al., 2022). Concurrent with and compared against ES-LLM (arXiv:2509.24372).

2 Authors and Team

Core Authors

Author Affiliation Role Marker
Bidipta Sarkar Oxford / Stanford Co-lead, algorithm design, RWKV integration * (equal)
Mattie Fellows Oxford Co-lead, theoretical analysis * (equal)
Juan Agustin Duque MILA Co-lead, implementation * (equal)
Shimon Whiteson Oxford / Waymo Senior advisor * (equal)
Jakob Nicolaus Foerster Oxford Senior advisor, FORL group lead * (equal)
Aaron Courville MILA Senior advisor
Karin Sevegnani NVIDIA Industry advisor
Alexander David Goldie Oxford Contributing author

Contributing Authors (†)

Author Contribution Area
Alistair Letcher Theoretical convergence analysis
Antonio León Villares Implementation, benchmarking
Anya Sims EGG architecture design
Clarisse Wibault Experiments
Dmitry Samsonov GPU kernel optimization
Dylan Cope RL experiments
Jarek Liesen Infrastructure
Kang Li Throughput benchmarking
Lukas Seier Integer arithmetic experiments
Theo Wolf RWKV integration
Uljad Berdica Experiments
Valentin Mohl Theory

Research Group Context

The work originates primarily from the Foundations of Reinforcement Learning (FORL) group at Oxford, led by Jakob Foerster. The group has a track record in multi-agent RL, zero-shot coordination, and policy gradient methods. EGGROLL represents a deliberate pivot toward gradient-free optimization methods as a complement (and potential alternative) to backpropagation-based training. The collaboration with MILA (Aaron Courville's group) and NVIDIA brings scalability expertise and hardware acceleration knowledge.

Bidipta Sarkar's prior work on Social Deduction LLM (using RWKV for multi-agent Among Us) directly motivated the choice of RWKV as the LLM architecture for EGGROLL experiments.

3 Core Contribution

Key Novelty: EGGROLL eliminates the computational barrier between inference and training for evolution strategies by structuring perturbations as rank-r matrices, achieving a hundredfold speedup over naive ES for billion-parameter models while preserving full-rank parameter updates through population aggregation.

The Fundamental Problem

Standard Evolution Strategies (ES) perturb each parameter independently:

θ_perturbed = θ + σ · ε,   where ε ~ N(0, I_d)

For a weight matrix W ∈ R^{m×n}, this requires: - Storing m × n random numbers per population member - Computing a batched matrix multiplication: x @ (W + σ·ε)^T - GPU inefficiency: Batched matmuls with unstructured perturbations have low arithmetic intensity on modern GPUs (many random memory accesses, poor cache utilization)

At scale (billions of parameters, millions of population members), this becomes prohibitively slow — the batched matmuls dominate runtime and achieve far less than peak GPU FLOPS.

EGGROLL's Solution

Replace unstructured perturbations with rank-r structured perturbations:

θ_perturbed = θ + σ · A · B^T,   where A ∈ R^{m×r}, B ∈ R^{n×r}, entries sampled i.i.d. Gaussian

This transforms the computation:

NAIVE ES:   y = x @ (W + σ·ε)^T                     → Batched matmul (slow)
EGGROLL:    y = x @ W^T + σ · (x @ B) @ A^T          → Standard matmul + batched outer product (fast)

The key insight: x @ W^T is a standard (non-batched) matrix multiplication shared by every population member, while (x @ B) @ A^T touches only the thin rank-r factors (at r = 1, a pair of batched matrix-vector products), so the per-member work is small and the overall computation retains high arithmetic intensity.
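The algebraic identity behind this decomposition is easy to verify numerically. A minimal NumPy sketch (the paper's code is in JAX, but the algebra is framework-agnostic; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, batch, r, sigma = 8, 6, 4, 1, 0.1

W = rng.standard_normal((m, n))
x = rng.standard_normal((batch, n))
A = rng.standard_normal((m, r))   # left factor of the perturbation
B = rng.standard_normal((n, r))   # right factor

# Naive ES: materialize the full perturbed weight matrix
y_naive = x @ (W + sigma * A @ B.T).T

# EGGROLL: shared matmul plus a cheap low-rank correction
y_eggroll = x @ W.T + sigma * (x @ B) @ A.T

assert np.allclose(y_naive, y_eggroll)
```

The two paths compute identical outputs; only the second avoids ever forming the perturbed weight matrix.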

Why This Works (Theoretical Guarantee)

Although individual perturbations are rank-r, the aggregated update across the population is full-rank:

Update = (1/σ) · E[F(θ + σ·A·B^T) · A·B^T]

Each A·B^T has rank at most r, but the sum of N rank-r matrices (one per population member)
has rank up to min(N·r, min(m,n)), which is full rank whenever N·r ≥ min(m,n).

The paper proves a consistency theorem: as the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate. The convergence rate is O(1/r), meaning even rank-1 perturbations provide useful gradient information.

Relationship to Prior Work

System Year Approach Population Size Scale Speed
OpenAI ES 2017 Full-rank perturbations ~1,000 MuJoCo (small NNs) Baseline
Uber ES 2018 Novelty search + ES ~1,000 Atari (small NNs) ~1x baseline
ES on LLMs 2025 Small population, many rollouts ~10 1-7B LLMs ~1x (avoids batched matmul)
LoRA + ES 2025 Low-rank adapters only ~100 1-7B LLMs Moderate
EGGROLL 2025 Low-rank perturbations, full-rank updates ~1,000,000 1-7B LLMs ~100x over naive ES

Critical Distinction: EGGROLL vs. LoRA + ES

Using ES to optimize LoRA adapters directly restricts the update to low-rank forever. EGGROLL uses low-rank perturbations but accumulates them into full-rank parameter updates, making it strictly more expressive:

LoRA + ES:    W' = W + A·B^T                              (always rank ≤ r)
EGGROLL:      W' = W + Σ_i fitness_i · (A_i · B_i^T)      (rank up to N·r)

This distinction is critical for pretraining (requires full-rank updates) vs. fine-tuning (where low-rank may suffice).
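The rank claim can be checked directly. A toy NumPy sketch (sizes and stand-in fitness weights are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, N = 16, 12, 1, 64            # N * r >= min(m, n), so the sum can be full rank

# A single LoRA-style low-rank update stays at rank r forever
single = rng.standard_normal((m, r)) @ rng.standard_normal((n, r)).T

# An EGGROLL update: fitness-weighted sum of N independent rank-r terms
update = np.zeros((m, n))
for _ in range(N):
    fitness = rng.standard_normal()   # stand-in fitness score
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    update += fitness * (A @ B.T)

assert np.linalg.matrix_rank(single) == r           # stuck at rank 1
assert np.linalg.matrix_rank(update) == min(m, n)   # full rank
```

Random Gaussian factors make the aggregated update full rank almost surely once N·r reaches min(m, n).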

4 Supported Solutions

EGGROLL is a general-purpose optimization algorithm applicable to any differentiable or non-differentiable objective. The paper demonstrates four solution domains:

Solution Domain Description Model Task
Pure integer pretraining Train a nonlinear RNN entirely in int8 EGG (Evolved Generative GRU) Character-level LM on MiniPile
LLM reasoning (fine-tuning) Post-training for mathematical reasoning RWKV-7 (1.5B, 7B) Countdown, GSM8K
Tabula rasa RL Standard RL benchmark training Small NNs MuJoCo-like environments
Non-differentiable optimization Systems with discrete/non-differentiable components Any architecture Any fitness function

What EGGROLL Enables That Backprop Cannot

Capability Backprop EGGROLL
Integer-only training No (requires float gradients) Yes (only needs forward pass)
Non-differentiable activations No Yes
Black-box fitness functions No (needs differentiable loss) Yes
Training without activation functions Impractical (vanishing gradients) Yes (demonstrated with EGG)
Optimizing discrete components Requires relaxation (Gumbel-Softmax, etc.) Direct optimization
Hardware-in-the-loop training No (non-differentiable hardware) Yes (just needs input → output)
Multi-agent end-to-end optimization Limited (credit assignment) Natural (fitness = team outcome)

What EGGROLL Does NOT Target

  • Supervised learning at scale — Backprop remains more sample-efficient for standard supervised tasks
  • Single-example gradient computation — EGGROLL requires a population, minimum batch size is the population
  • Low-latency training — EGGROLL's strength is throughput, not latency per update
  • Tasks where backprop works well — No reason to replace backprop for standard differentiable objectives

5 LLM Integration

EGGROLL as an LLM Training Method

Unlike systems that use LLMs as mutation operators (AlphaEvolve, FunSearch), EGGROLL trains LLMs directly via evolution strategies. The LLM is the object being optimized, not the optimizer.

AlphaEvolve:    LLM ──generates──► Code mutations ──evaluates──► Fitness
EGGROLL:        Random ──perturbs──► LLM weights ──evaluates──► Fitness ──updates──► LLM weights

LLM Architecture: RWKV-7

EGGROLL's primary LLM experiments use RWKV-7 ("Goose"), a linear-attention recurrent model:

Property Value Why Chosen
Architecture Linear RNN (RWKV-7) Constant memory per token during generation
Sizes tested 1.5B, 7B parameters Demonstrates billion-scale feasibility
Base model Pre-trained RWKV-7 Goose Reasoning traces already in pretraining data
Framework JAX Efficient vmap for population parallelism
KV cache Fixed size (unlike Transformers) No dynamic memory allocation during generation

Why not Transformers? The growing KV-cache in Transformer architectures creates memory management challenges when running thousands of parallel population members. RWKV's fixed-size state means memory is predictable and constant regardless of sequence length. This is a practical engineering constraint, not an algorithmic limitation — EGGROLL's math works with any architecture.

Fitness Functions for LLM Training

Task Fitness Function Details
Countdown Correctness of countdown sequence Binary reward: correct final answer or not
GSM8K Mathematical answer accuracy Binary reward: correct numerical answer
Pretraining (EGG) Cross-entropy loss Bits per byte on MiniPile test set
General RL Environment reward Task-specific cumulative reward

Comparison with GRPO

EGGROLL is positioned as a competitor to GRPO (Group Relative Policy Optimization) for LLM reasoning:

Method Gradient Type Requirements Population
GRPO Policy gradient (backprop) Differentiable loss, float arithmetic Group size ~16-64
EGGROLL ES gradient (forward-pass only) Any fitness function, any arithmetic Population up to ~10^6

6 Key Results

6.1 Throughput: 100x Speedup

The headline result — EGGROLL achieves up to 91% of pure batch inference throughput:

Throughput vs. Population Size (Billion-parameter model, H100)

Tokens/sec
(millions)
    │
 10 │ ●──●──●──●──●──●──●──●  Pure inference (upper bound)
    │ ○──○──○──○──○──○──○──○  EGGROLL (91% of inference)
  8 │
    │
  6 │
    │
  4 │
    │
  2 │                           ×
    │              ×
  0 │ ×──×──×  Naive ES (100x slower at large pop sizes)
    └─────────────────────────── Population size
      2^10  2^12  2^14  2^16  2^18  2^20

The 100x speedup comes from replacing batched matrix multiplications (low arithmetic intensity) with standard matrix multiplications plus batched vector operations (high arithmetic intensity).

6.2 EGG: Pure Integer Language Model Pretraining

EGGROLL enables a previously impossible experiment: training a nonlinear RNN entirely in int8:

Property Value
Architecture EGG (Evolved Generative GRU)
Dimensions D=256, L=6 layers
Datatypes All weights int8, computations int8 with int32 accumulation
Activation functions None (int8 overflow provides implicit nonlinearity)
Dataset MiniPile (character-level)
Tokens per second 10 million (single H100)
Population size 2^20 = 1,048,576
Test loss 3.40 bits/byte
Sequence sharing 16 sequences shared across population

Training Loss (EGG, bits/byte)
    │
 5.0│ ●
    │  ●
 4.5│   ●
    │    ●
 4.0│     ●●
    │       ●●
 3.5│         ●●●●●●
    │               ●●●●●●●●●●●●  → 3.40 bits/byte
 3.0│
    └──────────────────────────── Training steps
     0      500    1000   1500   2000

Key insight: The int8 tensor multiplication followed by int32 accumulation and cast back to int8 introduces implicit nonlinearity through overflow/saturation. This means no explicit activation functions are needed — the arithmetic format itself provides nonlinearity.

6.3 LLM Reasoning: Countdown Task

On the countdown task, EGGROLL outperforms GRPO and matches or exceeds concurrent ES-LLM results:

Model Method Accuracy
LLaMA-3.2 1B GRPO (ES-LLM paper) ~60%
RWKV 1.5B EGGROLL ~65%
Qwen 2.5 1.5B GRPO (ES-LLM paper) ~70%
RWKV 7B EGGROLL ~80%
All 7B models GRPO (ES-LLM paper) ~70%

EGGROLL with RWKV-7B outperforms all reported 7B results from the concurrent ES-LLM paper despite using a weaker base model (RWKV vs. LLaMA/Qwen).

6.4 LLM Reasoning: GSM8K

On GSM8K (grade school math), EGGROLL also outperforms GRPO:

Model Method Accuracy
RWKV 1.5B GRPO Baseline
RWKV 1.5B EGGROLL Outperforms GRPO

6.5 Data Efficiency

EGGROLL demonstrates surprising data efficiency at large population sizes:

Population Size vs. Data Sharing

Solid lines:  512 population members share each sequence
Dashed lines: Only 2 members share each sequence (paired)

At large population sizes (2^20), both strategies achieve similar
performance — suggesting that ES can extract useful gradient signal
even when many population members evaluate the same data.

6.6 Tabula Rasa RL

In standard RL settings, EGGROLL matches naive ES performance without speed compromise:

"EGGROLL does not compromise performance compared to ES in tabula rasa RL settings, despite being faster."

7 Reproducibility

Open-Source Status

Component Available Repository License
EGGROLL algorithm (JAX) Yes HyperscaleES Open
Nano-EGG (single file) Yes nano-egg Open
RWKV JAX implementation Yes jaxrwkv Open
Pre-trained RWKV-7 weights Yes HuggingFace (BlinkDL) Apache 2.0
MiniPile dataset Yes HuggingFace Open
Countdown task Yes Standard benchmark N/A
GSM8K dataset Yes HuggingFace MIT

Reproducibility Assessment

Verdict: Highly reproducible. All core components are open-source, the algorithm is described in full mathematical detail, code is provided in multiple repositories, and the base models (RWKV-7) are freely available. The primary barrier is hardware: the headline experiments require H100 GPUs.

What Can Be Reproduced

  • The complete EGGROLL algorithm (JAX implementation provided)
  • EGG (int8 RNN) training via nano-egg single-file codebase
  • RWKV-7 fine-tuning on countdown and GSM8K
  • Throughput benchmarks on H100 GPUs
  • Theoretical convergence analysis (proofs in paper appendix)

Hardware Requirements for Reproduction

Experiment Minimum Hardware Ideal Hardware
Nano-EGG training 1x A100 80GB 1x H100
RWKV 1.5B fine-tuning 1x A100 80GB 4x H100
RWKV 7B fine-tuning 4x A100 80GB 8x H100
Throughput benchmarks 1x H100 8x H100
Full paper reproduction 8x H100 64x H100

Community Contribution Model

The nano-egg repository explicitly encourages community contributions, modeled after the nanogpt speedrun:

"We highly encourage community contributions, similar to the nanogpt speedrun, to see how efficient we can make pure evolution pretraining in integer formats!"

8 Compute and API Costs

Throughput Economics

EGGROLL's key economic insight is that ES training becomes nearly as cheap as inference. This has profound implications:

Cost Model (EGGROLL vs. Backprop for LLM Training)

                    Backprop              EGGROLL
                    ────────              ───────
Forward pass:       1x FLOPS             1x FLOPS (same)
Backward pass:      ~2x FLOPS            0 FLOPS (no backprop)
Optimizer state:    ~3x memory           0 memory (no Adam state)
Population:         1 (or micro-batch)   N members (parallelized)
                    ────────              ───────
Total FLOPS:        ~3x per sample       ~1x per sample × N population
Total memory:       Model + optimizer    Model + perturbation keys
                    + activations

H100 Throughput Numbers

Configuration Tokens/Second % of Inference GPU Utilization
Pure batch inference ~10M tok/s 100% ~95%
EGGROLL (rank-1) ~9.1M tok/s 91% ~87%
Naive ES ~0.1M tok/s 1% ~5%
Backprop training ~3.3M tok/s 33% ~90%

Memory Costs

Component Naive ES EGGROLL (rank-1)
Perturbation storage per member O(m × n) per layer O(m + n) per layer
Total perturbation memory ~4x model size ~2/min(m,n) × model size
RNG state Single seed per member Single seed per member
Practical impact (7B model) ~28 GB per member ~0.01 GB per member

Cost Comparison for LLM Reasoning Training

Method Hardware Time Estimated Cost (cloud)
GRPO (backprop, 1.5B) 4x A100 ~4 hours ~$50
Naive ES (1.5B, pop=1K) 4x A100 ~400 hours ~$5,000
EGGROLL (1.5B, pop=1M) 4x H100 ~4 hours ~$80
GRPO (backprop, 7B) 8x A100 ~12 hours ~$300
EGGROLL (7B, pop=1M) 8x H100 ~8 hours ~$320

EGGROLL makes ES cost-competitive with backprop-based methods for the first time at billion-parameter scale.

9 Architecture Solution

System Architecture

┌──────────────────────────────────────────────────────────────┐
│                    EGGROLL Training System                    │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Population Manager                    │  │
│  │  ┌──────────────────────────────────────────────────┐  │  │
│  │  │  RNG Key → fold_in(key, thread_id) → (A_i, B_i) │  │  │
│  │  │  For each population member i = 1..N:             │  │  │
│  │  │    - Generate rank-r perturbation (A_i, B_i)      │  │  │
│  │  │    - No storage needed (recomputable from seed)    │  │  │
│  │  └──────────────────────────────────────────────────┘  │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │                  Batched Forward Pass                    │  │
│  │                                                         │  │
│  │  For each layer with weight W ∈ R^{m×n}:               │  │
│  │    y_shared = x @ W^T           (standard matmul, 1x)  │  │
│  │    y_perturb = σ · (x @ B_i) @ A_i^T  (batched, fast) │  │
│  │    y_i = y_shared + y_perturb   (per population member)│  │
│  │                                                         │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │                  Fitness Evaluation                      │  │
│  │                                                         │  │
│  │  For each population member i:                          │  │
│  │    - Generate output sequence                           │  │
│  │    - Compute fitness F_i (task-dependent)               │  │
│  │    - Return scalar fitness value                        │  │
│  │                                                         │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │                                   │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │                   Parameter Update                       │  │
│  │                                                         │  │
│  │  Gradient estimate:                                     │  │
│  │    g = (1/Nσ) Σ_i F_i · A_i · B_i^T                   │  │
│  │                                                         │  │
│  │  Fused directly into parameters:                        │  │
│  │    W ← W + α · g    (full-rank update, rank ≤ N·r)    │  │
│  │                                                         │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Data Flow for One Training Step

Step 1: Sample input data batch
         D = {x_1, x_2, ..., x_B}  (B sequences)

Step 2: Distribute across population
         Each of N population members gets some subset of D
         (or all members share all of D)

Step 3: Forward pass with EGGROLL perturbations
         For member i, layer l:
           y_i^l = x^l @ W_l^T + σ · (x^l @ B_i^l) @ (A_i^l)^T

Step 4: Generate outputs & compute fitness
         F_i = fitness(generate(model_i, input))

Step 5: Compute ES gradient
         For each layer l:
           g_l = (1/Nσ) Σ_i F_i · A_i^l · (B_i^l)^T

Step 6: Update parameters
         W_l ← W_l + α · g_l

Repeat steps 1-6.
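The six steps above can be sketched end-to-end on a toy objective. This is a hedged NumPy mock-up (a linear model with negative-MSE fitness; names, sizes, and hyperparameters are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, N, r, sigma, lr = 4, 6, 512, 1, 0.05, 0.2

W_true = rng.standard_normal((m, n))
W = np.zeros((m, n))
X = rng.standard_normal((32, n))                      # Step 1: data batch
Y = X @ W_true.T                                      # targets for the toy task

def loss(W_i):
    return np.mean((X @ W_i.T - Y) ** 2)

for step in range(200):
    A = rng.standard_normal((N, m, r))                # Step 3: rank-r perturbations
    B = rng.standard_normal((N, n, r))
    W_pert = W + sigma * np.einsum('imr,inr->imn', A, B)
    preds = np.einsum('bn,imn->ibm', X, W_pert)       # all members in one pass
    F = -((preds - Y) ** 2).mean(axis=(1, 2))         # Step 4: scalar fitness per member
    F -= F.mean()                                     # fitness centering (variance reduction)
    g = np.einsum('i,imr,inr->mn', F, A, B) / (N * sigma)   # Step 5: ES gradient
    W = W + lr * g                                    # Step 6: ascend fitness

assert loss(W) < 0.1 * loss(np.zeros((m, n)))         # forward passes alone suffice
```

No backward pass is ever taken; the fitness-weighted sum of low-rank perturbations drives the loss down on its own.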

Parallelism Model

GPU Memory Layout (Single H100, 80GB)

┌──────────────────────────────────────────────┐
│  Model Weights (shared):     ~14 GB (7B f16) │
│  ┌────────────────────────────────────────┐  │
│  │  Batch of population members           │  │
│  │  (vmap over thread_id dimension)       │  │
│  │                                        │  │
│  │  Thread 0: seed_0 → (A_0, B_0) → F_0 │  │
│  │  Thread 1: seed_1 → (A_1, B_1) → F_1 │  │
│  │  ...                                   │  │
│  │  Thread K: seed_K → (A_K, B_K) → F_K │  │
│  │                                        │  │
│  │  Each thread: ~0.01 GB overhead        │  │
│  └────────────────────────────────────────┘  │
│                                              │
│  Input data buffer:          ~2 GB           │
│  Activation memory:          ~10 GB          │
│  Remaining / fragmentation:  ~54 GB          │
└──────────────────────────────────────────────┘

Population members fit in the "remaining" budget.
With ~0.01 GB per member overhead:
  54 GB / 0.01 GB ≈ 5,400 members per GPU pass
  With gradient accumulation: millions of members possible

Multi-GPU Scaling

8x H100 (NVLink interconnect)

GPU 0: Members [0, N/8)           ──┐
GPU 1: Members [N/8, 2N/8)        ──┤
GPU 2: Members [2N/8, 3N/8)       ──┤
GPU 3: Members [3N/8, 4N/8)       ──┼──► AllReduce(fitness-weighted
GPU 4: Members [4N/8, 5N/8)       ──┤    perturbation sums)
GPU 5: Members [5N/8, 6N/8)       ──┤    → single gradient update
GPU 6: Members [6N/8, 7N/8)       ──┤
GPU 7: Members [7N/8, N)          ──┘

Communication: Only scalar fitness values + gradient updates
(NOT perturbation matrices — these are recomputable from seeds)
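The aggregation can be simulated on one machine. In this hedged NumPy sketch, eight "devices" each regenerate their members' noise from the shared base seed and contribute only a local (m, n) partial sum, mimicking the AllReduce; the seeding scheme and sizes are illustrative:

```python
import numpy as np

m, n, N, n_devices = 6, 5, 64, 8

def noise(base_seed, i):
    # Recomputable anywhere from (base_seed, member_id): nothing to communicate
    rng = np.random.default_rng([base_seed, i])
    return rng.standard_normal((m, 1)), rng.standard_normal((n, 1))

F = np.random.default_rng(1).standard_normal(N)     # stand-in fitness values

def weighted_term(i):
    A, B = noise(0, i)
    return F[i] * (A @ B.T)

# Global reference: fitness-weighted sum over the whole population
reference = sum(weighted_term(i) for i in range(N))

# Sharded: each device sums its slice locally, then the partials are added
per_device = N // n_devices
partials = [
    sum(weighted_term(i) for i in range(d * per_device, (d + 1) * per_device))
    for d in range(n_devices)
]
assert np.allclose(sum(partials), reference)        # AllReduce result matches
```

The payload per device is a single (m, n) matrix per layer, independent of the population size.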

10 Component Breakdown

10.1 Random Number Generation (RNG) System

The perturbation generation uses JAX's splittable PRNG system:

import jax

def generate_perturbation(base_key, thread_id, shape, rank=1):
    """Generate a rank-r perturbation from a single base key + thread ID."""
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    params = jax.random.normal(key, (m + n, rank))
    B = params[:n]   # (n, r) right factor
    A = params[n:]   # (m, r) left factor
    return A, B

Properties: - Deterministic: Same (base_key, thread_id) always produces the same perturbation - No storage: Perturbations are recomputed on-the-fly, not stored - Parallelizable: fold_in is O(1) and independent across threads - Communication-free: Each GPU can reconstruct any member's perturbation from the shared base key
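The same pattern can be mimicked in NumPy (jax.random.fold_in is the real mechanism; this portable analogue just seeds a counter-based generator with the (base_seed, thread_id) pair, which is an assumption of this sketch, not the paper's code):

```python
import numpy as np

def member_perturbation(base_seed, thread_id, shape, rank=1):
    # Each member's noise is a pure function of (base_seed, thread_id):
    # regenerable on any device, never stored or communicated
    rng = np.random.default_rng([base_seed, thread_id])
    m, n = shape
    return rng.standard_normal((m, rank)), rng.standard_normal((n, rank))

A1, B1 = member_perturbation(42, 7, (8, 4))
A2, B2 = member_perturbation(42, 7, (8, 4))   # regenerated, e.g. on another GPU
A3, B3 = member_perturbation(42, 8, (8, 4))   # a different population member

assert np.array_equal(A1, A2) and np.array_equal(B1, B2)   # deterministic
assert not np.array_equal(A1, A3)                          # distinct per member
```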

10.2 EGGROLL Forward Pass

The core computation replaces the standard linear layer:

def eggroll_linear(base_key, sigma, W, x, thread_id, rank=1):
    """EGGROLL-perturbed linear layer."""
    key = jax.random.fold_in(base_key, thread_id)
    m, n = W.shape

    perturbation_params = jax.random.normal(key, (m + n, rank))
    B = perturbation_params[:n]   # n x r
    A = perturbation_params[n:]   # m x r

    # Standard matmul (shared across population, computed once)
    y_base = x @ W.T

    # Per-member perturbation (batched, very fast)
    y_perturb = sigma * (x @ B) @ A.T

    return y_base + y_perturb


# Vectorize over population dimension
batched_eggroll = jax.vmap(eggroll_linear, in_axes=(None, None, None, 0, 0))

10.3 Fitness Evaluation Pipeline

Input Sequences → Model Forward Pass → Output Tokens → Fitness Score
      │                 │                    │               │
      │          EGGROLL perturbations       │               │
      │          (per population member)     │               │
      │                                      │               │
      └── Shared across members ─────────────┘               │
                                                             │
                                              ┌──────────────┘
                                              │
                                    Task-specific:
                                    • Countdown: exact match
                                    • GSM8K: numerical answer
                                    • LM: cross-entropy loss
                                    • RL: cumulative reward

10.4 Gradient Estimation

The ES gradient estimate in EGGROLL:

def eggroll_gradient(fitnesses, base_key, sigma, W_shape, rank=1):
    """Compute the EGGROLL gradient estimate (reference loop version)."""
    N = len(fitnesses)
    m, n = W_shape

    gradient = jnp.zeros((m, n))

    for i in range(N):
        # Perturbations are regenerated from (base_key, i), never stored
        A_i, B_i = generate_perturbation(base_key, i, (m, n), rank)
        gradient += fitnesses[i] * (A_i @ B_i.T)

    return gradient / (N * sigma)

In practice, the gradient computation is fused into the parameter update and vectorized:

# Fused update (sketch of the actual implementation)
def update_params(params, fitnesses, base_key, sigma, lr, rank=1):
    """Fuse gradient computation and parameter update."""
    centered_fitnesses = fitnesses - fitnesses.mean()

    new_params = {}
    for layer_name, W in params.items():
        # Reconstruct all perturbations and compute their fitness-weighted sum
        # (vectorized in the real code, not an explicit loop)
        delta_W = vmap_weighted_perturbation_sum(
            centered_fitnesses, base_key, layer_name, W.shape, rank
        )
        # Ascend the fitness estimate: W ← W + α·g. JAX arrays are immutable,
        # so build a new parameter dict rather than updating in place.
        new_params[layer_name] = W + lr * delta_W / (len(fitnesses) * sigma)
    return new_params

10.5 EGG Architecture (Evolved Generative GRU)

The EGG model is a custom architecture designed to demonstrate EGGROLL's unique capabilities:

EGG Architecture (D=256, L=6)

Input tokens (int8 indices)
      │
      ▼
┌───────────────┐
│  Embedding     │  int8 lookup table
│  (256-dim)     │
└───────┬───────┘
        │ (int8)
┌───────▼───────┐
│  minGRU Block  │  No tanh, no sigmoid
│  (modified)    │  int8 matmul + int32 accumulate + int8 cast
│  × 6 layers    │  Implicit nonlinearity from int8 overflow
│                │
│  MLP Block     │  No activation functions
│  (no act fn)   │  int8 matmul only
└───────┬───────┘
        │ (int8)
┌───────▼───────┐
│  Output head   │  int8 → int32 logits
│  + softmax     │  Softmax via lookup table (no float)
└───────┬───────┘
        │
      Loss (bits/byte)

Key design decisions: - All int8 weights: Fastest datatype on H100 Tensor Cores - No activation functions: The int8→int32→int8 cast chain introduces implicit nonlinearity - No optimizer state: EGGROLL has no momentum, no Adam states — just parameter updates - Lookup-table softmax: Even the loss computation avoids floating point

10.6 RWKV-7 Integration

For LLM fine-tuning experiments, EGGROLL wraps the existing RWKV-7 model:

RWKV-7 "Goose" Architecture
      │
      ▼
┌────────────────────┐
│  Token embedding    │
└────────┬───────────┘
         │
┌────────▼───────────┐
│  RWKV-7 Block × N  │
│  ┌───────────────┐ │
│  │  Time mixing   │ │  ← EGGROLL perturbs these weights
│  │  (linear attn) │ │
│  └───────────────┘ │
│  ┌───────────────┐ │
│  │  Channel mix   │ │  ← EGGROLL perturbs these weights
│  │  (FFN variant) │ │
│  └───────────────┘ │
└────────┬───────────┘
         │
┌────────▼───────────┐
│  Language model head│
└────────┬───────────┘
         │
      Generate response → Evaluate fitness

11 Core Mechanisms (Detailed)

11.1 Low-Rank Perturbation Theory

The mathematical foundation of EGGROLL rests on replacing the standard ES gradient estimator with a structured variant.

Standard ES gradient:

∇_θ E[F(θ + σε)] = (1/σ) E[F(θ + σε) · ε],   ε ~ N(0, I_d)

EGGROLL gradient (for matrix parameter W ∈ R^{m×n}):

∇_W E[F(W + σ·A·B^T)] = (1/σ) E[F(W + σ·A·B^T) · A·B^T],   A ∈ R^{m×r}, B ∈ R^{n×r} with i.i.d. N(0,1) entries

Consistency Theorem (informal): As the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate at rate O(1/r), where r is the perturbation rank.

Linearizing Effect: The paper proves that in high dimensions, the objective function locally linearizes around the current parameters, meaning rank-1 perturbations capture the dominant gradient direction. This is analogous to how random projections preserve distances in high dimensions (Johnson-Lindenstrauss lemma).
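The consistency claim can be probed empirically. In this toy NumPy check the fitness is exactly linear, F(W) = ⟨G, W⟩, so the rank-1 estimator is unbiased for G and its alignment with G should improve with population size (sizes and σ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 10, 8, 0.01
G = rng.standard_normal((m, n))               # the true gradient direction

def estimate(N):
    g = np.zeros((m, n))
    for _ in range(N):
        A = rng.standard_normal((m, 1))
        B = rng.standard_normal((n, 1))
        Z = A @ B.T                           # rank-1 probe direction
        g += sigma * np.sum(G * Z) * Z        # F(sigma * Z) = sigma * <G, Z>
    return g / (N * sigma)

def cosine(g):
    return np.sum(g * G) / (np.linalg.norm(g) * np.linalg.norm(G))

c_small, c_big = cosine(estimate(100)), cosine(estimate(20000))
assert c_big > c_small                        # alignment improves with N
```

Even with rank-1 probes, a large enough population recovers the gradient direction to high accuracy.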

11.2 Arithmetic Intensity Analysis

The key to EGGROLL's speedup is arithmetic intensity — the ratio of compute (FLOPS) to memory bandwidth:

Operation                     FLOPS          Memory Bytes     Arithmetic Intensity
─────────────────────────────────────────────────────────────────────────────────
Standard matmul (m×k × k×n):  2mkn           2(mk + kn + mn)  mkn / (mk+kn+mn)
                                                                ≈ k (for m=n=k)

Batched matmul (N × m×k × k×n): 2Nmkn        2N(mk + kn + mn) Same per-element
                               But each member reads its own m×k weight matrix,
                               so weight traffic grows with N → memory-bound

EGGROLL decomposition:
  x @ W^T (shared):           2mkn           2(mk + kn + mn)  ≈ k (high)
  x @ B (batched, B ∈ R^{n×r}): 2Nkr        2N(kr + nr + kr) ≈ min(k,r) (ok)
  (xB) @ A^T (batched):       2Nmr           2N(mr + mr + mr) ≈ r/3 (fast for r=1)

For typical transformer hidden dimensions (k = 4096) and rank r = 1: - Standard (shared) matmul: intensity ≈ 4096 (excellent) - EGGROLL perturbation terms: low intensity in isolation, but they involve only O((m+n)·r) work per member instead of O(m·n), so they account for a negligible share of total runtime and the layer stays dominated by the high-intensity shared matmul.

11.3 Population Scaling Dynamics

EGGROLL enables population sizes 3 orders of magnitude larger than prior ES work:

Population Size Regimes

OpenAI ES (2017):      N ≈ 1,000      │ Full-rank perturbations
                                       │ Small NNs (MuJoCo)
                                       │
ES-LLM (2025):        N ≈ 10          │ Small population, many rollouts
                                       │ per member to reduce variance
                                       │
EGGROLL:               N ≈ 1,000,000  │ Rank-1 perturbations
                                       │ Billion-parameter models
                                       │ High throughput

Theoretical implication:
- Gradient estimate variance ∝ 1/N
- At N = 10^6, variance is 1000x lower than N = 10^3
- This enables stable updates with larger learning rates
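The 1/N variance scaling is straightforward to confirm empirically. A hedged NumPy check on the rank-1 estimator with a fixed linear fitness (toy sizes; trial counts chosen only to keep the run fast):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 6, 4, 0.01
G = rng.standard_normal((m, n))                    # fixed linear fitness F(W) = <G, W>

def one_estimate(N):
    A = rng.standard_normal((N, m, 1))
    B = rng.standard_normal((N, n, 1))
    Z = np.einsum('imr,inr->imn', A, B)            # rank-1 perturbations
    F = sigma * np.einsum('mn,imn->i', G, Z)       # fitness at the perturbed points
    return np.einsum('i,imn->mn', F, Z) / (N * sigma)

def variance(N, trials=200):
    ests = np.stack([one_estimate(N) for _ in range(trials)])
    return ests.var(axis=0).sum()                  # total variance across entries

v_small_pop, v_large_pop = variance(100), variance(1000)
assert v_large_pop < v_small_pop / 5               # roughly 10x lower at 10x N
```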

11.4 Noise-Reuse for Sequence Processing

For language modeling, EGGROLL incorporates Noise-Reuse ES (Vicol et al., 2023):

Standard ES on sequences:
  Each token position: new perturbation → O(T) perturbation samples
  Memory: O(T × d) per population member

Noise-Reuse ES:
  Reuse the same perturbation across multiple token positions
  Take multiple parameter updates within a single sequence
  Memory: O(d) per population member (independent of T)

Timeline for one sequence (T=100 tokens, update every K=50 tokens):
┌──────┬──────┬──────┬──────┬──────┐
│tok 1 │tok 25│tok 50│tok 75│tok100│
│ ε_1  │ ε_1  │ ε_2  │ ε_2  │ done │
└──────┴──────┴──────┴──────┴──────┘
Perturbation ε_1 is reused for tokens 1-50, then a parameter
update is taken and ε_2 is used for tokens 51-100.

11.5 Integer Arithmetic as Nonlinearity

The most conceptually striking mechanism in the paper is the use of integer overflow as a source of nonlinearity:

Float32 behavior:                    Int8 behavior:
  3.0 × 50.0 = 150.0 (linear)        3 × 50 = 150 → 127 (saturated!)
  3.0 × 100.0 = 300.0 (linear)       3 × 100 = 300 → 44 (overflow!)

Int8 multiplication chain:
  Input (int8) → Matmul → Accumulate (int32) → Cast (int8)
                                                  ↑
                                          Implicit nonlinearity!
                                          Values > 127 wrap/saturate
                                          Creates sigmoid-like behavior

This is inspired by prior OpenAI work showing that "nonlinear computation in deep linear networks" can emerge from floating-point rounding. EGGROLL takes this further: int8's extreme quantization makes the nonlinearity pronounced enough to train useful models.
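The non-linearity of the int8 cast is easy to demonstrate. A NumPy sketch (NumPy's astype wraps in two's complement, matching the "300 → 44" example; actual Tensor Core pipelines may saturate instead, as the paper notes):

```python
import numpy as np

# The document's example: 3 * 100 = 300 does not fit in int8 and wraps to 44
assert np.array(300).astype(np.int8) == 44

# A "linear" int8 layer f(v) = int8(50 * v) is not linear over the integers,
# because the cast back to int8 discards high bits.
x = np.arange(-4, 5, dtype=np.int32)
f = lambda v: (50 * v).astype(np.int8)

# Pure int32 arithmetic is linear: scaling the input scales the output
assert np.array_equal(50 * (2 * x), 2 * (50 * x))
# After the int8 cast, f(2x) != 2*f(x): the cast chain is a nonlinearity
assert not np.array_equal(f(2 * x).astype(np.int32), 2 * f(x).astype(np.int32))
```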

11.6 Update Rule for EGG (Integer)

The EGG model uses a specialized update rule for integer parameters:

import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    """Integer-compatible parameter update."""
    # Threshold: only update where the gradient estimate is large enough
    update = jnp.where(
        jnp.abs(gradient_estimate) > threshold,
        jnp.sign(gradient_estimate),   # Step by ±1 in int8 space
        0,
    ).astype(jnp.int32)
    # Accumulate in int32 so W = ±127 cannot wrap before clipping
    return jnp.clip(W_int8.astype(jnp.int32) + update, -128, 127).astype(jnp.int8)

Properties:
- No learning rate (step size is always ±1 in int8 space)
- No momentum or optimizer state
- Threshold prevents noise from dominating updates
- jnp.clip ensures the int8 range
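A quick behavioral check of the rule (a NumPy analogue of the jnp sketch above; values chosen to hit all three cases):

```python
import numpy as np

def egg_update_np(W_int8, grad_est, threshold=1):
    """NumPy analogue of egg_update: thresholded ±1 steps, clipped to int8."""
    step = np.where(np.abs(grad_est) > threshold, np.sign(grad_est), 0.0)
    return np.clip(W_int8.astype(np.int32) + step.astype(np.int32),
                   -128, 127).astype(np.int8)

W = np.array([100, -128, 5], dtype=np.int8)
g = np.array([2.5, -3.0, 0.4])  # large +, large - (at the clip edge), sub-threshold
out = egg_update_np(W, g)       # steps: +1, clipped at -128, unchanged
print(out.tolist())             # [101, -128, 5]
```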

12 Programming Language

Implementation Stack

| Component | Language | Framework | Why |
|---|---|---|---|
| EGGROLL core | Python | JAX | vmap for population parallelism, XLA compilation |
| EGG model | Python | JAX | int8 tensor operations via jnp |
| RWKV-7 model | Python | JAX (jaxrwkv) | Custom RWKV port for JAX |
| Throughput benchmarks | Python | JAX + CUDA | Hardware-level profiling |
| Nano-egg | Python | JAX | Single-file implementation |

Why JAX?

JAX is the enabling technology for EGGROLL. Several JAX features are critical:

  1. jax.vmap: Automatically vectorizes the forward pass over the population dimension. Without vmap, implementing population parallelism would require explicit batching code.
# Without vmap: explicit loop (slow)
for i in range(N):
    y_i = forward(params, x[i], perturbation[i])

# With vmap: automatic vectorization (fast)
batched_forward = jax.vmap(forward, in_axes=(None, 0, 0))
y = batched_forward(params, x, perturbations)
  2. jax.random.fold_in: Deterministic PRNG that allows per-member perturbation generation without communication:
# Each member gets a unique but deterministic perturbation
key_i = jax.random.fold_in(base_key, member_id)
perturbation_i = jax.random.normal(key_i, param.shape)
  3. XLA compilation: JAX's just-in-time XLA compilation fuses the EGGROLL operations into efficient GPU kernels, avoiding Python overhead.

  4. Integer support: JAX supports int8 tensor operations via jnp.int8, enabling the EGG experiments.

  5. Multi-device: JAX's pmap enables multi-GPU parallelism with minimal code changes.

Why Not PyTorch?

The paper's authors chose JAX specifically because:
- PyTorch's vmap (functorch) is less mature than JAX's
- PyTorch's PRNG system doesn't have an equivalent to fold_in for deterministic per-member perturbations
- JAX's XLA compilation provides better kernel fusion for the EGGROLL computation pattern
- PyTorch's int8 support is primarily for inference quantization, not training

Code Organization

HyperscaleES/
├── eggroll/
│   ├── core.py              # EGGROLL algorithm implementation
│   ├── perturbation.py      # Low-rank perturbation generation
│   ├── update.py            # Parameter update rules
│   └── utils.py             # Fitness normalization, logging
├── models/
│   ├── egg.py               # EGG (int8 GRU) architecture
│   ├── rwkv.py              # RWKV-7 wrapper for EGGROLL
│   └── mlp.py               # Simple MLP for RL experiments
├── tasks/
│   ├── countdown.py         # Countdown reasoning task
│   ├── gsm8k.py             # GSM8K evaluation
│   ├── language_model.py    # Character-level LM (MiniPile)
│   └── rl_envs.py           # RL environment wrappers
├── configs/
│   ├── egg_minipile.yaml    # EGG pretraining config
│   ├── rwkv_countdown.yaml  # RWKV countdown config
│   └── rwkv_gsm8k.yaml      # RWKV GSM8K config
└── scripts/
    ├── train.py             # Main training script
    ├── benchmark.py         # Throughput benchmarking
    └── evaluate.py          # Model evaluation

13 Memory Management

Perturbation Memory: The Key Innovation

The most important memory optimization in EGGROLL is that perturbations are never stored — they are regenerated from RNG seeds:

Naive ES memory:
  N population members × d parameters × 4 bytes (float32)
  = N × d × 4 bytes

  For N = 10^6, d = 7×10^9 (7B model):
  = 2.8 × 10^16 bytes = 28 PB  ← Obviously impossible!

EGGROLL memory:
  N population members × 1 seed (8 bytes)
  + shared model parameters (d × 2 bytes for float16)

  For N = 10^6, d = 7×10^9:
  = 8 MB (seeds) + 14 GB (model) = 14.008 GB  ← Fits on one GPU!
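A back-of-envelope check of these figures in pure Python (N = 10^6 members, d = 7×10^9 parameters):

```python
# Naive ES: one float32 perturbation per population member
N, d = 10**6, 7 * 10**9
naive_bytes = N * d * 4
print(naive_bytes / 1e15, "PB")   # 28.0 PB of perturbations alone

# EGGROLL: one 8-byte seed per member plus one shared float16 model
eggroll_bytes = N * 8 + d * 2
print(eggroll_bytes / 1e9, "GB")  # 14.008 GB total
```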

Gradient Accumulation Memory

The gradient is accumulated online without storing all perturbations:

# Memory-efficient gradient accumulation
# (sketch: W, N, sigma, keys, generate_perturbation, evaluate as in the text)
def chunks(seq, size):
    """Yield successive index chunks of at most `size` members."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

gradient = jnp.zeros_like(W)
for batch_of_members in chunks(range(N), chunk_size):
    # Regenerate this chunk's perturbations on-the-fly from their seeds
    perturbations = vmap(generate_perturbation)(keys[jnp.array(batch_of_members)])
    # Evaluate the fitness of each perturbed model
    fitnesses = vmap(evaluate)(perturbations)
    # Accumulate the fitness-weighted sum of perturbations
    gradient += jnp.einsum('i,ijk->jk', fitnesses, perturbations)
gradient /= N * sigma  # standard ES normalization of the estimate

Peak memory: model_size + chunk_size × perturbation_overhead

Memory Budget Breakdown (7B RWKV on 8× H100)

| Component | Per-GPU Memory | Total (8 GPUs) |
|---|---|---|
| Model weights (float16, replicated) | 14 GB | 14 GB × 8 |
| Input data buffer | 2 GB | 2 GB × 8 |
| Activation memory (per chunk) | 8 GB | 8 GB × 8 |
| Perturbation overhead (per chunk) | 0.5 GB | 0.5 GB × 8 |
| Gradient accumulator | 14 GB | 14 GB × 8 |
| Framework overhead | 2 GB | 2 GB × 8 |
| Total | ~40.5 GB | |
| Available (H100 80GB) | 80 GB | |
| Headroom | ~39.5 GB | |

Memory Comparison: EGGROLL vs. Backprop (Adam)

| Component | Backprop + Adam | EGGROLL |
|---|---|---|
| Model weights | 14 GB (7B params × 2 bytes) | 14 GB (same) |
| Gradients | 14 GB | 0 (computed online) |
| Adam momentum (m) | 14 GB | 0 (no optimizer state) |
| Adam variance (v) | 14 GB | 0 (no optimizer state) |
| Activations (for backward) | 20-40 GB | 0 (no backward pass) |
| Perturbation seeds | 0 | 8 MB |
| Gradient accumulator | 0 | 14 GB |
| Total | 62-82 GB | ~28 GB |

EGGROLL uses roughly a third to a half of the memory of Adam-based backprop training for the same model (~28 GB vs. 62-82 GB), primarily because it eliminates optimizer state and activation storage for the backward pass.

14 Continued Learning

Within-Run Learning

EGGROLL's learning dynamics within a single training run:

  1. Population-based gradient estimation: Each step produces a gradient estimate from N fitness evaluations. Unlike backprop (deterministic gradient), ES gradient has variance ∝ 1/N.

  2. No momentum (EGG variant): The EGG model uses no optimizer state — each update is independent. This means there's no "memory" of past gradients.

  3. Standard optimizers (RWKV variant): For LLM fine-tuning, EGGROLL can be combined with standard optimizers (SGD with momentum, Adam) applied to the ES gradient estimate.
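A toy NumPy sketch of point 3 (hypothetical dimensions and hyperparameters; full-rank perturbations for brevity): the ES gradient estimate simply takes the place of the backprop gradient inside a standard momentum update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, sigma = 8, 64, 0.1            # params, population size, perturbation scale
lr, beta = 1e-4, 0.9                # SGD-with-momentum hyperparameters

theta = np.zeros(d)
velocity = np.zeros(d)
target = np.ones(d)                 # toy fitness: negative squared distance

for step in range(600):
    eps = rng.normal(size=(N, d))
    fitness = -np.sum((theta + sigma * eps - target) ** 2, axis=1)
    fitness = (fitness - fitness.mean()) / (fitness.std() + 1e-8)  # fitness shaping
    grad_est = (fitness @ eps) / (N * sigma)  # ES estimate (ascent direction)
    velocity = beta * velocity + grad_est     # momentum applied to the ES estimate
    theta += lr * velocity

dist = float(np.sum((theta - target) ** 2))   # shrinks from 8.0 toward ~0
```

The optimizer never sees a backward pass; it only consumes the fitness-weighted perturbation average.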

Cross-Task Transfer

EGGROLL itself is a training algorithm, not a model. Transfer learning occurs at the model level:

  • Pre-trained RWKV-7 weights serve as the initialization for fine-tuning
  • EGGROLL fine-tuning preserves the base model's capabilities while adapting to new tasks
  • The same EGGROLL implementation works across tasks (countdown, GSM8K, RL) without modification

Meta-Learning Potential

EGGROLL opens several meta-learning possibilities not yet explored in the paper:

| Direction | Description | Status |
|---|---|---|
| Learned rank selection | Adapt rank r during training based on gradient quality | Not implemented |
| Adaptive σ | Learn the perturbation scale per-layer | Standard ES technique, applicable |
| Population scheduling | Vary N over training (large early, small late) | Not explored |
| Multi-fidelity ES | Use cheap fitness approximations early, expensive later | Not explored |
| Fitness shaping | Learn the fitness transformation function | Standard ES technique, applicable |

Relationship to LoRA and Continual Fine-Tuning

EGGROLL has an interesting relationship to LoRA-based continual learning:

LoRA fine-tuning:
  W' = W + B·A^T (always low-rank, r typically 16-64)
  Multiple tasks: merge adapters or keep separate

EGGROLL fine-tuning:
  W' = W + Σ(f_i · B_i · A_i^T) (full-rank after aggregation)
  Multiple tasks: standard multi-task fitness function

Key difference: EGGROLL's updates are full-rank, so there's no
"adapter" to merge — the model weights are directly modified.
This is more like full fine-tuning in terms of expressiveness,
but achieved through inference-speed forward passes only.
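The full-rank point is easy to verify numerically. A toy NumPy check (hypothetical sizes; f_i, B_i, A_i as in the aggregation above) sums N rank-1 perturbations and measures the rank of the result:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, N = 16, 1, 32                  # matrix dim, perturbation rank, population size

update = np.zeros((d, d))
for _ in range(N):
    B = rng.normal(size=(d, r))      # low-rank factors, one pair per member
    A = rng.normal(size=(d, r))
    f = rng.normal()                 # stand-in fitness weight
    update += f * B @ A.T            # aggregate: Σ f_i · B_i · A_i^T

print(np.linalg.matrix_rank(update))  # full rank d = 16, despite rank-1 members
```

With N ≥ d/r random members, the aggregated update is full-rank with probability 1, which is why no LoRA-style adapter merge step exists.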

Self-Play and Multi-Agent Extensions

The authors mention ongoing work on multi-agent optimization:

"We think that EGGROLL has strong potential to directly optimize LLMs with multi-agent awareness, breaking the best-of-k curse of RL."

This connects to their prior work on Social Deduction LLMs (Among Us) and suggests future applications in cooperative/competitive multi-agent training where end-to-end optimization through agent interactions is desirable but backprop through the interaction is intractable.

15 Applications

15.1 LLM Post-Training for Reasoning

The most immediately practical application: replacing or supplementing GRPO/RLHF for LLM reasoning training.

| Advantage | Description |
|---|---|
| No reward model needed | Fitness is computed directly from task output |
| No KL penalty tuning | ES naturally stays near initialization (small σ) |
| Non-differentiable rewards | Can optimize for exact match, code execution, tool use |
| Chain-of-thought optimization | Fitness includes reasoning chain quality |
| Multi-turn optimization | Natural extension to dialogue/agent settings |

Target tasks:
- Mathematical reasoning (GSM8K, MATH, Olympiad problems)
- Code generation (HumanEval, MBPP, SWE-bench)
- Tool use optimization
- Multi-step agent workflows

15.2 Novel Architecture Exploration

EGGROLL enables training architectures that are impractical with backprop:

| Architecture Type | Why Backprop Fails | EGGROLL Solution |
|---|---|---|
| Pure integer NNs | No float gradients | Forward-pass only, int8 operations |
| Lookup-table layers | Non-differentiable | Black-box optimization |
| Discrete attention | Argmax not differentiable | Fitness-based selection |
| Spiking neural networks | Non-differentiable spikes | Direct fitness optimization |
| Neuromorphic architectures | Hardware-specific ops | Hardware-in-the-loop training |
| State machines + NNs | Discrete transitions | End-to-end fitness |

15.3 Neurosymbolic System Optimization

The paper highlights neurosymbolic optimization as a key future direction:

"We are particularly interested in the end-to-end optimization of neurosymbolic systems, since EGGROLL naturally handles nondifferentiable components within a model."

Neurosymbolic System (optimized end-to-end by EGGROLL):

Input → [Neural Encoder] → [Symbolic Reasoner] → [Neural Decoder] → Output
              ↑                    ↑                     ↑
         Differentiable      Non-differentiable      Differentiable
              ↑                    ↑                     ↑
         ├──── EGGROLL optimizes ALL components simultaneously ────┤

Specific targets mentioned:
- ROSA architecture for RWKV-8 (discrete memory system)
- LLMs with external tool calls (non-differentiable tool invocations)
- Code-writing agents (discrete code generation + execution)

15.4 Discrete Diffusion Model Training

The paper mentions discrete diffusion models as a target:

"We would also like to enable us to try EGGROLL on other LLMs, including Discrete Diffusion models for which the standard policy gradient theorem is technically intractable (due to the mask-based sampling procedure)."

Discrete diffusion models (MDLM, SEDD, etc.) generate text by iteratively demasking tokens. The masking/demasking procedure is non-differentiable, making standard policy gradients inapplicable. EGGROLL's black-box nature makes it directly applicable.

15.5 Hardware-Accelerated Integer Training

The EGG experiments suggest a new paradigm for hardware-efficient training:

Current Training Pipeline:
  Model (float16/bfloat16) → Forward (float16) → Backward (float16) → Adam (float32)
  GPU utilization: ~30-50% of peak FLOPS

EGGROLL Integer Pipeline:
  Model (int8) → Forward (int8 matmul, int32 accum) → Fitness → Update (int8)
  GPU utilization: ~80-91% of peak FLOPS (inference-speed)

H100 Peak FLOPS by Datatype:
  float32:    67 TFLOPS
  float16:    989 TFLOPS
  bfloat16:   989 TFLOPS
  int8:       1,979 TOPS     ← 2x more than float16!

EGGROLL with int8 can theoretically achieve 2x the throughput
of float16 training, on top of the ~100x ES speedup.

15.6 Reinforcement Learning

While demonstrated only on standard benchmarks in this paper, EGGROLL is naturally suited to RL:

| RL Setting | EGGROLL Advantage |
|---|---|
| Sparse rewards | No gradient through sparse signal needed |
| Multi-agent | End-to-end team optimization |
| Real-world robotics | No simulator differentiability required |
| Safety-constrained | Fitness includes safety penalties directly |
| Sim-to-real transfer | Optimize directly in the target environment |

15.7 Scientific Discovery

EGGROLL could be applied to optimize scientific models where the evaluation is non-differentiable:

| Domain | Model | Fitness Function |
|---|---|---|
| Drug discovery | Molecular generator NN | Binding affinity (docking score) |
| Materials science | Crystal structure predictor | DFT energy (non-differentiable) |
| Protein design | Structure prediction NN | Experimental stability |
| Climate modeling | Neural weather model | Forecast accuracy |

15.8 Limitations and Boundary Conditions

| Limitation | Impact | Mitigation |
|---|---|---|
| Sample efficiency | ES needs many evaluations | Large population + high throughput |
| Gradient quality at rank-1 | May miss important directions | Increase rank r if needed |
| Exploration vs. exploitation | Fixed σ limits exploration | Adaptive σ scheduling |
| No second-order information | Cannot exploit curvature | Larger population compensates |
| Hardware requirements | H100 for headline results | Scales down to A100, less dramatically |
| JAX ecosystem | Smaller than PyTorch | PyTorch port underway |
| Transformer support | KV-cache memory issue | vLLM/Megatron port in progress |
15.9 Training Paradigm Comparison

| Paradigm | Gradient | Memory | Throughput | Generality |
|---|---|---|---|---|
| Backpropagation | Exact | High (activations) | ~30% of inference | Differentiable only |
| REINFORCE | Noisy | Low | ~50% of inference | Any reward |
| PPO/GRPO | Noisy + baseline | Medium | ~30% of inference | Any reward |
| Standard ES | Noisy | Very high (perturbations) | ~1% of inference | Any fitness |
| EGGROLL | Noisy (low-rank) | Low | ~91% of inference | Any fitness |
| Zeroth-order methods | Finite differences | Low | ~50% of inference | Any function |

15.10 The EGGROLL Vision: Inference IS Training

The paper's most provocative claim is that EGGROLL fundamentally changes the relationship between inference and training:

Traditional view:
  Training ≠ Inference
  Training: forward + backward + optimizer = expensive
  Inference: forward only = cheap
  Cost ratio: Training / Inference ≈ 3-5x

EGGROLL view:
  Training ≈ Inference
  Training: forward + perturbation + fitness = almost as fast as inference
  Inference: forward only
  Cost ratio: Training / Inference ≈ 1.1x

Implication: Any system that can run batched inference can also TRAIN,
with only 10% overhead. This means:
  - Edge devices can self-improve
  - Inference servers can simultaneously fine-tune
  - Any fitness function becomes a training signal
  - The distinction between deployment and training dissolves

This vision, if realized at larger scale and across more architectures, would represent a fundamental shift in how ML systems are deployed and improved.


This analysis is based on the arXiv paper (2511.16652v2), the project website (eshyperscale.github.io), the open-source code repositories (HyperscaleES, nano-egg, jaxrwkv), and supplementary information from the AlphaXiv discussion page. EGGROLL represents a significant advance in making evolution strategies practical for billion-parameter model training, with implications spanning LLM post-training, novel architecture exploration, and the fundamental relationship between inference and learning.