EGGROLL — Evolution Strategies at the Hyperscale
Low-Rank Evolution Strategies Achieving 100x Speedup for Billion-Parameter Model Training
Organization: University of Oxford, MILA, NVIDIA
Published: November 20, 2025
Type: Paper (arXiv:2511.16652) + Blog + Code
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Evolution Strategies at the Hyperscale
Algorithm Name: EGGROLL — Evolution Guided GeneRal Optimisation via Low-rank Learning
ArXiv: 2511.16652 (cs.LG / cs.AI)
Project Page: eshyperscale.github.io
Code Repositories:
- Main library: ESHyperscale/HyperscaleES (JAX-based)
- Single-file EGG training: ESHyperscale/nano-egg
- RWKV-7 in JAX: bsarkar321/jaxrwkv
AlphaXiv Discussion: alphaxiv.org/abs/2511.16652
Submission History:
- v1: November 20, 2025
- v2: February 16, 2026 (revised with extended experiments)
First Public Commit: August 13, 2025 — jaxrwkv commit 6d92566 (early EGGROLL prototype)
Lineage: Builds on OpenAI's Evolution Strategies (Salimans et al., 2017), Noise-Reuse ES (Vicol et al., 2023), and structural insights from LoRA (Hu et al., 2022). Concurrent with and compared against ES-LLM (arXiv:2509.24372).
2 Authors and Team
Core Authors
| Author | Affiliation | Role | Marker |
|---|---|---|---|
| Bidipta Sarkar | Oxford / Stanford | Co-lead, algorithm design, RWKV integration | * (equal) |
| Mattie Fellows | Oxford | Co-lead, theoretical analysis | * (equal) |
| Juan Agustin Duque | MILA | Co-lead, implementation | * (equal) |
| Shimon Whiteson | Oxford / Waymo | Senior advisor | * (equal) |
| Jakob Nicolaus Foerster | Oxford | Senior advisor, FORL group lead | * (equal) |
| Aaron Courville | MILA | Senior advisor | |
| Karin Sevegnani | NVIDIA | Industry advisor | |
| Alexander David Goldie | Oxford | Contributing author | |
Contributing Authors (†)
| Author | Contribution Area |
|---|---|
| Alistair Letcher | Theoretical convergence analysis |
| Antonio León Villares | Implementation, benchmarking |
| Anya Sims | EGG architecture design |
| Clarisse Wibault | Experiments |
| Dmitry Samsonov | GPU kernel optimization |
| Dylan Cope | RL experiments |
| Jarek Liesen | Infrastructure |
| Kang Li | Throughput benchmarking |
| Lukas Seier | Integer arithmetic experiments |
| Theo Wolf | RWKV integration |
| Uljad Berdica | Experiments |
| Valentin Mohl | Theory |
Research Group Context
The work originates primarily from the Foundations of Reinforcement Learning (FORL) group at Oxford, led by Jakob Foerster. The group has a track record in multi-agent RL, zero-shot coordination, and policy gradient methods. EGGROLL represents a deliberate pivot toward gradient-free optimization methods as a complement (and potential alternative) to backpropagation-based training. The collaboration with MILA (Aaron Courville's group) and NVIDIA brings scalability expertise and hardware acceleration knowledge.
Bidipta Sarkar's prior work on Social Deduction LLM (using RWKV for multi-agent Among Us) directly motivated the choice of RWKV as the LLM architecture for EGGROLL experiments.
3 Core Contribution
Key Novelty: EGGROLL eliminates the computational barrier between inference and training for evolution strategies by structuring perturbations as rank-r matrices, achieving a hundredfold speedup over naive ES for billion-parameter models while preserving full-rank parameter updates through population aggregation.
The Fundamental Problem
Standard Evolution Strategies (ES) perturb each parameter independently:
θ_perturbed = θ + σ · ε, where ε ~ N(0, I_d)
For a weight matrix W ∈ R^{m×n}, this requires:
- Storing m × n random numbers per population member
- Computing a batched matrix multiplication: x @ (W + σ·ε)^T
- GPU inefficiency: Batched matmuls with unstructured perturbations have low arithmetic intensity on modern GPUs (many random memory accesses, poor cache utilization)
At scale (billions of parameters, millions of population members), this becomes prohibitively slow — the batched matmuls dominate runtime and achieve far less than peak GPU FLOPS.
EGGROLL's Solution
Replace unstructured perturbations with rank-r structured perturbations:
θ_perturbed = θ + σ · A · B^T, where A ∈ R^{m×r}, B ∈ R^{n×r}, entries sampled i.i.d. Gaussian
This transforms the computation:
NAIVE ES: y = x @ (W + σ·ε)^T → Batched matmul (slow)
EGGROLL: y = x @ W^T + σ · (x @ B) @ A^T → Standard matmul + batched outer product (fast)
The key insight: x @ W^T is a standard (non-batched) matrix multiplication, shared by all population members, while (x @ B) @ A^T is a batched product of thin rank-r factors (at r = 1, a batched outer product), which has much higher arithmetic intensity.
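The algebraic identity behind this decomposition is easy to verify numerically. A minimal NumPy check, with small hypothetical shapes (the m, n, r, and batch values below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch = 8, 6, 2, 4          # hypothetical small shapes
W = rng.normal(size=(m, n))
x = rng.normal(size=(batch, n))
A = rng.normal(size=(m, r))          # left low-rank factor
B = rng.normal(size=(n, r))          # right low-rank factor
sigma = 0.1

# Naive ES: materialize the full perturbation, then do the perturbed matmul
naive = x @ (W + sigma * A @ B.T).T

# EGGROLL: shared matmul plus a cheap low-rank correction
eggroll = x @ W.T + sigma * (x @ B) @ A.T

assert np.allclose(naive, eggroll)
```

Both paths produce identical outputs; only the second avoids ever materializing the m×n perturbation.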
Why This Works (Theoretical Guarantee)
Although individual perturbations are rank-r, the aggregated update across the population is full-rank:
Update = (1/σ) · E[F(θ + σ·AB^T) · AB^T]
Each AB^T is rank-r, but the sum of N rank-r matrices (one per population member)
has rank up to min(N·r, min(m,n)) — full-rank whenever N·r ≥ min(m,n).
The paper proves a consistency theorem: as the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate. The convergence rate is O(1/r), meaning even rank-1 perturbations provide useful gradient information.
Relationship to Prior Work
| System | Year | Approach | Population Size | Scale | Speed |
|---|---|---|---|---|---|
| OpenAI ES | 2017 | Full-rank perturbations | ~1,000 | MuJoCo (small NNs) | Baseline |
| Uber ES | 2018 | Novelty search + ES | ~1,000 | Atari (small NNs) | ~1x baseline |
| ES on LLMs | 2025 | Small population, many rollouts | ~10 | 1-7B LLMs | ~1x (avoids batched matmul) |
| LoRA + ES | 2025 | Low-rank adapters only | ~100 | 1-7B LLMs | Moderate |
| EGGROLL | 2025 | Low-rank perturbations, full-rank updates | ~1,000,000 | 1-7B LLMs | ~100x over naive ES |
Critical Distinction: EGGROLL vs. LoRA + ES
Using ES to optimize LoRA adapters directly restricts the update to low-rank forever. EGGROLL uses low-rank perturbations but accumulates them into full-rank parameter updates, making it strictly more expressive:
LoRA + ES: W' = W + A·B^T (always rank ≤ r)
EGGROLL: W' = W + Σ_i fitness_i · (A_i · B_i^T) (rank up to N·r)
This distinction is critical for pretraining (requires full-rank updates) vs. fine-tuning (where low-rank may suffice).
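This rank argument can be checked empirically. A small NumPy sketch with hypothetical sizes (m = n = 16, N = 64 rank-1 members, random stand-in fitness values):

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 16
r, N = 1, 64                          # rank-1 perturbations, 64 population members

# LoRA + ES: a single rank-r product — the update can never exceed rank r
lora_update = rng.normal(size=(m, r)) @ rng.normal(size=(n, r)).T
assert np.linalg.matrix_rank(lora_update) == r

# EGGROLL: fitness-weighted sum of N rank-r products
update = np.zeros((m, n))
for _ in range(N):
    A = rng.normal(size=(m, r))
    B = rng.normal(size=(n, r))
    fitness = rng.normal()            # stand-in for a real fitness score
    update += fitness * A @ B.T

# With N·r >= min(m, n), the aggregated update is full-rank
assert np.linalg.matrix_rank(update) == min(m, n)
```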
4 Supported Solutions
EGGROLL is a general-purpose optimization algorithm applicable to any differentiable or non-differentiable objective. The paper demonstrates four solution domains:
| Solution Domain | Description | Model | Task |
|---|---|---|---|
| Pure integer pretraining | Train a nonlinear RNN entirely in int8 | EGG (Evolved Generative GRU) | Character-level LM on MiniPile |
| LLM reasoning (fine-tuning) | Post-training for mathematical reasoning | RWKV-7 (1.5B, 7B) | Countdown, GSM8K |
| Tabula rasa RL | Standard RL benchmark training | Small NNs | MuJoCo-like environments |
| Non-differentiable optimization | Systems with discrete/non-differentiable components | Any architecture | Any fitness function |
What EGGROLL Enables That Backprop Cannot
| Capability | Backprop | EGGROLL |
|---|---|---|
| Integer-only training | No (requires float gradients) | Yes (only needs forward pass) |
| Non-differentiable activations | No | Yes |
| Black-box fitness functions | No (needs differentiable loss) | Yes |
| Training without activation functions | Impractical (model stays linear; int8 overflow nonlinearity is non-differentiable) | Yes (demonstrated with EGG) |
| Optimizing discrete components | Requires relaxation (Gumbel-Softmax, etc.) | Direct optimization |
| Hardware-in-the-loop training | No (non-differentiable hardware) | Yes (just needs input → output) |
| Multi-agent end-to-end optimization | Limited (credit assignment) | Natural (fitness = team outcome) |
What EGGROLL Does NOT Target
- Supervised learning at scale — Backprop remains more sample-efficient for standard supervised tasks
- Single-example gradient computation — EGGROLL requires a population, minimum batch size is the population
- Low-latency training — EGGROLL's strength is throughput, not latency per update
- Tasks where backprop works well — No reason to replace backprop for standard differentiable objectives
5 LLM Integration
EGGROLL as an LLM Training Method
Unlike systems that use LLMs as mutation operators (AlphaEvolve, FunSearch), EGGROLL trains LLMs directly via evolution strategies. The LLM is the object being optimized, not the optimizer.
AlphaEvolve: LLM ──generates──► Code mutations ──evaluates──► Fitness
EGGROLL: Random ──perturbs──► LLM weights ──evaluates──► Fitness ──updates──► LLM weights
LLM Architecture: RWKV-7
EGGROLL's primary LLM experiments use RWKV-7 ("Goose"), a linear-attention recurrent model:
| Property | Value | Why Chosen |
|---|---|---|
| Architecture | Linear RNN (RWKV-7) | Constant memory per token during generation |
| Sizes tested | 1.5B, 7B parameters | Demonstrates billion-scale feasibility |
| Base model | Pre-trained RWKV-7 Goose | Reasoning traces already in pretraining data |
| Framework | JAX | Efficient vmap for population parallelism |
| KV cache | Fixed size (unlike Transformers) | No dynamic memory allocation during generation |
Why not Transformers? The growing KV-cache in Transformer architectures creates memory management challenges when running thousands of parallel population members. RWKV's fixed-size state means memory is predictable and constant regardless of sequence length. This is a practical engineering constraint, not an algorithmic limitation — EGGROLL's math works with any architecture.
Fitness Functions for LLM Training
| Task | Fitness Function | Details |
|---|---|---|
| Countdown | Correctness of countdown sequence | Binary reward: correct final answer or not |
| GSM8K | Mathematical answer accuracy | Binary reward: correct numerical answer |
| Pretraining (EGG) | Cross-entropy loss | Bits per byte on MiniPile test set |
| General RL | Environment reward | Task-specific cumulative reward |
Comparison with GRPO
EGGROLL is positioned as a competitor to GRPO (Group Relative Policy Optimization) for LLM reasoning:
| Method | Gradient Type | Requirements | Population |
|---|---|---|---|
| GRPO | Policy gradient (backprop) | Differentiable loss, float arithmetic | Group size ~16-64 |
| EGGROLL | ES gradient (forward-pass only) | Any fitness function, any arithmetic | Population up to ~10^6 |
6 Key Results
6.1 Throughput: 100x Speedup
The headline result — EGGROLL achieves up to 91% of pure batch inference throughput:
Throughput vs. Population Size (Billion-parameter model, H100)
Tokens/sec
(millions)
│
10 │ ●──●──●──●──●──●──●──● Pure inference (upper bound)
│ ○──○──○──○──○──○──○──○ EGGROLL (91% of inference)
8 │
│
6 │
│
4 │
│
2 │ ×
│ ×
0 │ ×──×──× Naive ES (100x slower at large pop sizes)
└─────────────────────────── Population size
2^10 2^12 2^14 2^16 2^18 2^20
The 100x speedup comes from replacing batched matrix multiplications (low arithmetic intensity) with standard matrix multiplications plus batched vector operations (high arithmetic intensity).
6.2 EGG: Pure Integer Language Model Pretraining
EGGROLL enables a previously impossible experiment: training a nonlinear RNN entirely in int8:
| Property | Value |
|---|---|
| Architecture | EGG (Evolved Generative GRU) |
| Dimensions | D=256, L=6 layers |
| Datatypes | All weights int8, computations int8 with int32 accumulation |
| Activation functions | None (int8 overflow provides implicit nonlinearity) |
| Dataset | MiniPile (character-level) |
| Tokens per second | 10 million (single H100) |
| Population size | 2^20 = 1,048,576 |
| Test loss | 3.40 bits/byte |
| Sequence sharing | 16 sequences shared across population |
Training Loss (EGG, bits/byte)
│
5.0│ ●
│ ●
4.5│ ●
│ ●
4.0│ ●●
│ ●●
3.5│ ●●●●●●
│ ●●●●●●●●●●●● → 3.40 bits/byte
3.0│
└──────────────────────────── Training steps
0 500 1000 1500 2000
Key insight: The int8 tensor multiplication followed by int32 accumulation and cast back to int8 introduces implicit nonlinearity through overflow/saturation. This means no explicit activation functions are needed — the arithmetic format itself provides nonlinearity.
6.3 LLM Reasoning: Countdown Task
On the countdown task, EGGROLL outperforms GRPO and matches or exceeds concurrent ES-LLM results:
| Model | Method | Accuracy |
|---|---|---|
| LLaMA-3.2 1B | GRPO (ES-LLM paper) | ~60% |
| RWKV 1.5B | EGGROLL | ~65% |
| Qwen 2.5 1.5B | GRPO (ES-LLM paper) | ~70% |
| RWKV 7B | EGGROLL | ~80% |
| All 7B models | GRPO (ES-LLM paper) | ~70% |
EGGROLL with RWKV-7B outperforms all reported 7B results from the concurrent ES-LLM paper despite using a weaker base model (RWKV vs. LLaMA/Qwen).
6.4 LLM Reasoning: GSM8K
On GSM8K (grade school math), EGGROLL also outperforms GRPO:
| Model | Method | Accuracy |
|---|---|---|
| RWKV 1.5B | GRPO | Baseline |
| RWKV 1.5B | EGGROLL | Outperforms GRPO |
6.5 Data Efficiency
EGGROLL demonstrates surprising data efficiency at large population sizes:
Population Size vs. Data Sharing
Solid lines: 512 population members share each sequence
Dashed lines: Only 2 members share each sequence (paired)
At large population sizes (2^20), both strategies achieve similar
performance — suggesting that ES can extract useful gradient signal
even when many population members evaluate the same data.
6.6 Tabula Rasa RL
In standard RL settings, EGGROLL matches naive ES performance without speed compromise:
"EGGROLL does not compromise performance compared to ES in tabula rasa RL settings, despite being faster."
7 Reproducibility
Open-Source Status
| Component | Available | Repository | License |
|---|---|---|---|
| EGGROLL algorithm (JAX) | Yes | HyperscaleES | Open |
| Nano-EGG (single file) | Yes | nano-egg | Open |
| RWKV JAX implementation | Yes | jaxrwkv | Open |
| Pre-trained RWKV-7 weights | Yes | HuggingFace (BlinkDL) | Apache 2.0 |
| MiniPile dataset | Yes | HuggingFace | Open |
| Countdown task | Yes | Standard benchmark | N/A |
| GSM8K dataset | Yes | HuggingFace | MIT |
Reproducibility Assessment
Verdict: Highly reproducible. All core components are open-source, the algorithm is described in full mathematical detail, code is provided in multiple repositories, and the base models (RWKV-7) are freely available. The primary barrier is hardware: the headline experiments require H100 GPUs.
What Can Be Reproduced
- The complete EGGROLL algorithm (JAX implementation provided)
- EGG (int8 RNN) training via nano-egg single-file codebase
- RWKV-7 fine-tuning on countdown and GSM8K
- Throughput benchmarks on H100 GPUs
- Theoretical convergence analysis (proofs in paper appendix)
Hardware Requirements for Reproduction
| Experiment | Minimum Hardware | Ideal Hardware |
|---|---|---|
| Nano-EGG training | 1x A100 80GB | 1x H100 |
| RWKV 1.5B fine-tuning | 1x A100 80GB | 4x H100 |
| RWKV 7B fine-tuning | 4x A100 80GB | 8x H100 |
| Throughput benchmarks | 1x H100 | 8x H100 |
| Full paper reproduction | 8x H100 | 64x H100 |
Community Contribution Model
The nano-egg repository explicitly encourages community contributions, modeled after the nanogpt speedrun:
"We highly encourage community contributions, similar to the nanogpt speedrun, to see how efficient we can make pure evolution pretraining in integer formats!"
8 Compute and API Costs
Throughput Economics
EGGROLL's key economic insight is that ES training becomes nearly as cheap as inference. This has profound implications:
Cost Model (EGGROLL vs. Backprop for LLM Training)
Backprop EGGROLL
──────── ───────
Forward pass: 1x FLOPS 1x FLOPS (same)
Backward pass: ~2x FLOPS 0 FLOPS (no backprop)
Optimizer state: ~3x memory 0 memory (no Adam state)
Population: 1 (or micro-batch) N members (parallelized)
──────── ───────
Total FLOPS: ~3x per sample ~1x per sample × N population
Total memory: Model + optimizer Model + perturbation keys
+ activations
H100 Throughput Numbers
| Configuration | Tokens/Second | % of Inference | GPU Utilization |
|---|---|---|---|
| Pure batch inference | ~10M tok/s | 100% | ~95% |
| EGGROLL (rank-1) | ~9.1M tok/s | 91% | ~87% |
| Naive ES | ~0.1M tok/s | 1% | ~5% |
| Backprop training | ~3.3M tok/s | 33% | ~90% |
Memory Costs
| Component | Naive ES | EGGROLL (rank-1) |
|---|---|---|
| Perturbation storage per member | O(m × n) per layer | O(m + n) per layer |
| Total perturbation memory | ~4x model size | ~2/min(m,n) × model size |
| RNG state | Single seed per member | Single seed per member |
| Practical impact (7B model) | ~28 GB per member | ~0.01 GB per member |
Cost Comparison for LLM Reasoning Training
| Method | Hardware | Time | Estimated Cost (cloud) |
|---|---|---|---|
| GRPO (backprop, 1.5B) | 4x A100 | ~4 hours | ~$50 |
| Naive ES (1.5B, pop=1K) | 4x A100 | ~400 hours | ~$5,000 |
| EGGROLL (1.5B, pop=1M) | 4x H100 | ~4 hours | ~$80 |
| GRPO (backprop, 7B) | 8x A100 | ~12 hours | ~$300 |
| EGGROLL (7B, pop=1M) | 8x H100 | ~8 hours | ~$320 |
EGGROLL makes ES cost-competitive with backprop-based methods for the first time at billion-parameter scale.
9 Architecture Solution
System Architecture
┌──────────────────────────────────────────────────────────────┐
│ EGGROLL Training System │
│ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Population Manager │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ RNG Key → fold_in(key, thread_id) → (A_i, B_i) │ │ │
│ │ │ For each population member i = 1..N: │ │ │
│ │ │ - Generate rank-r perturbation (A_i, B_i) │ │ │
│ │ │ - No storage needed (recomputable from seed) │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Batched Forward Pass │ │
│ │ │ │
│ │ For each layer with weight W ∈ R^{m×n}: │ │
│ │ y_shared = x @ W^T (standard matmul, 1x) │ │
│ │ y_perturb = σ · (x @ B_i) @ A_i^T (batched, fast) │ │
│ │ y_i = y_shared + y_perturb (per population member)│ │
│ │ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Fitness Evaluation │ │
│ │ │ │
│ │ For each population member i: │ │
│ │ - Generate output sequence │ │
│ │ - Compute fitness F_i (task-dependent) │ │
│ │ - Return scalar fitness value │ │
│ │ │ │
│ └────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ┌────────────────────────▼───────────────────────────────┐ │
│ │ Parameter Update │ │
│ │ │ │
│ │ Gradient estimate: │ │
│  │    g = (1/Nσ) Σ_i F_i · A_i · B_i^T                    │  │
│ │ │ │
│ │ Fused directly into parameters: │ │
│ │ W ← W + α · g (full-rank update, rank ≤ N·r) │ │
│ │ │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Data Flow for One Training Step
Step 1: Sample input data batch
D = {x_1, x_2, ..., x_B} (B sequences)
Step 2: Distribute across population
Each of N population members gets some subset of D
(or all members share all of D)
Step 3: Forward pass with EGGROLL perturbations
For member i, layer l:
y_i^l = x^l @ W_l^T + σ · (x^l @ B_i^l) @ (A_i^l)^T
Step 4: Generate outputs & compute fitness
F_i = fitness(generate(model_i, input))
Step 5: Compute ES gradient
For each layer l:
g_l = (1/Nσ) Σ_i F_i · A_i^l · (B_i^l)^T
Step 6: Update parameters
W_l ← W_l + α · g_l
Repeat steps 1-6.
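The six steps above can be exercised end-to-end on a toy problem. The following self-contained NumPy sketch maximizes a synthetic black-box fitness (negative distance to a random target matrix); the shapes, population size, and hyperparameters are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 8
W_star = rng.normal(size=(m, n))            # hypothetical target weights
W = np.zeros((m, n))
sigma, lr, N, r = 0.1, 0.1, 512, 1

def fitness(Wp):
    # Toy black-box fitness: closer to W_star is better (no gradients used)
    return -np.sum((Wp - W_star) ** 2)

for step in range(120):
    factors, fits = [], []
    for i in range(N):                      # Steps 2-4: perturb and evaluate
        A = rng.normal(size=(m, r))
        B = rng.normal(size=(n, r))
        fits.append(fitness(W + sigma * A @ B.T))
        factors.append((A, B))
    fits = np.array(fits)
    fits = (fits - fits.mean()) / (fits.std() + 1e-8)   # fitness shaping
    grad = np.zeros_like(W)
    for f, (A, B) in zip(fits, factors):    # Step 5: ES gradient estimate
        grad += f * A @ B.T
    W += lr * grad / (N * sigma)            # Step 6: full-rank update

assert fitness(W) > 0.5 * fitness(np.zeros((m, n)))     # fitness improved markedly
```

Despite every individual perturbation being rank-1, the trained W approaches the full-rank target.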
Parallelism Model
GPU Memory Layout (Single H100, 80GB)
┌──────────────────────────────────────────────┐
│ Model Weights (shared): ~14 GB (7B f16) │
│ ┌────────────────────────────────────────┐ │
│ │ Batch of population members │ │
│ │ (vmap over thread_id dimension) │ │
│ │ │ │
│ │ Thread 0: seed_0 → (A_0, B_0) → F_0 │ │
│ │ Thread 1: seed_1 → (A_1, B_1) → F_1 │ │
│ │ ... │ │
│ │ Thread K: seed_K → (A_K, B_K) → F_K │ │
│ │ │ │
│ │ Each thread: ~0.01 GB overhead │ │
│ └────────────────────────────────────────┘ │
│ │
│ Input data buffer: ~2 GB │
│ Activation memory: ~10 GB │
│ Remaining / fragmentation: ~54 GB │
└──────────────────────────────────────────────┘
Population members fit in the "remaining" budget.
With ~0.01 GB per member overhead:
54 GB / 0.01 GB ≈ 5,400 members per GPU pass
With gradient accumulation: millions of members possible
Multi-GPU Scaling
8x H100 (NVLink interconnect)
GPU 0: Members [0, N/8) ──┐
GPU 1: Members [N/8, 2N/8) ──┤
GPU 2: Members [2N/8, 3N/8) ──┤
GPU 3: Members [3N/8, 4N/8) ──┼──► AllReduce(fitness-weighted
GPU 4: Members [4N/8, 5N/8) ──┤ perturbation sums)
GPU 5: Members [5N/8, 6N/8) ──┤ → single gradient update
GPU 6: Members [6N/8, 7N/8) ──┤
GPU 7: Members [7N/8, N) ──┘
Communication: Only scalar fitness values + gradient updates
(NOT perturbation matrices — these are recomputable from seeds)
10 Component Breakdown
10.1 Random Number Generation (RNG) System
The perturbation generation uses JAX's splittable PRNG system:
import jax

def generate_perturbation(base_key, thread_id, shape, rank=1):
    """Generate a rank-r perturbation from a single base key + thread ID."""
    key = jax.random.fold_in(base_key, thread_id)
    m, n = shape
    params = jax.random.normal(key, (m + n, rank))
    B = params[:n]  # n x r
    A = params[n:]  # m x r
    return A, B
Properties:
- Deterministic: Same (base_key, thread_id) always produces the same perturbation
- No storage: Perturbations are recomputed on-the-fly, not stored
- Parallelizable: fold_in is O(1) and independent across threads
- Communication-free: Each GPU can reconstruct any member's perturbation from the shared base key
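These properties can be demonstrated directly; a small sketch using JAX's PRNG (the shapes and thread IDs are arbitrary):

```python
import jax
import jax.numpy as jnp

base_key = jax.random.PRNGKey(0)

def rank1_perturbation(base_key, thread_id, m, n):
    # fold_in is deterministic: the same (key, thread_id) pair always
    # yields the same perturbation, so nothing needs to be stored
    key = jax.random.fold_in(base_key, thread_id)
    params = jax.random.normal(key, (m + n,))
    return params[n:], params[:n]    # A (m,), B (n,)

A1, B1 = rank1_perturbation(base_key, 7, m=4, n=3)
A2, B2 = rank1_perturbation(base_key, 7, m=4, n=3)   # e.g. rebuilt on another GPU
assert jnp.allclose(A1, A2) and jnp.allclose(B1, B2)

# A different thread_id gives an independent perturbation
A3, _ = rank1_perturbation(base_key, 8, m=4, n=3)
assert not jnp.allclose(A1, A3)
```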
10.2 EGGROLL Forward Pass
The core computation replaces the standard linear layer:
def eggroll_linear(base_key, sigma, W, x, thread_id, rank=1):
"""EGGROLL-perturbed linear layer."""
key = jax.random.fold_in(base_key, thread_id)
m, n = W.shape
perturbation_params = jax.random.normal(key, (m + n, rank))
B = perturbation_params[:n] # n x r
A = perturbation_params[n:] # m x r
# Standard matmul (shared across population, computed once)
y_base = x @ W.T
# Per-member perturbation (batched, very fast)
y_perturb = sigma * (x @ B) @ A.T
return y_base + y_perturb
# Vectorize over population dimension
batched_eggroll = jax.vmap(eggroll_linear, in_axes=(None, None, None, 0, 0))
10.3 Fitness Evaluation Pipeline
Input Sequences → Model Forward Pass → Output Tokens → Fitness Score
│ │ │ │
│ EGGROLL perturbations │ │
│ (per population member) │ │
│ │ │
└── Shared across members ─────────────┘ │
│
┌──────────────┘
│
Task-specific:
• Countdown: exact match
• GSM8K: numerical answer
• LM: cross-entropy loss
• RL: cumulative reward
10.4 Gradient Estimation
The ES gradient estimate in EGGROLL:
import jax.numpy as jnp

def eggroll_gradient(fitnesses, base_key, sigma, W_shape, rank=1):
    """Compute the EGGROLL gradient estimate."""
    N = len(fitnesses)
    m, n = W_shape
    gradient = jnp.zeros((m, n))
    for i in range(N):
        # Perturbations are regenerated from the base key, never stored
        A_i, B_i = generate_perturbation(base_key, i, W_shape, rank)
        gradient += fitnesses[i] * (A_i @ B_i.T)
    return gradient / (N * sigma)
In practice, the gradient computation is fused into the parameter update and vectorized:
# Fused update (schematic of the actual implementation)
def update_params(params, fitnesses, base_key, sigma, lr, rank=1):
    """Fuse gradient computation and parameter update."""
    centered_fitnesses = fitnesses - fitnesses.mean()
    for layer_name, W in params.items():
        # Reconstruct all perturbations and compute their weighted sum
        # (vectorized in practice, not an explicit loop)
        delta_W = vmap_weighted_perturbation_sum(
            centered_fitnesses, base_key, layer_name, W.shape, rank
        )
        params[layer_name] = W + lr * delta_W / (len(fitnesses) * sigma)
    return params
10.5 EGG Architecture (Evolved Generative GRU)
The EGG model is a custom architecture designed to demonstrate EGGROLL's unique capabilities:
EGG Architecture (D=256, L=6)
Input tokens (int8 indices)
│
▼
┌───────────────┐
│ Embedding │ int8 lookup table
│ (256-dim) │
└───────┬───────┘
│ (int8)
┌───────▼───────┐
│ minGRU Block │ No tanh, no sigmoid
│ (modified) │ int8 matmul + int32 accumulate + int8 cast
│ × 6 layers │ Implicit nonlinearity from int8 overflow
│ │
│ MLP Block │ No activation functions
│ (no act fn) │ int8 matmul only
└───────┬───────┘
│ (int8)
┌───────▼───────┐
│ Output head │ int8 → int32 logits
│ + softmax │ Softmax via lookup table (no float)
└───────┬───────┘
│
Loss (bits/byte)
Key design decisions:
- All int8 weights: fastest datatype on H100 Tensor Cores
- No activation functions: the int8 → int32 → int8 cast chain introduces implicit nonlinearity
- No optimizer state: EGGROLL has no momentum, no Adam states — just parameter updates
- Lookup-table softmax: even the loss computation avoids floating point
10.6 RWKV-7 Integration
For LLM fine-tuning experiments, EGGROLL wraps the existing RWKV-7 model:
RWKV-7 "Goose" Architecture
│
▼
┌────────────────────┐
│ Token embedding │
└────────┬───────────┘
│
┌────────▼───────────┐
│ RWKV-7 Block × N │
│ ┌───────────────┐ │
│ │ Time mixing │ │ ← EGGROLL perturbs these weights
│ │ (linear attn) │ │
│ └───────────────┘ │
│ ┌───────────────┐ │
│ │ Channel mix │ │ ← EGGROLL perturbs these weights
│ │ (FFN variant) │ │
│ └───────────────┘ │
└────────┬───────────┘
│
┌────────▼───────────┐
│ Language model head│
└────────┬───────────┘
│
Generate response → Evaluate fitness
11 Core Mechanisms (Detailed)
11.1 Low-Rank Perturbation Theory
The mathematical foundation of EGGROLL rests on replacing the standard ES gradient estimator with a structured variant.
Standard ES gradient:
∇_θ E[F(θ + σε)] = (1/σ) E[F(θ + σε) · ε], ε ~ N(0, I_d)
EGGROLL gradient (for matrix parameter W ∈ R^{m×n}):
∇_W E[F(W + σ·AB^T)] = (1/σ) E[F(W + σ·AB^T) · AB^T], A ∈ R^{m×r}, B ∈ R^{n×r}, entries i.i.d. N(0,1)
Consistency Theorem (informal): As the parameter dimension d → ∞, the EGGROLL gradient estimate converges to the standard ES gradient estimate at rate O(1/r), where r is the perturbation rank.
Linearizing Effect: The paper proves that in high dimensions, the objective function locally linearizes around the current parameters, meaning rank-1 perturbations capture the dominant gradient direction. This is analogous to how random projections preserve distances in high dimensions (Johnson-Lindenstrauss lemma).
11.2 Arithmetic Intensity Analysis
The key to EGGROLL's speedup is arithmetic intensity — the ratio of compute (FLOPS) to memory bandwidth:
Operation FLOPS Memory Bytes Arithmetic Intensity
─────────────────────────────────────────────────────────────────────────────────
Standard matmul (m×k × k×n): 2mkn 2(mk + kn + mn) mkn / (mk+kn+mn)
≈ k (for m=n=k)
Batched matmul (N × m×k × k×n): 2Nmkn 2N(mk + kn + mn) Same per-element
But each member multiplies by its own weights → no operand reuse across the batch, low GPU utilization
EGGROLL decomposition:
x @ W^T (shared): 2mkn 2(mk + kn + mn) ≈ k (high)
x @ B (batched, B ∈ R^{n×r}): 2Nkr 2N(kr + nr + kr) ≈ min(k,r) (ok)
(xB) @ A^T (batched): 2Nmr 2N(mr + mr + mr) ≈ r/3 (fast for r=1)
For typical transformer hidden dimensions (k = 4096) and rank r = 1:
- Standard matmul: intensity ≈ 4096 (excellent)
- EGGROLL perturbation: intensity ≈ 1–4096 (architecture-dependent, but much better than batched full-rank)
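The FLOP side of this accounting can be sanity-checked directly. The sketch below counts per-member multiply-add work (the arithmetic-intensity gains come on top of this FLOP reduction); the shapes are illustrative:

```python
def naive_member_flops(b, m, n):
    # Naive ES: every member runs its own (b x n) @ (n x m) perturbed matmul
    return 2 * b * n * m

def eggroll_member_flops(b, m, n, r=1):
    # EGGROLL: per member, only the low-rank correction x @ B then (xB) @ A^T;
    # the shared x @ W^T matmul is amortized once over the whole population
    return 2 * b * n * r + 2 * b * r * m

b, m, n = 32, 4096, 4096
ratio = naive_member_flops(b, m, n) / eggroll_member_flops(b, m, n)
print(ratio)   # 2048.0 = m*n / (m + n) for rank 1
```

For square 4096-dimensional layers, each population member's extra work shrinks by a factor of about 2,000.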
11.3 Population Scaling Dynamics
EGGROLL enables population sizes 3 orders of magnitude larger than prior ES work:
Population Size Regimes
OpenAI ES (2017): N ≈ 1,000 │ Full-rank perturbations
│ Small NNs (MuJoCo)
│
ES-LLM (2025): N ≈ 10 │ Small population, many rollouts
│ per member to reduce variance
│
EGGROLL: N ≈ 1,000,000 │ Rank-1 perturbations
│ Billion-parameter models
│ High throughput
Theoretical implication:
- Gradient estimate variance ∝ 1/N
- At N = 10^6, variance is 1000x lower than N = 10^3
- This enables stable updates with larger learning rates
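The 1/N variance scaling can be observed with a standard ES estimator on a toy linear fitness (the dimension, population sizes, and repetition count below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 0.1
g_true = rng.normal(size=d)     # gradient of a toy linear fitness F(θ) = g·θ

def es_grad(N):
    # Standard ES estimator (1/Nσ) Σ_i F(θ+σε_i)·ε_i, with F(θ) subtracted as baseline
    eps = rng.normal(size=(N, d))
    F = (sigma * eps) @ g_true
    return (F[:, None] * eps).sum(axis=0) / (N * sigma)

# Empirical per-coordinate variance of the estimator at two population sizes
var_small = np.var([es_grad(100) for _ in range(300)], axis=0).mean()
var_large = np.var([es_grad(10_000) for _ in range(300)], axis=0).mean()
assert var_small / var_large > 50    # ~100x the population → ~100x lower variance
```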
11.4 Noise-Reuse for Sequence Processing
For language modeling, EGGROLL incorporates Noise-Reuse ES (Vicol et al., 2023):
Standard ES on sequences:
Each token position: new perturbation → O(T) perturbation samples
Memory: O(T × d) per population member
Noise-Reuse ES:
Reuse the same perturbation across multiple token positions
Take multiple parameter updates within a single sequence
Memory: O(d) per population member (independent of T)
Timeline for one sequence (T=100 tokens):
┌──────┬──────┬──────┬──────┬──────┐
│tok 1 │tok 25│tok 50│tok 75│tok100│
│perturb│ │update│ │update│
│ ε_1 │ ε_1 │ ε_2 │ ε_2 │ done │
└──────┴──────┴──────┴──────┴──────┘
Same perturbation reused
Update after every K tokens
11.5 Integer Arithmetic as Nonlinearity
The most conceptually striking mechanism in the paper is the use of integer overflow as a source of nonlinearity:
Float32 behavior: Int8 behavior:
3.0 × 50.0 = 150.0 (linear) 3 × 50 = 150 → 127 (saturated!)
3.0 × 100.0 = 300.0 (linear) 3 × 100 = 300 → 44 (overflow!)
Int8 multiplication chain:
Input (int8) → Matmul → Accumulate (int32) → Cast (int8)
↑
Implicit nonlinearity!
Values > 127 wrap/saturate
Creates sigmoid-like behavior
This is inspired by prior OpenAI work showing that "nonlinear computation in deep linear networks" can emerge from floating-point rounding. EGGROLL takes this further: int8's extreme quantization makes the nonlinearity pronounced enough to train useful models.
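Both cast behaviors shown above are easy to reproduce in NumPy (note that 150 → 127 corresponds to a saturating clip, while 300 → 44 corresponds to modulo-256 wraparound; either breaks linearity):

```python
import numpy as np

prods = np.array([3 * 50, 3 * 100], dtype=np.int32)    # [150, 300] after int32 accumulate

# Saturating cast back to int8: values clip to the [-128, 127] range
saturated = np.clip(prods, -128, 127).astype(np.int8)
assert [int(v) for v in saturated] == [127, 127]

# Wraparound cast (modulo 256): 150 -> -106, 300 -> 44
wrapped = prods.astype(np.int8)
assert [int(v) for v in wrapped] == [-106, 44]
```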
11.6 Update Rule for EGG (Integer)
The EGG model uses a specialized update rule for integer parameters:
import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    """Integer-compatible parameter update."""
    # Threshold: only update if the gradient estimate is large enough
    update = jnp.where(
        jnp.abs(gradient_estimate) > threshold,
        jnp.sign(gradient_estimate),  # Step by ±1 in int8
        0
    )
    return jnp.clip(W_int8 + update, -128, 127).astype(jnp.int8)
Properties:
- No learning rate (step size is always ±1 in int8 space)
- No momentum or optimizer state
- Threshold prevents noise from dominating updates
- clip ensures int8 range
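A quick usage check of this rule (the update function is restated so the snippet runs standalone; the example gradient values are hypothetical):

```python
import jax.numpy as jnp

def egg_update(W_int8, gradient_estimate, threshold=1):
    # Thresholded ±1 integer update, as described above
    update = jnp.where(jnp.abs(gradient_estimate) > threshold,
                       jnp.sign(gradient_estimate), 0)
    return jnp.clip(W_int8 + update, -128, 127).astype(jnp.int8)

W = jnp.array([-128, 0, 126, 127], dtype=jnp.int8)
g = jnp.array([-5.0, 0.5, 2.0, 2.0])   # hypothetical ES gradient estimate

W_new = egg_update(W, g)
# -128 - 1 clips back to -128; |0.5| <= threshold leaves 0 unchanged;
# 126 + 1 = 127; 127 + 1 clips to 127
assert [int(v) for v in W_new] == [-128, 0, 127, 127]
```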
12 Programming Language
Implementation Stack
| Component | Language | Framework | Why |
|---|---|---|---|
| EGGROLL core | Python | JAX | vmap for population parallelism, XLA compilation |
| EGG model | Python | JAX | int8 tensor operations via jnp |
| RWKV-7 model | Python | JAX (jaxrwkv) | Custom RWKV port for JAX |
| Throughput benchmarks | Python | JAX + CUDA | Hardware-level profiling |
| Nano-egg | Python | JAX | Single-file implementation |
Why JAX?
JAX is the enabling technology for EGGROLL. Several JAX features are critical:
jax.vmap: Automatically vectorizes the forward pass over the population dimension. Without vmap, implementing population parallelism would require explicit batching code.
# Without vmap: explicit loop (slow)
for i in range(N):
y_i = forward(params, x[i], perturbation[i])
# With vmap: automatic vectorization (fast)
batched_forward = jax.vmap(forward, in_axes=(None, 0, 0))
y = batched_forward(params, x, perturbation_ids)
jax.random.fold_in: Deterministic PRNG that allows per-member perturbation generation without communication:
# Each member gets a unique but deterministic perturbation
key_i = jax.random.fold_in(base_key, thread_id)
perturbation_i = jax.random.normal(key_i, param.shape)
XLA compilation: JAX's just-in-time compilation fuses the EGGROLL operations into efficient GPU kernels, avoiding Python overhead.
Integer support: JAX supports int8 tensor operations via jnp.int8, enabling the EGG experiments.
Multi-device: JAX's pmap enables multi-GPU parallelism with minimal code changes.
Why Not PyTorch?
The paper's authors chose JAX specifically because:
- PyTorch's vmap (functorch) is less mature than JAX's
- PyTorch's PRNG system doesn't have an equivalent to fold_in for deterministic per-member perturbations
- JAX's XLA compilation provides better kernel fusion for the EGGROLL computation pattern
- PyTorch's int8 support is primarily for inference quantization, not training
Code Organization
HyperscaleES/
├── eggroll/
│ ├── core.py # EGGROLL algorithm implementation
│ ├── perturbation.py # Low-rank perturbation generation
│ ├── update.py # Parameter update rules
│ └── utils.py # Fitness normalization, logging
├── models/
│ ├── egg.py # EGG (int8 GRU) architecture
│ ├── rwkv.py # RWKV-7 wrapper for EGGROLL
│ └── mlp.py # Simple MLP for RL experiments
├── tasks/
│ ├── countdown.py # Countdown reasoning task
│ ├── gsm8k.py # GSM8K evaluation
│ ├── language_model.py # Character-level LM (MiniPile)
│ └── rl_envs.py # RL environment wrappers
├── configs/
│ ├── egg_minipile.yaml # EGG pretraining config
│ ├── rwkv_countdown.yaml # RWKV countdown config
│ └── rwkv_gsm8k.yaml # RWKV GSM8K config
└── scripts/
├── train.py # Main training script
├── benchmark.py # Throughput benchmarking
└── evaluate.py # Model evaluation
13 Memory Management
Perturbation Memory: The Key Innovation
The most important memory optimization in EGGROLL is that perturbations are never stored — they are regenerated from RNG seeds:
Naive ES memory:
N population members × d parameters × 4 bytes (float32)
= N × d × 4 bytes
For N = 10^6, d = 7×10^9 (7B model):
= 10^6 × 7×10^9 × 4 bytes = 2.8 × 10^16 bytes = 28 PB ← Obviously impossible!
EGGROLL memory:
N population members × 1 seed (8 bytes)
+ shared model parameters (d × 2 bytes for float16)
For N = 10^6, d = 7×10^9:
= 8 MB (seeds) + 14 GB (model) = 14.008 GB ← Fits on one GPU!
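Checking the arithmetic behind the two budgets takes only a few lines (population size and parameter count taken directly from the text):

```python
N = 10**6           # population size
d = 7 * 10**9       # parameter count (7B model)

naive_bytes = N * d * 4              # one float32 perturbation per member
eggroll_bytes = N * 8 + d * 2        # 8-byte seed per member + float16 weights

print(f"naive ES: {naive_bytes / 1e15:.0f} PB")
print(f"EGGROLL:  {eggroll_bytes / 1e9:.3f} GB")
```

The six-orders-of-magnitude gap comes entirely from replacing stored noise tensors with regenerable seeds.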
Gradient Accumulation Memory
The gradient is accumulated online without storing all perturbations:
# Memory-efficient gradient accumulation (sketch; `chunks` yields the
# population indices in groups of `chunk_size`)
gradient = jnp.zeros_like(W)
for batch_of_members in chunks(range(N), chunk_size):
    # Regenerate this chunk's perturbations on the fly from their keys
    perturbations = jax.vmap(generate_perturbation)(keys[batch_of_members])
    # Evaluate fitness for each perturbed member
    fitnesses = jax.vmap(evaluate)(perturbations)
    # Accumulate the fitness-weighted sum of perturbations
    gradient = gradient + jnp.einsum('i,ijk->jk', fitnesses, perturbations)
Peak memory: model_size + chunk_size × perturbation_overhead
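A toy, stdlib-only version of this loop (hypothetical `fitness`, tiny dimensions for clarity) makes the peak-memory claim concrete: only one perturbation exists at a time, yet the accumulated gradient matches a single full-population pass.

```python
import random

d, N, chunk_size = 4, 8, 3
seeds = list(range(N))

def perturbation(seed):
    # Regenerated on demand from a seed; never stored between chunks
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def fitness(eps):
    # Hypothetical fitness: reward perturbations with a positive first entry
    return eps[0]

gradient = [0.0] * d
for start in range(0, N, chunk_size):
    for seed in seeds[start:start + chunk_size]:
        eps = perturbation(seed)          # on-the-fly regeneration
        f = fitness(eps)
        gradient = [g + f * e for g, e in zip(gradient, eps)]
```

The real implementation vectorizes the inner loop with jax.vmap, but the memory behavior is identical: peak usage is bounded by the chunk, not the population.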
Memory Budget Breakdown (7B RWKV on 8× H100)
| Component | Per-GPU Memory | Total |
|---|---|---|
| Model weights (float16, replicated) | 14 GB | 14 GB × 8 |
| Input data buffer | 2 GB | 2 GB × 8 |
| Activation memory (per chunk) | 8 GB | 8 GB × 8 |
| Perturbation overhead (per chunk) | 0.5 GB | 0.5 GB × 8 |
| Gradient accumulator | 14 GB | 14 GB × 8 |
| Framework overhead | 2 GB | 2 GB × 8 |
| Total | ~40.5 GB | — |
| Available (H100 80GB) | 80 GB | — |
| Headroom | ~39.5 GB | — |
Memory Comparison: EGGROLL vs. Backprop (Adam)
| Component | Backprop + Adam | EGGROLL |
|---|---|---|
| Model weights | 14 GB (7B × 2 bytes) | 14 GB (same) |
| Gradients | 14 GB | 0 (computed online) |
| Adam momentum (m) | 14 GB | 0 (no optimizer state) |
| Adam variance (v) | 14 GB | 0 (no optimizer state) |
| Activations (for backward) | 20-40 GB | 0 (no backward pass) |
| Perturbation seeds | 0 | 8 MB |
| Gradient accumulator | 0 | 14 GB |
| Total | 76-96 GB | ~28 GB |
EGGROLL uses roughly a third of the memory of Adam-based backprop training for the same model, primarily because it eliminates optimizer state and activation storage for the backward pass.
14 Continued Learning
Within-Run Learning
EGGROLL's learning dynamics within a single training run:
- Population-based gradient estimation: Each step produces a gradient estimate from N fitness evaluations. Unlike backprop's deterministic gradient, the ES gradient estimate has variance ∝ 1/N.
- No momentum (EGG variant): The EGG model uses no optimizer state; each update is independent, so there is no "memory" of past gradients.
- Standard optimizers (RWKV variant): For LLM fine-tuning, EGGROLL can be combined with standard optimizers (SGD with momentum, Adam) applied to the ES gradient estimate.
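As a sketch of the last point, the ES gradient estimate can be fed to any first-order optimizer. This minimal SGD-with-momentum step over flat parameter lists is an illustrative helper, not the paper's implementation:

```python
def momentum_step(params, es_grad, velocity, lr=0.01, beta=0.9):
    # Treat the ES estimate like any other gradient signal
    velocity = [beta * v + g for v, g in zip(velocity, es_grad)]
    # Gradient *ascent* on fitness, so the scaled velocity is added
    params = [p + lr * v for p, v in zip(params, velocity)]
    return params, velocity

params, vel = [0.0, 0.0], [0.0, 0.0]
params, vel = momentum_step(params, [1.0, -1.0], vel)
```

Because the estimate is just a vector, swapping in Adam (or any optax transformation in the JAX codebase) requires no change to the ES machinery itself.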
Cross-Task Transfer
EGGROLL itself is a training algorithm, not a model. Transfer learning occurs at the model level:
- Pre-trained RWKV-7 weights serve as the initialization for fine-tuning
- EGGROLL fine-tuning preserves the base model's capabilities while adapting to new tasks
- The same EGGROLL implementation works across tasks (countdown, GSM8K, RL) without modification
Meta-Learning Potential
EGGROLL opens several meta-learning possibilities not yet explored in the paper:
| Direction | Description | Status |
|---|---|---|
| Learned rank selection | Adapt rank r during training based on gradient quality | Not implemented |
| Adaptive σ | Learn the perturbation scale per-layer | Standard ES technique, applicable |
| Population scheduling | Vary N over training (large early, small late) | Not explored |
| Multi-fidelity ES | Use cheap fitness approximations early, expensive later | Not explored |
| Fitness shaping | Learn the fitness transformation function | Standard ES technique, applicable |
Relationship to LoRA and Continual Fine-Tuning
EGGROLL has an interesting relationship to LoRA-based continual learning:
LoRA fine-tuning:
W' = W + B·A^T (always low-rank, r typically 16-64)
Multiple tasks: merge adapters or keep separate
EGGROLL fine-tuning:
W' = W + Σ(f_i · B_i · A_i^T) (full-rank after aggregation)
Multiple tasks: standard multi-task fitness function
Key difference: EGGROLL's updates are full-rank, so there's no
"adapter" to merge — the model weights are directly modified.
This is more like full fine-tuning in terms of expressiveness,
but achieved through inference-speed forward passes only.
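The "full-rank after aggregation" point is easy to verify: each member contributes a rank-1 outer product B_i·A_i^T, but their fitness-weighted sum is generally full-rank. A minimal 2×2 check (stdlib only; for a 2×2 matrix, nonzero determinant means full rank):

```python
def outer(b, a):
    # Rank-1 outer product b a^T as nested lists
    return [[bi * aj for aj in a] for bi in b]

def det2(M):
    # Determinant of a 2x2 matrix; zero iff the matrix is singular
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

P1 = outer([1.0, 0.0], [1.0, 0.0])   # one member's rank-1 perturbation
P2 = outer([0.0, 1.0], [0.0, 1.0])   # another member's rank-1 perturbation
assert det2(P1) == 0 and det2(P2) == 0   # each alone is singular (rank 1)

update = [[P1[i][j] + P2[i][j] for j in range(2)] for i in range(2)]
assert det2(update) != 0                 # the aggregate is full-rank
```

With N population members the aggregate can reach rank min(N, m, n), which is why EGGROLL's update behaves like full fine-tuning rather than like a LoRA adapter.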
Self-Play and Multi-Agent Extensions
The authors mention ongoing work on multi-agent optimization:
"We think that EGGROLL has strong potential to directly optimize LLMs with multi-agent awareness, breaking the best-of-k curse of RL."
This connects to their prior work on Social Deduction LLMs (Among Us) and suggests future applications in cooperative/competitive multi-agent training where end-to-end optimization through agent interactions is desirable but backprop through the interaction is intractable.
15 Applications
15.1 LLM Post-Training for Reasoning
The most immediately practical application: replacing or supplementing GRPO/RLHF for LLM reasoning training.
| Advantage | Description |
|---|---|
| No reward model needed | Fitness is computed directly from task output |
| No KL penalty tuning | ES naturally stays near initialization (small σ) |
| Non-differentiable rewards | Can optimize for exact match, code execution, tool use |
| Chain-of-thought optimization | Fitness includes reasoning chain quality |
| Multi-turn optimization | Natural extension to dialogue/agent settings |
Target tasks: - Mathematical reasoning (GSM8K, MATH, Olympiad problems) - Code generation (HumanEval, MBPP, SWE-bench) - Tool use optimization - Multi-step agent workflows
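Because fitness is computed directly from task output, a reward as blunt as exact-match is usable with no reward model and no KL term. A minimal, hypothetical example (the `exact_match_fitness` helper is illustrative, not from the paper's code):

```python
def exact_match_fitness(completion: str, answer: str) -> float:
    # Non-differentiable reward used directly as ES fitness:
    # no reward model, no KL penalty, no gradient through the metric
    return 1.0 if completion.strip().endswith(answer) else 0.0

assert exact_match_fitness("The answer is 42", "42") == 1.0
assert exact_match_fitness("The answer is 43", "42") == 0.0
```

The same pattern extends to code execution (fitness = tests passed) or tool use (fitness = task completed), none of which admit a usable gradient.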
15.2 Novel Architecture Exploration
EGGROLL enables training architectures that are impractical with backprop:
| Architecture Type | Why Backprop Fails | EGGROLL Solution |
|---|---|---|
| Pure integer NNs | No float gradients | Forward-pass only, int8 operations |
| Lookup-table layers | Non-differentiable | Black-box optimization |
| Discrete attention | Argmax not differentiable | Fitness-based selection |
| Spiking neural networks | Non-differentiable spikes | Direct fitness optimization |
| Neuromorphic architectures | Hardware-specific ops | Hardware-in-the-loop training |
| State machines + NNs | Discrete transitions | End-to-end fitness |
15.3 Neurosymbolic System Optimization
The paper highlights neurosymbolic optimization as a key future direction:
"We are particularly interested in the end-to-end optimization of neurosymbolic systems, since EGGROLL naturally handles nondifferentiable components within a model."
Neurosymbolic System (optimized end-to-end by EGGROLL):
Input → [Neural Encoder] → [Symbolic Reasoner] → [Neural Decoder] → Output
↑ ↑ ↑
Differentiable Non-differentiable Differentiable
↑ ↑ ↑
├──── EGGROLL optimizes ALL components simultaneously ────┤
Specific targets mentioned: - ROSA architecture for RWKV-8 (discrete memory system) - LLMs with external tool calls (non-differentiable tool invocations) - Code-writing agents (discrete code generation + execution)
15.4 Discrete Diffusion Model Training
The paper mentions discrete diffusion models as a target:
"We would also like to enable us to try EGGROLL on other LLMs, including Discrete Diffusion models for which the standard policy gradient theorem is technically intractable (due to the mask-based sampling procedure)."
Discrete diffusion models (MDLM, SEDD, etc.) generate text by iteratively demasking tokens. The masking/demasking procedure is non-differentiable, making standard policy gradients inapplicable. EGGROLL's black-box nature makes it directly applicable.
15.5 Hardware-Accelerated Integer Training
The EGG experiments suggest a new paradigm for hardware-efficient training:
Current Training Pipeline:
Model (float16/bfloat16) → Forward (float16) → Backward (float16) → Adam (float32)
GPU utilization: ~30-50% of peak FLOPS
EGGROLL Integer Pipeline:
Model (int8) → Forward (int8 matmul, int32 accum) → Fitness → Update (int8)
GPU utilization: ~80-91% of peak FLOPS (inference-speed)
H100 Peak FLOPS by Datatype:
float32: 67 TFLOPS
float16: 989 TFLOPS
bfloat16: 989 TFLOPS
int8: 1,979 TOPS ← 2x more than float16!
EGGROLL with int8 can theoretically achieve 2x the throughput
of float16 training, on top of the ~100x ES speedup.
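The integer pipeline's core operation, an int8 matmul accumulated in a wider integer type, can be sketched in plain Python; the range checks stand in for the hardware int8 datatype, and Python's unbounded ints stand in for the int32 accumulator:

```python
def int8_matmul(A, B):
    # Inputs must fit in int8; products are summed in a wide accumulator,
    # mirroring how tensor cores execute int8 GEMMs with int32 accumulation
    assert all(-128 <= x <= 127 for row in A for x in row)
    assert all(-128 <= x <= 127 for row in B for x in row)
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]
```

Since ES needs only forward passes, the whole training loop can run in this datatype, which is what lets EGG approach inference-level hardware utilization.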
15.6 Reinforcement Learning
While demonstrated only on standard benchmarks in this paper, EGGROLL is naturally suited to RL:
| RL Setting | EGGROLL Advantage |
|---|---|
| Sparse rewards | No gradient through sparse signal needed |
| Multi-agent | End-to-end team optimization |
| Real-world robotics | No simulator differentiability required |
| Safety-constrained | Fitness includes safety penalties directly |
| Sim-to-real transfer | Optimize directly in the target environment |
15.7 Scientific Discovery
EGGROLL could be applied to optimize scientific models where the evaluation is non-differentiable:
| Domain | Model | Fitness Function |
|---|---|---|
| Drug discovery | Molecular generator NN | Binding affinity (docking score) |
| Materials science | Crystal structure predictor | DFT energy (non-differentiable) |
| Protein design | Structure prediction NN | Experimental stability |
| Climate modeling | Neural weather model | Forecast accuracy |
15.8 Limitations and Boundary Conditions
| Limitation | Impact | Mitigation |
|---|---|---|
| Sample efficiency | ES needs many evaluations | Large population + high throughput |
| Gradient quality at rank-1 | May miss important directions | Increase rank r if needed |
| Exploration vs. exploitation | Fixed σ limits exploration | Adaptive σ scheduling |
| No second-order information | Cannot exploit curvature | Larger population compensates |
| Hardware requirements | H100 for headline results | Scales down to A100, less dramatically |
| JAX ecosystem | Smaller than PyTorch | PyTorch port underway |
| Transformer support | KV-cache memory issue | vLLM/Megatron port in progress |
15.9 Comparison with Related Training Paradigms
| Paradigm | Gradient | Memory | Throughput | Generality |
|---|---|---|---|---|
| Backpropagation | Exact | High (activations) | ~30% of inference | Differentiable only |
| REINFORCE | Noisy | Low | ~50% of inference | Any reward |
| PPO/GRPO | Noisy + baseline | Medium | ~30% of inference | Any reward |
| Standard ES | Noisy | Very high (perturbations) | ~1% of inference | Any fitness |
| EGGROLL | Noisy (low-rank) | Low | ~91% of inference | Any fitness |
| Zeroth-order methods | Finite differences | Low | ~50% of inference | Any function |
15.10 The EGGROLL Vision: Inference IS Training
The paper's most provocative claim is that EGGROLL fundamentally changes the relationship between inference and training:
Traditional view:
Training ≠ Inference
Training: forward + backward + optimizer = expensive
Inference: forward only = cheap
Cost ratio: Training / Inference ≈ 3-5x
EGGROLL view:
Training ≈ Inference
Training: forward + perturbation + fitness = almost as fast as inference
Inference: forward only
Cost ratio: Training / Inference ≈ 1.1x
Implication: Any system that can run batched inference can also TRAIN,
with only 10% overhead. This means:
- Edge devices can self-improve
- Inference servers can simultaneously fine-tune
- Any fitness function becomes a training signal
- The distinction between deployment and training dissolves
This vision, if realized at larger scale and across more architectures, would represent a fundamental shift in how ML systems are deployed and improved.
This analysis is based on the arXiv paper (2511.16652v2), the project website (eshyperscale.github.io), the open-source code repositories (HyperscaleES, nano-egg, jaxrwkv), and supplementary information from the AlphaXiv discussion page. EGGROLL represents a significant advance in making evolution strategies practical for billion-parameter model training, with implications spanning LLM post-training, novel architecture exploration, and the fundamental relationship between inference and learning.