Introduced: 2017-03
Score: 7.81/10 (Draft)
Chapter 30

Evolution Strategies at Scale

Part P06: Evolutionary Scaling & Efficiency

Key Contribution

Evolution Strategies at Scale (Qiu et al., ICML 2026) is the first successful application of Evolution Strategies to full-parameter fine-tuning of billion-parameter LLMs without dimensionality reduction. Using a population of just 30, a fixed noise scale of $\sigma = 0.001$, and a single hyperparameter configuration across all models, ES outperforms PPO, GRPO, and Dr.GRPO on every tested model (0.5B–8B parameters) on the Countdown reasoning task while requiring approximately 7–8× less GPU memory. The work overturns the widely held assumption, rooted in Vemula et al. (2019), that parameter-space exploration is intractable at modern model scales, and positions ES as a viable backpropagation-free post-training paradigm alongside reinforcement learning.

30.1 Overview and Motivation

Post-training optimization of large language models has been dominated by reinforcement learning methods — PPO, GRPO, DPO, and their variants — since the success of RLHF in aligning models such as ChatGPT. These methods operate in action space: they treat token generation as a sequential decision process and optimize policies via gradient-based updates that require backpropagation through the entire model. This approach carries substantial computational overhead — gradient buffers, optimizer states, activation caches, and often a frozen reference model for KL regularization — and is notoriously sensitive to hyperparameter choices, frequently requiring per-model tuning sweeps.

Evolution Strategies (ES), by contrast, operate in parameter space. Rather than computing gradients through the model's computation graph, ES evaluates the model under random perturbations of its parameters and uses the resulting fitness signals to estimate a gradient of the expected reward. This zeroth-order approach requires only forward passes — no backpropagation, no gradient storage, no optimizer states. The idea is old: Natural Evolution Strategies (NES) were formalized by Wierstra et al. (2008, 2014), and OpenAI demonstrated ES as an alternative to RL for Atari and MuJoCo control tasks (Salimans et al., 2017) using populations of 10,000+ perturbations on models with roughly 4 million parameters.

The conventional wisdom, however, held that ES could not scale to modern deep learning. Vemula et al. (2019) argued that the sample complexity of parameter-space exploration grows as $O(d^2)$ with dimensionality $d$, rendering it intractable for billion-parameter models. Prior attempts to apply ES to neural networks remained limited to millions of parameters (Lehman et al., 2018; Zhang et al., 2017) or resorted to dimensionality reduction — optimizing only the final layer (Toledano-López et al., 2022) or LoRA adapters (Jin et al., 2024). The scaling barrier appeared fundamental.

Qiu et al. (2025, 2026) overturn this assumption. Their paper, Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning, demonstrates that ES with a population of just 30 (more than 300 times smaller than the 10,000+ perturbations used by OpenAI ES) successfully fine-tunes LLMs with 0.5B to 8B parameters on reasoning, math, and behavioral tasks. On the Countdown task, ES outperforms all tested RL baselines (PPO, GRPO, Dr.GRPO) on every model, using a single hyperparameter configuration across all experiments. The work was published at ICML 2026, with code available at github.com/VsonicV/es-fine-tuning-paper (340+ stars as of early 2026).

30.1.1 Historical Context: From OpenAI ES to Billion-Parameter ES

The lineage of this work traces through several milestones in evolution strategies applied to neural networks:

| System | Year | Parameters | Population Size | Contribution |
| --- | --- | --- | --- | --- |
| Salimans et al. (OpenAI ES) | 2017 | ~4M | 10,000+ | ES as RL alternative for control |
| Zhang et al. | 2017 | ~3M | 10,000+ | ES for RL policy optimization |
| Lehman et al. | 2018 | ~167K | Large | ES with novelty search |
| Jin et al. | 2024 | 1,600 (LoRA) | Small | ES on low-rank adapters only |
| Qiu et al. (this work) | 2025–2026 | 0.5B–8B | 30 | Full-parameter ES at billion scale |

The jump from millions to billions of parameters, combined with the reduction from 10,000+ to 30 perturbations per iteration, represents a qualitative shift. Both changes were individually assumed to be fatal to ES performance; together they should have been catastrophic. The empirical success of this combination is the paper's central surprise.

30.1.2 Team and Institutional Context

The paper originates from Cognizant AI Labs, led by Babak Hodjat (co-founder of Sentient Technologies, one of the largest AI startups focused on evolutionary computation) and Risto Miikkulainen (UT Austin professor, co-inventor of NEAT with Kenneth Stanley). The team occupies a distinctive niche in the evolutionary AI landscape: while Google DeepMind uses LLMs to evolve code (AlphaEvolve, FunSearch) and Sakana AI uses evolution to merge models, Cognizant directly evolves model parameters, the most classical form of neuroevolution, applied at unprecedented scale. Elliot Meyerson, a co-author, is notable for prior work on language model crossover (2024), a precursor to the LLM-as-evolutionary-operator paradigm.

30.2 Architecture

The ES fine-tuning system follows a clean outer-loop / inner-loop architecture. The outer loop iterates $T$ times. In each iteration, the inner loop evaluates $N = 30$ independent perturbations of the current model parameters, collects fitness scores, normalizes them, and applies a weighted update to the model center. The entire process operates on a single model instance, with perturbations applied and restored in-place.

[Figure: ES fine-tuning architecture. Outer loop: T iterations. Inner loop (N = 30 perturbations, parallelizable across GPUs): (1) sample seed s_n, (2) perturb θ in place, (3) greedy inference, (4) compute reward R_n, (5) restore θ in place. Reward normalization: R̃_n = (R_n − μ) / σ_R. Decomposed update: for each layer ℓ and seed s_n, θ_ℓ += α·(1/N)·R̃_n·ε_ℓ. Periodic evaluation on a held-out test set, checkpointing the best model. GPU parallelization interleaves seeds (GPU 0: s₁, s₅, s₉, …; GPU 1: s₂, s₆, s₁₀, …; GPU 2: s₃, s₇, s₁₁, …).]

30.2.1 Paradigm Distinction

It is essential to distinguish this system from the LLM-as-mutation-operator paradigm studied throughout most of this book. In systems like AlphaEvolve (Chapter 4), FunSearch (Chapter 9), and OpenEvolve (Chapter 5), the LLM is a frozen tool that generates candidate programs; evaluation scores those programs; and evolutionary operators recombine them. The LLM's parameters never change. In this paper, the LLM is the optimization target: ES perturbs its billions of floating-point weights, evaluates the resulting model on a task, and uses the fitness signal to shift the weight distribution. No code is generated; no programs are evolved. The search space is continuous $\mathbb{R}^d$ for $d$ up to 8 billion.

| Property | LLM-as-Operator (e.g., AlphaEvolve) | LLM-as-Target (ES at Scale) |
| --- | --- | --- |
| Search space | Discrete (code, programs) | Continuous (model weights) |
| LLM role | Mutation/crossover operator (frozen) | Optimization target (modified) |
| Solution type | Programs, algorithms, heuristics | Model parameter vectors |
| Evaluation | Run generated code | Run perturbed model on task |
| Gradient use | None (black-box code eval) | None (zeroth-order) |

30.2.2 Component Inventory

The implementation (source: github.com/VsonicV/es-fine-tuning-paper) consists of the following components, each implemented as a section of a single-file Python script:

| Component | Function | Source File(s) |
| --- | --- | --- |
| Noise Generator | Produces Gaussian perturbations via seeded PyTorch RNG | All es_fine-tuning_*.py |
| Layer-Level Perturbation | In-place add/subtract of noise, one layer at a time | All scripts |
| Reward Evaluator | Greedy decoding → parse → binary/composite reward | Task-specific |
| Reward Normalizer | Z-score normalization within each iteration | All scripts |
| Decomposed Updater | Layer × seed parameter update with minimal peak memory | All scripts |
| Parallelization Manager | Multi-GPU distribution via HuggingFace Accelerate or vLLM | accelerate / *_accl.py |

The repository provides two noise variants (correlated and i.i.d.) and an accelerated version using vLLM. The entire codebase is approximately 1,100–1,800 lines of Python, a notable testament to the algorithm's simplicity.

30.3 Core Algorithms

30.3.1 Natural Evolution Strategies: Theoretical Foundation

The algorithm is grounded in Natural Evolution Strategies (NES; Wierstra et al., 2008, 2014). Rather than optimizing a single parameter vector $\theta$, NES optimizes a search distribution $\pi_\psi(\theta)$ parameterized by $\psi$. The objective is to maximize the expected reward:

$$J(\psi) = \mathbb{E}_{\theta \sim \pi_\psi}\bigl[R(\theta)\bigr]$$

where $R(\theta)$ is the task reward obtained by running the model parameterized by $\theta$. For a Gaussian search distribution $\pi_\psi = \mathcal{N}(\mu, \sigma^2 I)$ with fixed isotropic covariance $\sigma^2 I$, the search parameter $\psi$ reduces to the mean $\mu$. The gradient of $J$ with respect to $\mu$ can be estimated via the log-likelihood ratio trick (also called the REINFORCE estimator):

$$\nabla_\mu J = \frac{1}{\sigma} \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\bigl[R(\mu + \sigma \varepsilon) \cdot \varepsilon\bigr]$$

where $\varepsilon \sim \mathcal{N}(0, I)$ is a standard normal perturbation vector of the same dimensionality $d$ as $\mu$ (i.e., the number of model parameters). Approximating the expectation with $N$ Monte Carlo samples yields the update rule:

$$\mu \leftarrow \mu + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \cdot \varepsilon_n$$

where $R_n = R(\mu + \sigma \varepsilon_n)$ is the reward from the $n$-th perturbation, $\varepsilon_n \sim \mathcal{N}(0, I)$, and $\alpha$ is the learning rate. Note that the paper absorbs the factor $1/\sigma$ into $\alpha$, which they term learning rate digestion — this reduces the hyperparameter count by one.

Variable definitions: $\mu \in \mathbb{R}^d$ is the current model parameter vector (the "center" of the search distribution); $d$ is the total number of parameters (0.5B–8B); $\sigma$ is the noise scale (fixed at 0.001); $\alpha$ is the learning rate (set to $5 \times 10^{-4}$); $N = 30$ is the population size; $R_n \in \mathbb{R}$ is the scalar reward for the $n$-th perturbation; and $\varepsilon_n \in \mathbb{R}^d$ is the $n$-th Gaussian noise vector.
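As a sanity check on the estimator above, the following minimal sketch compares the Monte Carlo estimate $\frac{1}{N\sigma}\sum_n R_n \varepsilon_n$ with the analytic gradient of a toy quadratic reward, then takes one update step with $1/\sigma$ absorbed into $\alpha$. The reward function, dimension, sample count, and step size are illustrative assumptions, not values from the paper; a large sample count is used only so that the estimate is visibly close to the true gradient.

```python
import math
import random

def reward(theta, target):
    """Toy reward (an assumption for illustration): negative squared distance."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))

def mean_reward_weighted_noise(mu, target, sigma, n_samples, seed=0):
    """Return (1/N) * sum_n R(mu + sigma * eps_n) * eps_n."""
    rng = random.Random(seed)
    d = len(mu)
    acc = [0.0] * d
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        r = reward([m + sigma * e for m, e in zip(mu, eps)], target)
        for i in range(d):
            acc[i] += r * eps[i]
    return [a / n_samples for a in acc]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

mu = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
sigma = 0.5

raw = mean_reward_weighted_noise(mu, target, sigma, n_samples=20000)
est_grad = [v / sigma for v in raw]                    # estimator with the 1/sigma factor
true_grad = [-2.0 * (m - g) for m, g in zip(mu, target)]  # analytic gradient of the toy reward
similarity = cosine(true_grad, est_grad)
print("cosine similarity:", round(similarity, 3))

# Update with 1/sigma absorbed into the learning rate:
# mu <- mu + alpha * (1/N) * sum_n R_n * eps_n
alpha = 0.05
mu_next = [m + alpha * v for m, v in zip(mu, raw)]
```

With only $N = 30$ samples the same estimate is far noisier, which is one reason the reward normalization described in Section 30.3.3 matters in practice.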

30.3.2 Simplifications Relative to Standard NES

The paper deliberately strips NES down to its minimal form, removing enhancements that are standard in the ES literature:

| Standard Enhancement | Included? | Rationale (per paper) |
| --- | --- | --- |
| Covariance matrix adaptation (CMA) | No | Full covariance is $O(d^2)$; intractable at $d = 8 \times 10^9$ |
| Rank transformation of rewards | No | Isolates core algorithm performance |
| Mirrored sampling (antithetic pairs) | No | Simplifies implementation and analysis |
| Weight decay | No | Avoids interference with controlled experiments |
| Adam-style optimizer for update | No | Uses simple SGD-style update |

The paper explicitly notes: "This design choice isolates the core ES algorithm and demonstrates that strong performance can be achieved without auxiliary enhancements." This minimalism is itself a contribution — it shows that raw ES outperforms heavily tuned RL.

30.3.3 Seven Implementation Innovations

While the algorithm is minimal, seven engineering innovations are required to make it tractable at billion-parameter scale. These are documented in the paper and implemented in the repository:

Innovation 1: Seed-Based Noise Storage

Naively storing $N = 30$ noise vectors of dimension $d = 8 \times 10^9$ would require $30 \times 8 \times 10^9 \times 4 \text{ bytes} = 960$ GB — clearly infeasible. Instead, the system stores only $N$ random seeds (integers). Each seed deterministically generates the same noise vector via torch.randn with a fixed generator state. This reduces storage from 960 GB to 240 bytes.
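The storage arithmetic and the regeneration property can be checked with a few lines. This is a minimal sketch: Python's `random.Random` stands in for a seeded `torch.Generator`, and the 8-byte seed size is an assumption consistent with the 240-byte figure above.

```python
import random

N = 30                 # population size
d = 8 * 10**9          # parameters in the largest model
BYTES_FP32 = 4
BYTES_SEED = 8         # assumption: each seed stored as a 64-bit integer

naive_bytes = N * d * BYTES_FP32
seed_bytes = N * BYTES_SEED
print(f"explicit noise: {naive_bytes / 10**9:.0f} GB, seeds: {seed_bytes} bytes")
# prints: explicit noise: 960 GB, seeds: 240 bytes

def noise_from_seed(seed, size):
    """Regenerate the same Gaussian noise vector from an integer seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

first = noise_from_seed(1234, size=5)
again = noise_from_seed(1234, size=5)   # identical to `first`
other = noise_from_seed(1235, size=5)   # a different seed gives different noise
```

The trade-off is compute for memory: each noise vector is generated again whenever it is needed instead of being kept resident.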

Innovation 2: Layer-Level In-Place Perturbation

Even generating the full noise vector for a single perturbation (32 GB for 8B parameters in float32) is prohibitive. The system instead iterates over model layers, generating and applying noise one layer at a time. Peak memory is bounded by the size of the largest single layer (typically 0.1–2 GB), not the full model.
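A minimal sketch of the perturb-and-restore cycle, using plain Python lists as stand-ins for layer tensors (the layer names and values are hypothetical). One generator per seed is advanced through the layers in a fixed order, so the restore pass reproduces exactly the noise that the perturb pass added.

```python
import random

def apply_noise(layers, seed, sigma, sign=1.0):
    """Add (sign=+1) or remove (sign=-1) seeded noise in place, one layer at a time."""
    rng = random.Random(seed)          # one stream per seed, advanced layer by layer
    for name in layers:                # dicts preserve insertion order
        weights = layers[name]
        # Only this layer's noise exists in memory at any moment.
        noise = [rng.gauss(0.0, 1.0) for _ in range(len(weights))]
        for i, eps in enumerate(noise):
            weights[i] += sign * sigma * eps

layers = {"embed": [0.10, -0.20, 0.30], "mlp": [0.05, 0.00], "head": [1.50]}
original = {name: list(w) for name, w in layers.items()}

apply_noise(layers, seed=7, sigma=0.001)              # perturb
perturbed = {name: list(w) for name, w in layers.items()}
apply_noise(layers, seed=7, sigma=0.001, sign=-1.0)   # restore with the same seed

max_drift = max(abs(a - b) for n in layers for a, b in zip(layers[n], original[n]))
print("max restore error:", max_drift)
```

Restoration by subtraction is exact only up to floating-point rounding, so a small residual drift after the round trip is expected.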

Innovation 3: Decomposed Parameter Update

The standard update $\Delta\theta = \alpha \cdot \frac{1}{N} \sum_{n} \tilde{R}_n \cdot \varepsilon_n$ would require materializing the full $d$-dimensional update vector. The decomposed version iterates over layers and seeds, accumulating updates in-place:

$$\theta_\ell \leftarrow \theta_\ell + \frac{\alpha}{N} \sum_{n=1}^{N} \tilde{R}_n \cdot \varepsilon_{\ell,n} \quad \text{for each layer } \ell$$

where $\varepsilon_{\ell,n}$ is the noise for layer $\ell$ regenerated from seed $s_n$. This exploits the linearity of addition — the order of summation is immaterial.
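The equivalence between the decomposed and the fully materialized update can be illustrated as follows. This sketch derives an independent generator for each (seed, layer) pair so that $\varepsilon_{\ell,n}$ can be regenerated in any order; that sub-seed scheme is an assumption made for the example and is not taken from the repository.

```python
import random

def layer_noise(seed, layer_index, size):
    """Noise for one (seed, layer) pair; the sub-seed derivation is an assumption."""
    rng = random.Random(seed * 1_000_003 + layer_index)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

def decomposed_update(layers, seeds, norm_rewards, alpha):
    """theta_l += (alpha / N) * sum_n R~_n * eps_{l,n}, applied layer by layer in place."""
    n = len(seeds)
    for idx, weights in enumerate(layers):
        for seed, r in zip(seeds, norm_rewards):
            eps = layer_noise(seed, idx, len(weights))   # one layer-sized buffer
            for i in range(len(weights)):
                weights[i] += alpha / n * r * eps[i]

def full_update(layers, seeds, norm_rewards, alpha):
    """Reference: materialize the whole update first, then apply it to a copy."""
    n = len(seeds)
    delta = [[0.0] * len(w) for w in layers]
    for seed, r in zip(seeds, norm_rewards):
        for idx, w in enumerate(layers):
            eps = layer_noise(seed, idx, len(w))
            for i in range(len(w)):
                delta[idx][i] += alpha / n * r * eps[i]
    return [[x + dx for x, dx in zip(w, dw)] for w, dw in zip(layers, delta)]

seeds = [11, 22, 33]
norm_rewards = [1.2, -0.3, -0.9]
theta = [[0.1, 0.2], [0.3, 0.4, 0.5]]

expected = full_update(theta, seeds, norm_rewards, alpha=5e-4)
decomposed_update(theta, seeds, norm_rewards, alpha=5e-4)
max_diff = max(abs(a - b) for w, e in zip(theta, expected) for a, b in zip(w, e))
print("max difference:", max_diff)
```

The two results agree up to floating-point rounding, while the decomposed version never holds more than one layer-sized noise buffer.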

Innovations 4–7

Innovation 4 (Z-score reward normalization): Rewards are normalized within each iteration as $\tilde{R}_n = (R_n - \bar{R}) / \text{std}(R)$, where $\bar{R}$ and $\text{std}(R)$ are the mean and standard deviation across the $N$ rewards. This ensures consistent gradient magnitudes across iterations and tasks.

Innovation 5 (Greedy decoding): All evaluations use temperature 0 (greedy), eliminating sampling noise and ensuring that any performance difference is attributable solely to the parameter perturbation.

Innovation 6 (Parallel evaluation): The $N$ perturbations are embarrassingly parallel: each GPU evaluates a subset of seeds independently, with only scalar rewards communicated.

Innovation 7 (Learning rate digestion): The $1/\sigma$ factor from the NES gradient estimator is absorbed into $\alpha$, eliminating a redundant hyperparameter.
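The normalization step can be sketched in a few lines. The small `eps` guard and the use of the sample standard deviation mirror the pseudocode in Section 30.3.4 (`torch.Tensor.std()` is unbiased by default); the reward values are made up for illustration.

```python
import statistics

def z_score_normalize(rewards, eps=1e-8):
    """Normalize rewards within one iteration: (R_n - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)   # sample standard deviation (divides by N - 1)
    return [(r - mean) / (std + eps) for r in rewards]

rewards = [0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
normalized = z_score_normalize(rewards)
print([round(x, 3) for x in normalized])

# If every perturbation earns the same reward, all weights are zero,
# so that iteration leaves the parameters unchanged.
flat = z_score_normalize([0.5] * 8)
```

Because the normalized rewards sum to zero, perturbations that score below the iteration mean push the parameters away from their noise direction rather than merely contributing less.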

30.3.4 Algorithm Pseudocode

The following pseudocode is a faithful simplification of the implementation in countdown/es_fine-tuning_countdown.py from the repository:

# Pseudocode faithful to: github.com/VsonicV/es-fine-tuning-paper
# File: countdown/es_fine-tuning_countdown.py

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def es_fine_tune(model, tokenizer, train_data, N=30, sigma=0.001,
                 lr=5e-4, T=1000):
    """ES fine-tuning loop — single hyperparameter set for all models."""
    for iteration in range(T):
        seeds = [torch.randint(0, 2**32, (1,)).item() for _ in range(N)]
        rewards = []

        # Inner loop: evaluate N perturbations
        for seed in seeds:
            # Perturb model in-place, layer by layer
            rng = torch.Generator(device=model.device).manual_seed(seed)
            for param in model.parameters():
                noise = torch.randn(param.shape, generator=rng,
                                    device=param.device)
                param.data.add_(noise, alpha=sigma)

            # Greedy inference and reward computation
            reward = evaluate_on_batch(model, tokenizer, train_data)
            rewards.append(reward)

            # Restore model in-place using same seed
            rng = torch.Generator(device=model.device).manual_seed(seed)
            for param in model.parameters():
                noise = torch.randn(param.shape, generator=rng,
                                    device=param.device)
                param.data.sub_(noise, alpha=sigma)

        # Z-score normalize rewards
        rewards_t = torch.tensor(rewards)
        normalized = (rewards_t - rewards_t.mean()) / (rewards_t.std() + 1e-8)

        # Decomposed update: one generator per seed, advanced through the
        # parameters in the same order used for perturbation, so each
        # parameter receives exactly the noise it was evaluated with
        for seed, r_norm in zip(seeds, normalized):
            rng = torch.Generator(device=model.device).manual_seed(seed)
            for param in model.parameters():
                noise = torch.randn(param.shape, generator=rng,
                                    device=param.device)
                param.data.add_(noise, alpha=lr * r_norm.item() / N)

    return model

Implementation note: The actual repository code iterates over named model layers rather than raw parameters, and includes multi-GPU distribution via HuggingFace Accelerate. The decomposed update in the real code regenerates noise from seeds for each layer in sequence to maintain RNG state consistency. The accelerated version (es_fine-tuning_countdown_accl.py) replaces HuggingFace inference with vLLM engines for a reported 10× speed-up.

30.3.5 Memory-Efficient Perturbation: Detailed Analysis

The GPU memory comparison between RL and ES for an 8B-parameter model in bf16 precision is striking. The following table is derived from the paper's analysis:

| Memory Component | RL (PPO/GRPO) | ES (This Paper) |
| --- | --- | --- |
| Model parameters (bf16) | ~16 GB | ~16 GB |
| Gradient buffers (fp32) | ~32 GB | 0 |
| Optimizer states (Adam: $m$, $v$) | ~64 GB | 0 |
| Activation cache (backprop) | ~8–32 GB | 0 |
| Reference model (KL penalty) | ~16 GB | 0 |
| Layer-sized noise tensor (temp) | 0 | ~0.1–2 GB |
| Inference KV cache | Included above | ~2–4 GB |
| Total | ~136–160 GB | ~18–22 GB |

The approximately 7–8× memory reduction enables fine-tuning an 8B model on a single 80 GB GPU, whereas RL methods require 2–4 GPUs of the same capacity. This advantage scales favorably with model size — RL's overhead grows linearly with parameter count (gradient and optimizer state), while ES's overhead remains bounded by the largest layer.
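The totals in the table follow from simple per-parameter byte counts. The sketch below recomputes them, assuming 2 bytes per bf16 weight and 4 bytes per fp32 value; the activation, KV-cache, and noise-tensor ranges are taken directly from the table, not derived.

```python
GB = 10**9
params = 8 * 10**9

weights_bf16 = params * 2 / GB        # 2 bytes per bf16 weight
grads_fp32 = params * 4 / GB          # 4 bytes per fp32 gradient
adam_states = 2 * params * 4 / GB     # first and second moments in fp32

# (low, high) ranges in GB
rl = {
    "weights": (weights_bf16, weights_bf16),
    "gradients": (grads_fp32, grads_fp32),
    "adam": (adam_states, adam_states),
    "activations": (8, 32),
    "reference_model": (weights_bf16, weights_bf16),
}
es = {
    "weights": (weights_bf16, weights_bf16),
    "layer_noise": (0.1, 2),
    "kv_cache": (2, 4),
}

def total(budget):
    """Sum the low ends and the high ends of every component range."""
    return (sum(lo for lo, _ in budget.values()),
            sum(hi for _, hi in budget.values()))

rl_total, es_total = total(rl), total(es)
ratio_at_low_end = rl_total[0] / es_total[0]
ratio_at_high_end = rl_total[1] / es_total[1]
print(rl_total, es_total, round(ratio_at_low_end, 1), round(ratio_at_high_end, 1))
```

Both ends of the range give a ratio between 7 and 8, matching the reduction quoted above.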

30.4 Key Results

30.4.1 Countdown Task: ES Outperforms All RL Baselines

The headline result is on the Countdown symbolic reasoning task (Gandhi et al., 2024), where a model must combine given numbers using arithmetic operations to reach a target value. ES outperforms all RL baselines on every tested model, using a single hyperparameter configuration:

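| Base Model | Params | Base | Best RL | Best RL Method | ES | ES Δ vs Best RL |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-0.5B-Instruct | 0.5B | 0.1% | 13.5% | Dr.GRPO | 14.4% | +0.9 |
| Qwen-2.5-1.5B-Instruct | 1.5B | 0.7% | 31.0% | Dr.GRPO | 37.3% | +6.3 |
| Qwen-2.5-3B-Instruct | 3B | 10.0% | 43.8% | Dr.GRPO | 60.5% | +16.7 |
| Qwen-2.5-7B-Instruct | 7B | 31.2% | 57.5% | Dr.GRPO | 66.8% | +9.3 |
| Llama-3.2-1B-Instruct | 1B | 0.4% | 14.9% | GRPO-v | 16.8% | +1.9 |
| Llama-3.2-3B-Instruct | 3B | 3.2% | 47.8% | Dr.GRPO | 51.6% | +3.8 |
| Llama-3.1-8B-Instruct | 8B | 8.1% | 51.3% | GRPO-z 30 | 61.2% | +9.9 |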

Source: Table 1 of Qiu et al. (2026). All ES experiments use N=30, σ=0.001, α=5×10⁻⁴. RL baselines include PPO, GRPO (variants v, z), and Dr.GRPO, each with per-model hyperparameter tuning. Training set: 200 sampled Countdown problems.

Three observations warrant emphasis. First, ES wins on every model: no RL variant beats ES on any model size or family. Second, the largest gains occur on mid-sized models: Qwen-2.5-3B shows a +16.7 percentage-point improvement over the best RL method. Third, ES achieves this with a single hyperparameter configuration across all seven models, while RL requires separate per-model sweeps.

30.4.2 Reward Hacking Resistance

The conciseness fine-tuning task reveals a qualitative behavioral difference between ES and RL. When fine-tuned for concise answers to knowledge questions (composite reward: correctness × conciseness penalty), GRPO degenerates to producing single-token or very short incoherent answers — maximizing the conciseness component by eliminating content. ES, by contrast, produces genuinely concise but coherent and correct responses.

The paper's explanation is mechanistic: ES optimizes a distribution of nearby parameter vectors (the Gaussian perturbation cloud), not a single point. For a reward-hacking behavior to persist under ES optimization, it must be robust to random Gaussian perturbation of all parameters — a much harder condition for degenerate behaviors to satisfy than the single-policy optimization that RL performs.

30.4.3 Cross-Run Stability

ES shows dramatically lower variance across independent runs with identical hyperparameters. The paper reports (on the Countdown task) standard deviations of ±1–3% for ES versus ±5–10% for GRPO and ±3–7% for Dr.GRPO. This stability has direct cost implications: if RL requires 3–5 runs for hyperparameter search plus 2–3 runs for reliability assessment, while ES requires a single run, the total compute for ES may be lower despite its per-iteration expense.

30.4.4 Math Reasoning Benchmarks

On standard math benchmarks (GSM8K, MATH500, Minerva Math, OlympiadBench), the paper reports that ES performs comparably to the best RL methods (GRPO, Dr.GRPO, DAPO), typically ranking in the top 2–3 across benchmarks. The paper emphasizes that the advantages of ES — robustness, stability, no reward hacking — do not come at the cost of reduced performance on well-studied tasks. Specific per-benchmark numbers are reported as competitive rather than as clear wins, distinguishing these results from the Countdown headline.

30.4.5 Six Systematic Advantages

The paper identifies six systematic advantages of ES over RL for LLM fine-tuning, supported by the experimental evidence:

  1. Long-horizon reward tolerance. ES requires only response-level (outcome) rewards, not token-level credit assignment. For reasoning tasks where only the final answer is graded, ES avoids the credit assignment problem entirely.
  2. Small populations in high-dimensional spaces. $N = 30$ suffices for spaces with billions of dimensions — overturning the assumption that population size must be proportional to dimensionality.
  3. Cross-model robustness. A single configuration works across Qwen-2.5 and Llama-3.x families (0.5B–8B). RL methods fail on some models, particularly smaller ones.
  4. Reward hacking resistance. Distributional optimization is harder to hack than single-solution optimization.
  5. Cross-run stability. Consistent results across runs; RL is often unstable.
  6. Memory efficiency. Inference-only operation eliminates gradient, optimizer, and activation storage.

30.5 Implementation Details

30.5.1 Code and Reproducibility

The full source code is available at github.com/VsonicV/es-fine-tuning-paper (CC BY-NC-SA 4.0 license). The repository structure is:

| File | Purpose |
| --- | --- |
| es_fine-tuning_conciseness.py | Conciseness task, correlated noise |
| es_fine-tuning_conciseness_iid.py | Conciseness task, i.i.d. noise |
| countdown/es_fine-tuning_countdown.py | Countdown task, correlated noise |
| countdown/es_fine-tuning_countdown_iid.py | Countdown task, i.i.d. noise |
| es_fine-tuning_countdown_accl.py | Accelerated version (vLLM, 10× speed-up) |
| requirement.txt | Python dependencies |

All models are publicly available on HuggingFace (Qwen-2.5 and Llama-3.x families). Benchmarks use public datasets (Countdown, GSM8K, MATH500). The fixed hyperparameters ($N = 30$, $\sigma = 0.001$, $\alpha = 5 \times 10^{-4}$) are reported for all ES experiments. Seed control is partial: seeds are used for noise generation but not all sources of randomness are documented.

30.5.2 Execution Commands

The following commands are documented in the repository README:

# From repo: github.com/VsonicV/es-fine-tuning-paper
# Standard version with HuggingFace Accelerate (4 GPUs)
accelerate launch \
    --num_processes 4 \
    --num_machines 1 \
    countdown/es_fine-tuning_countdown.py \
    --data_sample 200 \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --gpu_threads 1

# Accelerated version with vLLM (4 GPUs, ~10x faster)
python es_fine-tuning_countdown_accl.py \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --cuda_devices 0,1,2,3 \
    --num_engines 4 \
    --population_size 30 \
    --num_iterations 1000

30.5.3 Compute Costs

The paper provides sufficient detail to estimate training costs. For a typical Countdown experiment (1000 iterations, Qwen-2.5-3B, 4 × H100 GPUs):

| Version | Time per Iteration | Total Time | GPU-Hours | Estimated Cloud Cost |
| --- | --- | --- | --- | --- |
| Original (Accelerate) | ~2 min | ~33 hours | ~132 H100-hrs | ~$400–$500 |
| Accelerated (vLLM) | ~12 sec | ~3.3 hours | ~13 H100-hrs | ~$40–$50 |

Provenance: Per-iteration timings are from the paper. Cloud costs are author estimates based on typical H100 rental rates (~$3/GPU-hr) and should be treated as approximate.
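The table entries can be reproduced from the per-iteration timings. This is a back-of-the-envelope sketch using the ~$3/GPU-hr rate quoted above, which yields the lower end of each cost range.

```python
def training_cost(iterations, seconds_per_iteration, num_gpus, usd_per_gpu_hour):
    """Return (wall-clock hours, GPU-hours, estimated cost in USD)."""
    hours = iterations * seconds_per_iteration / 3600
    gpu_hours = hours * num_gpus
    return hours, gpu_hours, gpu_hours * usd_per_gpu_hour

original = training_cost(1000, 120, num_gpus=4, usd_per_gpu_hour=3.0)
accelerated = training_cost(1000, 12, num_gpus=4, usd_per_gpu_hour=3.0)

for name, (hours, gpu_hours, usd) in [("original", original),
                                      ("accelerated", accelerated)]:
    print(f"{name}: {hours:.1f} h, {gpu_hours:.0f} GPU-h, ${usd:.0f}")
```

The unrounded figure for the original version is about 133 GPU-hours; the table's ~132 comes from rounding the wall-clock time to 33 hours before multiplying by four GPUs.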

The cost advantage over RL is amplified by two factors. First, ES requires no hyperparameter sweeps — one configuration works for all models, while RL typically requires 3–5 sweeps per model. Second, ES's lower memory footprint enables using fewer or smaller GPUs. The paper estimates total RL costs at $500–$2,000 per model (including sweeps) versus $40–$50 for accelerated ES.

30.5.4 Accelerated Architecture

The accelerated version replaces HuggingFace's generate() with vLLM inference engines:

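# Pseudocode reflecting: es_fine-tuning_countdown_accl.py
# Uses vLLM for high-throughput inference with continuous batching.
# generate_seeds, select_engine, perturb_vllm_weights, restore_vllm_weights,
# compute_reward, z_score_normalize, decomposed_update, and broadcast_weights
# are placeholder helpers for this sketch, not vLLM APIs.

from vllm import LLM, SamplingParams

def accelerated_es_fine_tune(model_name, prompts, cuda_devices, num_engines,
                              population_size=30, num_iterations=1000):
    """Accelerated ES using vLLM engines for ~10x inference speed-up."""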

    # Initialize vLLM engines, one per GPU
    engines = []
    for device_id in cuda_devices[:num_engines]:
        engine = LLM(model=model_name,
                     tensor_parallel_size=1,
                     gpu_memory_utilization=0.9)
        engines.append(engine)

    sampling_params = SamplingParams(
        temperature=0.0,  # Greedy decoding — deterministic
        max_tokens=512
    )

    for iteration in range(num_iterations):
        seeds = generate_seeds(population_size)
        rewards = []

        # Distribute perturbations across engines
        for seed in seeds:
            engine = select_engine(engines)  # Round-robin or least-loaded
            perturb_vllm_weights(engine, seed, sigma=0.001)
            outputs = engine.generate(prompts, sampling_params)
            reward = compute_reward(outputs)
            rewards.append(reward)
            restore_vllm_weights(engine, seed, sigma=0.001)

        # Normalize and update (same decomposed strategy)
        normalized = z_score_normalize(rewards)
        decomposed_update(engines[0].model, seeds, normalized,
                          lr=5e-4, N=population_size)

        # Sync updated weights across engines
        broadcast_weights(engines)

The key speed-up comes from vLLM's continuous batching and PagedAttention — optimizations for inference throughput that are irrelevant during gradient-based training but transform ES, which is purely inference-based.

30.6 Why Small Populations Work

The most theoretically surprising finding is that $N = 30$ suffices for spaces with billions of dimensions. The paper does not provide a rigorous theoretical explanation, but discusses several hypotheses. We examine each.

30.6.1 Effective Dimensionality

LLM parameter spaces are highly structured. Pre-training creates strong correlations between parameters — most directions in parameter space have minimal impact on model output. The effective dimensionality (the number of directions that meaningfully affect behavior) is likely orders of magnitude smaller than the nominal parameter count. If the effective dimensionality $d_{\text{eff}} \ll d$, then $N = 30$ may provide adequate coverage of the relevant subspace even if it is far too small for uniform coverage of $\mathbb{R}^d$.
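The intuition can be illustrated with a toy experiment in which only a handful of coordinates affect the reward. Everything here (the reward function, the dimensions, and the hyperparameters) is an assumption chosen to keep the example small; it shows the mechanism of the hypothesis and is not evidence about LLM loss landscapes.

```python
import random
import statistics

D, D_EFF = 200, 3                       # nominal vs. effective dimensionality (toy values)
N, SIGMA, ALPHA, T = 30, 0.1, 0.05, 60  # toy hyperparameters, not the paper's

def reward(theta):
    """Only the first D_EFF coordinates affect the reward; the rest are inert."""
    return -sum((theta[i] - 1.0) ** 2 for i in range(D_EFF))

def es_step(mu, rng):
    """One ES iteration with z-score-normalized rewards."""
    noises, rewards = [], []
    for _ in range(N):
        eps = [rng.gauss(0.0, 1.0) for _ in range(D)]
        noises.append(eps)              # kept in memory only because this toy is tiny
        rewards.append(reward([m + SIGMA * e for m, e in zip(mu, eps)]))
    mean, std = statistics.fmean(rewards), statistics.stdev(rewards)
    weights = [(r - mean) / (std + 1e-8) for r in rewards]
    return [m + ALPHA / N * sum(w * eps[i] for w, eps in zip(weights, noises))
            for i, m in enumerate(mu)]

rng = random.Random(0)
mu = [0.0] * D
initial_reward = reward(mu)
for _ in range(T):
    mu = es_step(mu, rng)
final_reward = reward(mu)
print(f"reward: {initial_reward:.3f} -> {final_reward:.3f}")
```

The 197 inert coordinates receive noisy updates but never influence the reward, so a population of 30 only has to resolve the direction within the three coordinates that matter.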

30.6.2 Pre-Training as Initialization

ES is not searching from scratch — it fine-tunes from a pre-trained model that already possesses the right structure for language tasks. The pre-trained initialization places $\mu$ in a region of parameter space where small perturbations (radius $\sigma = 0.001$) produce semantically meaningful behavioral changes. This is fundamentally different from the random-initialization setting analyzed by Vemula et al. (2019).

30.6.3 Binary Reward Signal Strength

For the Countdown task, rewards are binary (correct/incorrect). With $N = 30$ perturbations and any non-trivial success rate, the Z-score normalization produces a clear signal: perturbations that led to correct answers receive positive weight, others receive negative weight. The binary partition is informative regardless of the dimensionality of the perturbation.
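A short worked calculation makes the binary case concrete. It assumes the population (divide-by-$N$) standard deviation, and the choice of $k = 6$ successes is purely illustrative. If $k$ of the $N$ perturbations earn reward 1 and the rest earn 0, then with $p = k/N$:

$$\bar{R} = p, \qquad \text{std}(R) = \sqrt{p(1-p)}, \qquad \tilde{R}_n = \begin{cases} \sqrt{(1-p)/p} & \text{if } R_n = 1 \\ -\sqrt{p/(1-p)} & \text{if } R_n = 0 \end{cases}$$

For $N = 30$ and $k = 6$ ($p = 0.2$), each successful perturbation receives weight $+2$ and each failed one receives $-0.5$, so the weights sum to $6 \cdot 2 - 24 \cdot 0.5 = 0$. With the sample (divide-by-$N-1$) standard deviation, both weights are scaled by $\sqrt{(N-1)/N} \approx 0.983$. When $k = 0$ or $k = N$ the standard deviation is zero and the iteration carries no signal.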

30.6.4 Greedy Decoding as Amplifier

Without sampling noise, even small parameter changes produce detectable output differences. Stochastic decoding would mask the perturbation signal with sampling variance, potentially requiring larger populations to achieve the same signal-to-noise ratio. Greedy decoding makes each perturbation maximally informative.

These hypotheses remain to be formalized. The gap between the $O(d^2)$ sample-complexity scaling argued by Vemula et al. (2019) and the empirical success at $N = 30$ represents an open theoretical question for the field.

30.7 Comparative Analysis

30.7.1 ES vs. RL: Systematic Comparison

| Dimension | RL (PPO/GRPO/Dr.GRPO) | ES (This Paper) |
| --- | --- | --- |
| Optimization space | Action space (token sequences) | Parameter space (model weights) |
| Gradient computation | Backpropagation required | Zeroth-order (no backprop) |
| Credit assignment | Token-level or outcome-level | Outcome-level only |
| GPU memory (8B model) | ~136–160 GB | ~18–22 GB |
| Hyperparameter sensitivity | Very high (per-model tuning) | Low (single config for all models) |
| Cross-run stability | Low (±5–10%) | High (±1–3%) |
| Reward hacking | Susceptible | Resistant (distributional optimization) |
| Cross-model robustness | Variable (fails on some small models) | Consistent across all tested models |
| Per-iteration cost | Lower (single forward + backward) | Higher (N=30 forward passes) |
| Total cost (including sweeps) | Higher (multiple sweeps needed) | Lower (single config) |

30.7.2 ES vs. Other Zeroth-Order Methods

MeZO (Malladi et al., 2023) is a prior zeroth-order method for LLM fine-tuning that uses a single random perturbation per step (rather than a population). The paper positions ES favorably against MeZO:

| Method | Memory (8B) | Quality | Mechanism |
| --- | --- | --- | --- |
| MeZO | ~20 GB | Below RL baselines | Single perturbation, two-point estimator |
| ES (this paper) | ~18 GB | Exceeds RL baselines | Population of 30, Z-score normalization |

The population mechanism and reward normalization appear to be the critical differentiators that make ES succeed where single-perturbation zeroth-order methods fall short.

30.7.3 Position in the Evolutionary AI Landscape

ES at Scale occupies a unique position among the systems surveyed in this book. It is the only system that directly evolves model parameters at billion scale. The comparison with code-evolution systems is instructive:

[Figure: Position of ES at Scale in the evolutionary AI landscape. Axes: abstraction level of search space (continuous, structured, programmatic, symbolic) versus scale (parameters / complexity). Systems plotted: ES at Scale (0.5B–8B params), Sakana Merging (merge weights), AlphaEvolve (code evolution), FunSearch (function search), EvoPrompting (prompt evolution), OpenAI ES (2017, ~4M params).]

ES at Scale is positioned at the lowest abstraction level (continuous parameter vectors) and the highest scale (billions of parameters). This directness is both its strength — no information loss through code, prompt, or merge abstraction — and its constraint: exploration is local in parameter space, bounded by the noise radius $\sigma$.

30.8 Learning Dynamics and Convergence

30.8.1 Training Trajectories

The paper documents distinctive learning dynamics for ES compared to RL. ES exhibits slower but monotonically increasing accuracy curves without the plateaus, oscillations, or reward hacking collapses that characterize RL training. The predictability of ES learning curves enables better estimation of required compute budgets — a practical advantage for research labs managing GPU allocations.

30.8.2 Population Dynamics

Unlike population-based evolutionary systems (AlphaEvolve, OpenEvolve, GEPA) that maintain multiple diverse candidates, ES maintains a single model center $\mu$ with a transient Gaussian cloud around it. Individual perturbations are ephemeral — they exist only for evaluation and are discarded immediately after. Only the center $\mu$ persists across iterations. This is conceptually closer to stochastic gradient descent (where $\mu$ is the parameter vector and $\varepsilon$ provides stochastic exploration) than to a traditional evolutionary algorithm with a persistent population.

$$\mu_{t+1} = \mu_t + \frac{\alpha}{N} \sum_{n=1}^{N} \tilde{R}_n \cdot \varepsilon_n \qquad \text{where } \varepsilon_n \sim \mathcal{N}(0, I)$$

This single-center design trades diversity for memory efficiency. Maintaining 30 full copies of an 8B model would require ~480 GB — the transient perturbation approach requires ~18 GB.

30.8.3 Hyperparameter Sensitivity

The paper reports hyperparameter sensitivity analysis across several dimensions:

| Hyperparameter | ES Sensitivity | RL Sensitivity | Implication |
|---|---|---|---|
| Learning rate ($\alpha$) | Low | Very high | RL requires per-model tuning |
| Population / group size ($N$) | Low | Moderate | $N = 30$ works universally for ES |
| Noise scale ($\sigma$) | Moderate | N/A | ES-specific; $\sigma = 0.001$ is robust |
| KL penalty ($\beta$) | N/A | Very high | Wrong $\beta$ causes RL collapse or reward hacking |

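The noise scale $\sigma$ is the hyperparameter most specific to ES and the one to which ES is most sensitive. With perturbations of the form $\mu + \sigma\varepsilon$, $\varepsilon \sim \mathcal{N}(0, I)$, the value $\sigma = 0.001$ adds zero-mean Gaussian noise with an absolute standard deviation of 0.001 to every parameter, independent of that parameter's magnitude: small enough to maintain model coherence, large enough to produce detectable behavioral changes under greedy decoding.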

30.9 Limitations and Open Questions

The paper acknowledges several limitations that bound the generality of its claims:

Scale ceiling unknown. The largest model tested is 8B parameters. Whether ES remains effective at 70B or 405B is an open question. The effective dimensionality argument (Section 30.6.1) suggests it might, but empirical validation is needed.

Convergence speed. ES is slower per iteration than RL for some tasks. Each iteration requires $N = 30$ full forward passes, whereas RL requires one forward pass plus one backward pass. The paper argues that total cost is competitive when accounting for RL's hyperparameter sweeps, but this depends on the task and model.

Fixed exploration radius. The noise scale $\sigma$ is fixed throughout training. Adaptive schemes like CMA-ES (which adapts the full covariance matrix) could improve performance but are computationally intractable at billion-parameter scale in their standard form. Low-rank covariance approximations might bridge this gap.
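
A rough count shows why the full covariance matrix is out of reach. The sketch below is illustrative arithmetic only: the "diagonal plus rank-$r$" form and the rank of 16 are assumptions chosen for the example, not schemes proposed in the paper, and the function names are hypothetical.

```python
def full_covariance_entries(d):
    """Unique entries of a symmetric d x d covariance matrix."""
    return d * (d + 1) // 2


def low_rank_entries(d, rank):
    """Entries of an assumed 'diagonal plus rank-r' approximation:
    d diagonal terms plus a (d x rank) factor matrix."""
    return d + d * rank


d = 8_000_000_000  # 8B parameters, the largest model discussed
rank = 16          # hypothetical rank, for illustration only

full = full_covariance_entries(d)
low = low_rank_entries(d, rank)

print(f"full covariance entries:    {full:.3e}")
print(f"diag + rank-{rank} entries:    {low:.3e}")
print(f"reduction factor:           {full / low:.3e}")
```

Even at one byte per entry, the full matrix would need on the order of $10^{19}$ bytes, while the assumed low-rank form stays within a small multiple of the model's own size.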

No theoretical guarantees. The empirical success at $N = 30$ with $d = 8 \times 10^9$ is not theoretically explained. The analysis of Vemula et al. (2019) predicts poor performance at this scale. The paper falsifies this prediction empirically but does not provide an alternative theoretical framework.

Limited to post-training. All experiments start from pre-trained models. Whether ES could scale to pre-training from scratch — where the initialization argument (Section 30.6.2) does not apply — remains untested.

No cross-task transfer. Each experiment fine-tunes for a single task from the pre-trained model. Sequential fine-tuning, multi-task rewards, and curriculum learning are not explored.

Noise variant impact unclear. The repository provides both correlated and i.i.d. noise variants. The paper states both achieve similar results, but the theoretical implications of correlated noise in billion-dimensional spaces are not analyzed.

30.10 Implications for the Field

30.10.1 Neuroevolution Revived

This paper revives direct parameter-space optimization of neural networks, a research direction largely set aside for large models since the early 2010s, when gradient-based methods became dominant. By demonstrating that ES works at billion-parameter scale, it reopens investigation into other evolutionary approaches to neural network optimization: CMA-ES variants, novelty search in parameter space, quality-diversity methods, and hybrid gradient-evolutionary approaches.

30.10.2 Post-Training Paradigm

The paper positions ES as a third post-training paradigm alongside SFT and RL:

| Paradigm | Gradient | Requires | Exploration |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Yes (backprop) | Labeled data | None (deterministic) |
| RL (PPO, GRPO, DPO, etc.) | Yes (backprop) | Reward model or rule | Action space (token sampling) |
| ES (this paper) | No (zeroth-order) | Reward function only | Parameter space (Gaussian) |

30.10.3 Practical Applications

The paper suggests several applications where ES may be preferable to RL: (1) RLHF replacement, where reward hacking resistance is particularly valuable for alignment; (2) code generation fine-tuning, where passing test cases provides a natural binary outcome reward; (3) distributed fine-tuning, where ES's embarrassingly parallel nature and scalar-only communication make it well suited to multi-node or multi-datacenter deployment; and (4) safety alignment, where resistance to reward function exploitation is a critical property.
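
The scalar-only communication in point (3) follows from seed-based noise: if every node can regenerate a perturbation from its seed, workers only need to report a seed and a reward. The sketch below illustrates that idea under assumptions; the function names, the toy reward, and the in-process "nodes" are hypothetical and do not reflect the repository's actual interface.

```python
import numpy as np


def noise_from_seed(seed, dim):
    """Regenerate a perturbation deterministically from its integer seed."""
    return np.random.default_rng(seed).standard_normal(dim)


def worker_evaluate(mu, seed, reward_fn, sigma):
    """Evaluate one perturbed model; return only two scalars."""
    eps = noise_from_seed(seed, mu.size)
    return seed, float(reward_fn(mu + sigma * eps))


def apply_update(mu, results, alpha):
    """Rebuild the update locally from (seed, reward) pairs alone."""
    seeds = [seed for seed, _ in results]
    rewards = np.array([reward for _, reward in results])
    r_tilde = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    update = np.zeros_like(mu)
    for seed, weight in zip(seeds, r_tilde):
        update += weight * noise_from_seed(seed, mu.size)
    return mu + (alpha / len(seeds)) * update


# Toy setup: every value here is a demo assumption.
mu = np.zeros(1000)
reward_fn = lambda theta: -float(np.sum((theta - 0.5) ** 2))
results = [worker_evaluate(mu, seed, reward_fn, sigma=0.01) for seed in range(30)]

# Two nodes that receive the same 30 (seed, reward) pairs stay in sync
# without ever exchanging a parameter or noise vector.
node_a = apply_update(mu, results, alpha=0.05)
node_b = apply_update(mu, results, alpha=0.05)
print("nodes agree:", np.array_equal(node_a, node_b))
```

In this sketch each worker sends two numbers per perturbation regardless of model size, which is what keeps the communication cost independent of the parameter count.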

30.10.4 Impact Assessment

| Dimension | Assessment | Notes |
|---|---|---|
| Scientific novelty | Very High | First full-parameter ES at billion scale |
| Practical utility | High | Cheaper, more stable, no reward hacking |
| Reproducibility | High | Full code, public models, fixed hyperparameters |
| Generality | High | Works across model families, sizes, and tasks |
| Theoretical depth | Medium | Empirical strength, limited theoretical explanation |
| Community adoption | Growing | 340+ GitHub stars, ICML 2026 acceptance |

Source: Assessment categories and ratings are author analysis based on the evidence presented in the paper and repository metrics. Star counts are as reported in the source material (early 2026).

Summary

Key takeaway: Evolution Strategies can directly fine-tune billion-parameter LLMs with a population of just 30, outperforming reinforcement learning methods (PPO, GRPO, Dr.GRPO) on every tested model while using 7–8× less GPU memory and requiring no per-model hyperparameter tuning.

Main contribution: The paper overturns the widely held assumption that parameter-space exploration is intractable at modern model scales. Seven engineering innovations — seed-based noise, layer-level in-place perturbation, decomposed updates, Z-score normalization, greedy decoding, parallel evaluation, and learning rate digestion — collectively enable a minimal ES algorithm to operate in billion-dimensional spaces with remarkable efficiency.

What researchers should know: ES occupies a unique position in the evolutionary AI landscape as the only demonstrated method for direct, full-parameter evolutionary optimization of LLMs at scale. Its six systematic advantages over RL — long-horizon tolerance, small populations, cross-model robustness, reward hacking resistance, cross-run stability, and memory efficiency — make it a serious candidate for the post-training toolbox. The theoretical gap between the predicted $O(d^2)$ sample complexity and the observed $N = 30$ success remains the most important open question.