Evolution Strategies at Scale
Part P06: Evolutionary Scaling & Efficiency
Key Contribution
Evolution Strategies at Scale (Qiu et al., ICML 2026) is the first successful application of Evolution Strategies to full-parameter fine-tuning of billion-parameter LLMs without dimensionality reduction. Using a population of just 30, a fixed noise scale of $\sigma = 0.001$, and a single hyperparameter configuration across all models, ES outperforms PPO, GRPO, and Dr.GRPO on every tested model (0.5B–8B parameters) while requiring approximately 7–8× less GPU memory. The work overturns the widely held assumption — rooted in Vemula et al. (2019) — that parameter-space exploration is intractable at modern model scales, and positions ES as a viable backpropagation-free post-training paradigm alongside reinforcement learning.
30.1 Overview and Motivation
Post-training optimization of large language models has been dominated by reinforcement learning methods — PPO, GRPO, DPO, and their variants — since the success of RLHF in aligning models such as ChatGPT. These methods operate in action space: they treat token generation as a sequential decision process and optimize policies via gradient-based updates that require backpropagation through the entire model. This approach carries substantial computational overhead — gradient buffers, optimizer states, activation caches, and often a frozen reference model for KL regularization — and is notoriously sensitive to hyperparameter choices, frequently requiring per-model tuning sweeps.
Evolution Strategies (ES), by contrast, operate in parameter space. Rather than computing gradients through the model's computation graph, ES evaluates the model under random perturbations of its parameters and uses the resulting fitness signals to estimate a gradient of the expected reward. This zeroth-order approach requires only forward passes — no backpropagation, no gradient storage, no optimizer states. The idea is old: Natural Evolution Strategies (NES) were formalized by Wierstra et al. (2008, 2014), and OpenAI demonstrated ES as an alternative to RL for Atari and MuJoCo control tasks (Salimans et al., 2017) using populations of 10,000+ perturbations on models with roughly 4 million parameters.
The conventional wisdom, however, held that ES could not scale to modern deep learning. Vemula et al. (2019) argued that the sample complexity of parameter-space exploration grows as $O(d^2)$ with dimensionality $d$, rendering it intractable for billion-parameter models. Prior attempts to apply ES to neural networks remained limited to millions of parameters (Lehman et al., 2018; Zhang et al., 2017) or resorted to dimensionality reduction — optimizing only the final layer (Toledano-López et al., 2022) or LoRA adapters (Jin et al., 2024). The scaling barrier appeared fundamental.
Qiu et al. (2025, 2026) overturn this assumption. Their paper, Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning, demonstrates that ES with a population of just 30 (roughly 300 times smaller than the 10,000+ used by OpenAI ES) successfully fine-tunes LLMs with 0.5B to 8B parameters on reasoning, math, and behavioral tasks. ES outperforms all tested RL baselines (PPO, GRPO, Dr.GRPO) on every model, using a single hyperparameter configuration across all experiments. The work was published at ICML 2026, with code available at github.com/VsonicV/es-fine-tuning-paper (340+ stars as of early 2026).
30.1.1 Historical Context: From OpenAI ES to Billion-Parameter ES
The lineage of this work traces through several milestones in evolution strategies applied to neural networks:
| System | Year | Parameters | Population Size | Contribution |
|---|---|---|---|---|
| Salimans et al. (OpenAI ES) | 2017 | ~4M | 10,000+ | ES as RL alternative for control |
| Zhang et al. | 2017 | ~3M | 10,000+ | ES for RL policy optimization |
| Lehman et al. | 2018 | ~167K | Large | ES with novelty search |
| Jin et al. | 2024 | 1,600 (LoRA) | Small | ES on low-rank adapters only |
| Qiu et al. (this work) | 2025–2026 | 0.5B–8B | 30 | Full-parameter ES at billion scale |
The jump from millions to billions of parameters, combined with the reduction from 10,000+ to 30 perturbations per iteration, represents a qualitative shift. Both changes were individually assumed to be fatal to ES performance; together they should have been catastrophic. The empirical success of this combination is the paper's central surprise.
30.1.2 Team and Institutional Context
The paper originates from Cognizant AI Labs, led by Babak Hodjat (co-founder of Sentient Technologies, one of the largest AI startups focused on evolutionary computation) and Risto Miikkulainen (UT Austin professor, co-inventor of NEAT). The team occupies a distinctive niche in the evolutionary AI landscape: while Google DeepMind uses LLMs to evolve code (AlphaEvolve, FunSearch) and Sakana AI uses evolution to merge models, Cognizant directly evolves model parameters, the most classical form of neuroevolution, applied at unprecedented scale. Elliot Meyerson, a co-author, is notable for prior work on language model crossover (2024), a precursor to the LLM-as-evolutionary-operator paradigm.
30.2 Architecture
The ES fine-tuning system follows a clean outer-loop / inner-loop architecture. The outer loop iterates $T$ times. In each iteration, the inner loop evaluates $N = 30$ independent perturbations of the current model parameters, collects fitness scores, normalizes them, and applies a weighted update to the model center. The entire process operates on a single model instance, with perturbations applied and restored in-place.
30.2.1 Paradigm Distinction
It is essential to distinguish this system from the LLM-as-mutation-operator paradigm studied throughout most of this book. In systems like AlphaEvolve (Chapter 4), FunSearch (Chapter 9), and OpenEvolve (Chapter 5), the LLM is a frozen tool that generates candidate programs; evaluation scores those programs; and evolutionary operators recombine them. The LLM's parameters never change. In this paper, the LLM is the optimization target: ES perturbs its billions of floating-point weights, evaluates the resulting model on a task, and uses the fitness signal to shift the weight distribution. No code is generated; no programs are evolved. The search space is continuous $\mathbb{R}^d$ for $d$ up to 8 billion.
| Property | LLM-as-Operator (e.g., AlphaEvolve) | LLM-as-Target (ES at Scale) |
|---|---|---|
| Search space | Discrete (code, programs) | Continuous (model weights) |
| LLM role | Mutation/crossover operator (frozen) | Optimization target (modified) |
| Solution type | Programs, algorithms, heuristics | Model parameter vectors |
| Evaluation | Run generated code | Run perturbed model on task |
| Gradient use | None (black-box code eval) | None (zeroth-order) |
30.2.2 Component Inventory
The implementation (source: github.com/VsonicV/es-fine-tuning-paper) consists of the following components, each implemented as a section of a single-file Python script:
| Component | Function | Source File(s) |
|---|---|---|
| Noise Generator | Produces Gaussian perturbations via seeded PyTorch RNG | All es_fine-tuning_*.py |
| Layer-Level Perturbation | In-place add/subtract of noise, one layer at a time | All scripts |
| Reward Evaluator | Greedy decoding → parse → binary/composite reward | Task-specific |
| Reward Normalizer | Z-score normalization within each iteration | All scripts |
| Decomposed Updater | Layer × seed parameter update with minimal peak memory | All scripts |
| Parallelization Manager | Multi-GPU distribution via HuggingFace Accelerate or vLLM | accelerate / *_accl.py |
The repository provides two noise variants (correlated and i.i.d.) and an accelerated version using vLLM. The entire codebase is approximately 1,100–1,800 lines of Python, a notable testament to the algorithm's simplicity.
30.3 Core Algorithms
30.3.1 Natural Evolution Strategies: Theoretical Foundation
The algorithm is grounded in Natural Evolution Strategies (NES; Wierstra et al., 2008, 2014). Rather than optimizing a single parameter vector $\theta$, NES optimizes a search distribution $\pi_\psi(\theta)$ parameterized by $\psi$. The objective is to maximize the expected reward:
$$J(\psi) = \mathbb{E}_{\theta \sim \pi_\psi}\left[ R(\theta) \right]$$
where $R(\theta)$ is the task reward obtained by running the model parameterized by $\theta$. For a Gaussian search distribution $\pi_\psi = \mathcal{N}(\mu, \sigma^2 I)$ with fixed isotropic covariance $\sigma^2 I$, the search parameter $\psi$ reduces to the mean $\mu$. The gradient of $J$ with respect to $\mu$ can be estimated via the log-likelihood ratio trick (also called the REINFORCE estimator):
$$\nabla_\mu J(\mu) = \frac{1}{\sigma}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\left[ R(\mu + \sigma \varepsilon)\, \varepsilon \right]$$
where $\varepsilon \sim \mathcal{N}(0, I)$ is a standard normal perturbation vector of the same dimensionality $d$ as $\mu$ (i.e., the number of model parameters). Approximating the expectation with $N$ Monte Carlo samples yields the update rule:
$$\mu_{t+1} = \mu_t + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n\, \varepsilon_n$$
where $R_n = R(\mu_t + \sigma \varepsilon_n)$ is the reward from the $n$-th perturbation, $\varepsilon_n \sim \mathcal{N}(0, I)$, and $\alpha$ is the learning rate. Note that the paper absorbs the factor $1/\sigma$ into $\alpha$, which they term learning rate digestion; this removes a redundant scale factor from the update.
Variable definitions: $\mu \in \mathbb{R}^d$ is the current model parameter vector (the "center" of the search distribution); $d$ is the total number of parameters (0.5B–8B); $\sigma$ is the noise scale (fixed at 0.001); $\alpha$ is the learning rate (set to $5 \times 10^{-4}$); $N = 30$ is the population size; $R_n \in \mathbb{R}$ is the scalar reward for the $n$-th perturbation; and $\varepsilon_n \in \mathbb{R}^d$ is the $n$-th Gaussian noise vector.
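To make the log-likelihood-ratio estimator described above concrete, the following is a minimal sketch (not the paper's code) that checks it numerically on a toy linear reward whose exact gradient is known. The names `nes_gradient_estimate`, `linear_reward`, and `A`, and the toy values of `mu`, `sigma`, and the sample count, are illustrative assumptions; the standard-library `random` module stands in for `torch.randn`.

```python
import random


def nes_gradient_estimate(reward_fn, mu, sigma, num_samples, seed=0):
    """Monte Carlo estimate of (1/sigma) * E[R(mu + sigma*eps) * eps]."""
    rng = random.Random(seed)
    d = len(mu)
    grad = [0.0] * d
    for _ in range(num_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        theta = [m + sigma * e for m, e in zip(mu, eps)]
        r = reward_fn(theta)
        for i in range(d):
            grad[i] += r * eps[i]
    return [g / (num_samples * sigma) for g in grad]


# Toy linear reward R(theta) = A . theta, whose exact gradient is A.
A = [1.0, -2.0, 0.5]


def linear_reward(theta):
    return sum(a * t for a, t in zip(A, theta))


estimate = nes_gradient_estimate(linear_reward, mu=[0.3, -0.1, 0.2],
                                 sigma=0.5, num_samples=200_000)
print([round(g, 2) for g in estimate])
```

The toy uses a very large sample count in three dimensions only to make the unbiasedness visible; it says nothing about how few samples suffice in the billion-parameter setting discussed in this chapter.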
30.3.2 Simplifications Relative to Standard NES
The paper deliberately strips NES down to its minimal form, removing enhancements that are standard in the ES literature:
| Standard Enhancement | Included? | Rationale (per paper) |
|---|---|---|
| Covariance matrix adaptation (CMA) | No | Full covariance is $O(d^2)$ — intractable at $d = 8 \times 10^9$ |
| Rank transformation of rewards | No | Isolates core algorithm performance |
| Mirrored sampling (antithetic pairs) | No | Simplifies implementation and analysis |
| Weight decay | No | Avoids interference with controlled experiments |
| Adam-style optimizer for update | No | Uses simple SGD-style update |
The paper explicitly notes: "This design choice isolates the core ES algorithm and demonstrates that strong performance can be achieved without auxiliary enhancements." This minimalism is itself a contribution — it shows that raw ES outperforms heavily tuned RL.
30.3.3 Seven Implementation Innovations
While the algorithm is minimal, seven engineering innovations are required to make it tractable at billion-parameter scale. These are documented in the paper and implemented in the repository:
Innovation 1: Seed-Based Noise Storage
Naively storing $N = 30$ noise vectors of dimension $d = 8 \times 10^9$ would require $30 \times 8 \times 10^9 \times 4 \text{ bytes} = 960$ GB — clearly infeasible. Instead, the system stores only $N$ random seeds (integers). Each seed deterministically generates the same noise vector via torch.randn with a fixed generator state. This reduces storage from 960 GB to 240 bytes.
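The storage arithmetic and the seed trick can be sketched as follows. This is an illustration only: it uses Python's `random.Random` in place of a seeded PyTorch generator, and `noise_from_seed` is a hypothetical helper name, not a function from the repository.

```python
import random


def noise_from_seed(seed, size):
    """Regenerate the same Gaussian noise vector from an integer seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


N = 30                    # population size
d = 8 * 10**9             # parameter count of the largest model
naive_bytes = N * d * 4   # N float32 noise vectors held in memory
seed_bytes = N * 8        # one 64-bit integer seed per perturbation

# The same seed always reproduces the same noise, so only the seed is stored.
first = noise_from_seed(1234, size=5)
second = noise_from_seed(1234, size=5)
other = noise_from_seed(1235, size=5)

print(naive_bytes / 10**9, "GB vs", seed_bytes, "bytes")
```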
Innovation 2: Layer-Level In-Place Perturbation
Even generating the full noise vector for a single perturbation (32 GB for 8B parameters in float32) is prohibitive. The system instead iterates over model layers, generating and applying noise one layer at a time. Peak memory is bounded by the size of the largest single layer (typically 0.1–2 GB), not the full model.
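A minimal sketch of the perturb-then-restore pattern, under the assumption that a model can be treated as a list of flat layers. `perturb_in_place` is a hypothetical name, and plain Python lists stand in for parameter tensors.

```python
import random


def perturb_in_place(layers, seed, sigma, sign=+1.0):
    """Add (sign=+1) or remove (sign=-1) seeded noise, one layer at a time."""
    rng = random.Random(seed)       # one generator walks the layers in order
    for layer in layers:            # only one layer's noise exists at a time
        noise = [rng.gauss(0.0, 1.0) for _ in layer]
        for i, n in enumerate(noise):
            layer[i] += sign * sigma * n


layers = [[0.5, -1.0, 0.25], [2.0, 0.0], [1.5, -0.75, 0.1, 0.3]]
original = [list(layer) for layer in layers]

perturb_in_place(layers, seed=7, sigma=0.001)              # perturb
perturbed = [list(layer) for layer in layers]
perturb_in_place(layers, seed=7, sigma=0.001, sign=-1.0)   # restore

# Restoration is exact up to floating-point rounding.
max_error = max(abs(a - b)
                for la, lb in zip(layers, original)
                for a, b in zip(la, lb))
```

Because the restore step re-seeds the generator and walks the layers in the same order, it regenerates exactly the noise that was added. In low-precision formats such as bf16 the round trip is only approximate, which this float64 toy does not capture.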
Innovation 3: Decomposed Parameter Update
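The standard update $\Delta\theta = \alpha \cdot \frac{1}{N} \sum_{n} \tilde{R}_n \cdot \varepsilon_n$ would require materializing the full $d$-dimensional update vector. The decomposed version iterates over layers and seeds, accumulating updates in-place:
$$\theta_\ell \leftarrow \theta_\ell + \frac{\alpha}{N}\, \tilde{R}_n\, \varepsilon_{\ell,n} \quad \text{for each layer } \ell \text{ and each } n = 1, \dots, N$$
where $\varepsilon_{\ell,n}$ is the noise for layer $\ell$ regenerated from seed $s_n$. This works because addition is commutative and associative: the order of summation is immaterial, up to floating-point rounding.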
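The equivalence between the full-vector update and the layer-by-layer, seed-by-seed version can be checked on a toy model. This sketch assumes one random stream per seed that is advanced across layers in a fixed order; `full_update` and `decomposed_update` are illustrative names, not the repository's functions.

```python
import random


def full_update(layers, seeds, weights, lr):
    """Reference: accumulate the whole update vector, then apply it once."""
    n_pop = len(seeds)
    delta = [[0.0] * len(layer) for layer in layers]
    for seed, w in zip(seeds, weights):
        rng = random.Random(seed)
        for row in delta:
            for i in range(len(row)):
                row[i] += lr * w * rng.gauss(0.0, 1.0) / n_pop
    return [[p + dp for p, dp in zip(layer, row)]
            for layer, row in zip(layers, delta)]


def decomposed_update(layers, seeds, weights, lr):
    """Layer-by-layer, seed-by-seed update applied in place."""
    n_pop = len(seeds)
    rngs = [random.Random(seed) for seed in seeds]   # one stream per seed
    for layer in layers:                             # outer loop: layers
        for rng, w in zip(rngs, weights):            # inner loop: seeds
            for i in range(len(layer)):
                layer[i] += lr * w * rng.gauss(0.0, 1.0) / n_pop
    return layers


layers = [[0.5, -1.0, 0.25], [2.0, 0.0]]
seeds = [11, 22, 33]
weights = [1.2, -0.3, -0.9]          # stand-ins for normalized rewards

ref = full_update(layers, seeds, weights, lr=5e-4)
dec = decomposed_update([list(l) for l in layers], seeds, weights, lr=5e-4)
max_diff = max(abs(a - b) for ra, rb in zip(ref, dec) for a, b in zip(ra, rb))
```

The detail that matters is that each seed's generator is created once and then advanced across layers. Re-seeding inside the layer loop would hand every layer the start of the stream and break the correspondence with the noise used during evaluation.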
Innovations 4–7
Innovation 4 (Z-score reward normalization): Rewards are normalized within each iteration as $\tilde{R}_n = (R_n - \bar{R}) / \text{std}(R)$, where $\bar{R}$ and $\text{std}(R)$ are the mean and standard deviation across the $N$ rewards. This ensures consistent gradient magnitudes across iterations and tasks.
Innovation 5 (Greedy decoding): All evaluations use temperature 0 (greedy), eliminating sampling noise and ensuring that any performance difference is attributable solely to the parameter perturbation.
Innovation 6 (Parallel evaluation): The $N$ perturbations are embarrassingly parallel: each GPU evaluates a subset of seeds independently, with only scalar rewards communicated.
Innovation 7 (Learning rate digestion): The $1/\sigma$ factor from the NES gradient estimator is absorbed into $\alpha$, eliminating a redundant hyperparameter.
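The normalization in Innovation 4 is small enough to sketch directly. The version below assumes the sample standard deviation ($N - 1$ denominator, the default of `torch.Tensor.std`) and a small `eps` guard against division by zero. `z_score` is an illustrative name and the reward values are made up.

```python
import statistics


def z_score(rewards, eps=1e-8):
    """Normalize rewards to zero mean and (roughly) unit standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)    # sample std, N - 1 denominator
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up composite rewards for five perturbations.
normalized = z_score([0.2, 0.9, 0.4, 0.4, 0.7])

# If every perturbation earns the same reward, all weights are zero and
# the iteration leaves the model unchanged.
flat = z_score([0.5] * 30)
```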
30.3.4 Algorithm Pseudocode
The following pseudocode is a faithful simplification of the implementation in countdown/es_fine-tuning_countdown.py from the repository:
# Pseudocode faithful to: github.com/VsonicV/es-fine-tuning-paper
# File: countdown/es_fine-tuning_countdown.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def es_fine_tune(model, tokenizer, train_data, N=30, sigma=0.001,
                 lr=5e-4, T=1000):
    """ES fine-tuning loop: single hyperparameter set for all models."""
    device = model.device

    def seeded_noise(param, rng):
        # Gaussian noise for one parameter tensor, drawn in the tensor's dtype
        return torch.randn(param.shape, generator=rng, device=param.device,
                           dtype=param.dtype)

    for iteration in range(T):
        seeds = [torch.randint(0, 2**32, (1,)).item() for _ in range(N)]
        rewards = []
        # Inner loop: evaluate N perturbations
        for seed in seeds:
            # Perturb model in-place, layer by layer
            rng = torch.Generator(device=device).manual_seed(seed)
            for param in model.parameters():
                param.add_(seeded_noise(param, rng), alpha=sigma)
            # Greedy inference and reward computation
            # (evaluate_on_batch is a task-specific helper, not defined here)
            rewards.append(evaluate_on_batch(model, tokenizer, train_data))
            # Restore model in-place using same seed
            rng = torch.Generator(device=device).manual_seed(seed)
            for param in model.parameters():
                param.sub_(seeded_noise(param, rng), alpha=sigma)
        # Z-score normalize rewards
        rewards_t = torch.tensor(rewards)
        normalized = (rewards_t - rewards_t.mean()) / (rewards_t.std() + 1e-8)
        # Decomposed update: layer by layer, seed by seed.
        # One generator per seed, created once and advanced across parameters,
        # so each parameter gets the same noise it saw during perturbation.
        rngs = [torch.Generator(device=device).manual_seed(s) for s in seeds]
        for param in model.parameters():
            for rng, r_norm in zip(rngs, normalized):
                param.add_(seeded_noise(param, rng),
                           alpha=lr * r_norm.item() / N)
    return model
Implementation note: The actual repository code iterates over named model layers rather than raw parameters, and includes multi-GPU distribution via HuggingFace Accelerate. The decomposed update in the real code regenerates noise from seeds for each layer in sequence to maintain RNG state consistency. The accelerated version (es_fine-tuning_countdown_accl.py) replaces HuggingFace inference with vLLM engines for a reported 10× speed-up.
30.3.5 Memory-Efficient Perturbation: Detailed Analysis
The GPU memory comparison between RL and ES for an 8B-parameter model in bf16 precision is striking. The following table is derived from the paper's analysis:
| Memory Component | RL (PPO/GRPO) | ES (This Paper) |
|---|---|---|
| Model parameters (bf16) | ~16 GB | ~16 GB |
| Gradient buffers (fp32) | ~32 GB | 0 |
| Optimizer states (Adam: $m$, $v$) | ~64 GB | 0 |
| Activation cache (backprop) | ~8–32 GB | 0 |
| Reference model (KL penalty) | ~16 GB | 0 |
| Layer-sized noise tensor (temp) | 0 | ~0.1–2 GB |
| Inference KV cache | Included above | ~2–4 GB |
| Total | ~136–160 GB | ~18–22 GB |
The approximately 7–8× memory reduction enables fine-tuning an 8B model on a single 80 GB GPU, whereas RL methods require 2–4 GPUs of the same capacity. This advantage scales favorably with model size — RL's overhead grows linearly with parameter count (gradient and optimizer state), while ES's overhead remains bounded by the largest layer.
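The totals in the table can be reproduced with a few lines of arithmetic. In this sketch the parameter-derived entries are computed from the 8B parameter count, while the activation, noise-tensor, and KV-cache ranges are copied from the table as given; decimal gigabytes are assumed.

```python
GB = 10**9
PARAMS = 8 * 10**9

weights_bf16 = PARAMS * 2 / GB       # 2 bytes per parameter
grads_fp32 = PARAMS * 4 / GB         # 4 bytes per parameter
adam_states = 2 * PARAMS * 4 / GB    # two fp32 moments (m and v)
reference_model = weights_bf16       # frozen bf16 copy for the KL penalty

# (low, high) ranges taken directly from the table above.
activations = (8, 32)
layer_noise = (0.1, 2)
kv_cache = (2, 4)

rl_total = tuple(weights_bf16 + grads_fp32 + adam_states + a + reference_model
                 for a in activations)
es_total = tuple(weights_bf16 + n + k for n, k in zip(layer_noise, kv_cache))

midpoint_ratio = (sum(rl_total) / 2) / (sum(es_total) / 2)
print(rl_total, es_total, round(midpoint_ratio, 1))
```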
30.4 Key Results
30.4.1 Countdown Task: ES Outperforms All RL Baselines
The headline result is on the Countdown symbolic reasoning task (Gandhi et al., 2024), where a model must combine given numbers using arithmetic operations to reach a target value. ES outperforms all RL baselines on every tested model, using a single hyperparameter configuration:
| Base Model | Params | Base | Best RL | Best RL Method | ES | ES Δ vs Best RL |
|---|---|---|---|---|---|---|
| Qwen-2.5-0.5B-Instruct | 0.5B | 0.1% | 13.5% | Dr.GRPO | 14.4% | +0.9 |
| Qwen-2.5-1.5B-Instruct | 1.5B | 0.7% | 31.0% | Dr.GRPO | 37.3% | +6.3 |
| Qwen-2.5-3B-Instruct | 3B | 10.0% | 43.8% | Dr.GRPO | 60.5% | +16.7 |
| Qwen-2.5-7B-Instruct | 7B | 31.2% | 57.5% | Dr.GRPO | 66.8% | +9.3 |
| Llama-3.2-1B-Instruct | 1B | 0.4% | 14.9% | GRPO-v | 16.8% | +1.9 |
| Llama-3.2-3B-Instruct | 3B | 3.2% | 47.8% | Dr.GRPO | 51.6% | +3.8 |
| Llama-3.1-8B-Instruct | 8B | 8.1% | 51.3% | GRPO-z 30 | 61.2% | +9.9 |
Source: Table 1 of Qiu et al. (2026). All ES experiments use N=30, σ=0.001, α=5×10⁻⁴. RL baselines include PPO, GRPO (variants v, z), and Dr.GRPO, each with per-model hyperparameter tuning. Training set: 200 sampled Countdown problems.
Three observations warrant emphasis. First, ES wins on every model: no RL variant beats ES on any model size or family. Second, the largest gains occur on mid-sized models: Qwen-2.5-3B shows a +16.7 percentage-point improvement over the best RL method. Third, ES achieves this with a single hyperparameter configuration across all seven models, while RL requires separate per-model sweeps.
30.4.2 Reward Hacking Resistance
The conciseness fine-tuning task reveals a qualitative behavioral difference between ES and RL. When fine-tuned for concise answers to knowledge questions (composite reward: correctness × conciseness penalty), GRPO degenerates to producing single-token or very short incoherent answers — maximizing the conciseness component by eliminating content. ES, by contrast, produces genuinely concise but coherent and correct responses.
The paper's explanation is mechanistic: ES optimizes a distribution of nearby parameter vectors (the Gaussian perturbation cloud), not a single point. For a reward-hacking behavior to persist under ES optimization, it must be robust to random Gaussian perturbation of all parameters — a much harder condition for degenerate behaviors to satisfy than the single-policy optimization that RL performs.
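The intuition can be illustrated with a one-dimensional toy that is not from the paper: a tall but extremely narrow reward spike (standing in for a fragile, degenerate solution) next to a lower but broad bump. The landscape, the value of `sigma`, and all names below are assumptions chosen for illustration, and `sigma` here is far larger than the paper's 0.001.

```python
import math
import random


def reward(theta):
    """Toy landscape: a tall, very narrow spike at 0 and a broad bump at 3."""
    if abs(theta) < 0.01:
        return 2.0                                # high but fragile
    return math.exp(-0.5 * (theta - 3.0) ** 2)    # lower but robust


def smoothed_reward(mu, sigma, num_samples=20_000, seed=0):
    """Monte Carlo estimate of E[R(mu + sigma * eps)], eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = sum(reward(mu + sigma * rng.gauss(0.0, 1.0))
                for _ in range(num_samples))
    return total / num_samples


# Point-wise, the spike looks better than the bump ...
point_spike, point_bump = reward(0.0), reward(3.0)
# ... but under the Gaussian-smoothed objective the ranking flips.
smooth_spike = smoothed_reward(0.0, sigma=0.5)
smooth_bump = smoothed_reward(3.0, sigma=0.5)
```

A point optimizer would prefer the spike, while an optimizer of the expected reward under perturbation prefers the bump. Whether reward-hacking solutions in real LLMs are "narrow" in this sense is the paper's hypothesis, not something this toy establishes.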
30.4.3 Cross-Run Stability
ES shows dramatically lower variance across independent runs with identical hyperparameters. The paper reports (on the Countdown task) standard deviations of ±1–3% for ES versus ±5–10% for GRPO and ±3–7% for Dr.GRPO. This stability has direct cost implications: if RL requires 3–5 runs for hyperparameter search plus 2–3 runs for reliability assessment, while ES requires a single run, the total compute for ES may be lower despite its per-iteration expense.
30.4.4 Math Reasoning Benchmarks
On standard math benchmarks (GSM8K, MATH500, Minerva Math, OlympiadBench), the paper reports that ES performs comparably to the best RL methods (GRPO, Dr.GRPO, DAPO), typically ranking in the top 2–3 across benchmarks. The paper emphasizes that the advantages of ES — robustness, stability, no reward hacking — do not come at the cost of reduced performance on well-studied tasks. Specific per-benchmark numbers are reported as competitive rather than as clear wins, distinguishing these results from the Countdown headline.
30.4.5 Six Systematic Advantages
The paper identifies six systematic advantages of ES over RL for LLM fine-tuning, supported by the experimental evidence:
- Long-horizon reward tolerance. ES requires only response-level (outcome) rewards, not token-level credit assignment. For reasoning tasks where only the final answer is graded, ES avoids the credit assignment problem entirely.
- Small populations in high-dimensional spaces. $N = 30$ suffices for spaces with billions of dimensions — overturning the assumption that population size must be proportional to dimensionality.
- Cross-model robustness. A single configuration works across Qwen-2.5 and Llama-3.x families (0.5B–8B). RL methods fail on some models, particularly smaller ones.
- Reward hacking resistance. Distributional optimization is harder to hack than single-solution optimization.
- Cross-run stability. Consistent results across runs; RL is often unstable.
- Memory efficiency. Inference-only operation eliminates gradient, optimizer, and activation storage.
30.5 Implementation Details
30.5.1 Code and Reproducibility
The full source code is available at github.com/VsonicV/es-fine-tuning-paper (CC BY-NC-SA 4.0 license). The repository structure is:
| File | Purpose |
|---|---|
| es_fine-tuning_conciseness.py | Conciseness task, correlated noise |
| es_fine-tuning_conciseness_iid.py | Conciseness task, i.i.d. noise |
| countdown/es_fine-tuning_countdown.py | Countdown task, correlated noise |
| countdown/es_fine-tuning_countdown_iid.py | Countdown task, i.i.d. noise |
| es_fine-tuning_countdown_accl.py | Accelerated version (vLLM, 10× speed-up) |
| requirement.txt | Python dependencies |
All models are publicly available on HuggingFace (Qwen-2.5 and Llama-3.x families). Benchmarks use public datasets (Countdown, GSM8K, MATH500). The fixed hyperparameters ($N = 30$, $\sigma = 0.001$, $\alpha = 5 \times 10^{-4}$) are reported for all ES experiments. Seed control is partial: seeds are used for noise generation but not all sources of randomness are documented.
30.5.2 Execution Commands
The following commands are documented in the repository README:
# From repo: github.com/VsonicV/es-fine-tuning-paper
# Standard version with HuggingFace Accelerate (4 GPUs)
accelerate launch \
    --num_processes 4 \
    --num_machines 1 \
    countdown/es_fine-tuning_countdown.py \
    --data_sample 200 \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --gpu_threads 1

# Accelerated version with vLLM (4 GPUs, ~10x faster)
python es_fine-tuning_countdown_accl.py \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --cuda_devices 0,1,2,3 \
    --num_engines 4 \
    --population_size 30 \
    --num_iterations 1000
30.5.3 Compute Costs
The paper provides sufficient detail to estimate training costs. For a typical Countdown experiment (1000 iterations, Qwen-2.5-3B, 4 × H100 GPUs):
| Version | Time per Iteration | Total Time | GPU-Hours | Estimated Cloud Cost |
|---|---|---|---|---|
| Original (Accelerate) | ~2 min | ~33 hours | ~132 H100-hrs | ~$400–$500 |
| Accelerated (vLLM) | ~12 sec | ~3.3 hours | ~13 H100-hrs | ~$40–$50 |
Provenance: Per-iteration timings are from the paper. Cloud costs are author estimates based on typical H100 rental rates (~$3/GPU-hr) and should be treated as approximate.
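The table entries follow from simple arithmetic on the per-iteration timings. The sketch below assumes 1000 iterations, 4 GPUs, and the ~$3 per GPU-hour rate quoted above; `training_cost` is an illustrative helper name.

```python
def training_cost(seconds_per_iter, iterations=1000, gpus=4,
                  usd_per_gpu_hour=3.0):
    """Return (wall-clock hours, GPU-hours, estimated cost in USD)."""
    wall_hours = seconds_per_iter * iterations / 3600
    gpu_hours = wall_hours * gpus
    return wall_hours, gpu_hours, gpu_hours * usd_per_gpu_hour


accelerate_run = training_cost(seconds_per_iter=120)   # ~2 min per iteration
vllm_run = training_cost(seconds_per_iter=12)          # ~12 s per iteration

for name, (hours, gpu_hours, usd) in [("Accelerate", accelerate_run),
                                      ("vLLM", vllm_run)]:
    print(f"{name}: {hours:.1f} h, {gpu_hours:.0f} GPU-h, ${usd:.0f}")
```

The unrounded figure for the Accelerate run is about 133 GPU-hours; the table's ~132 comes from rounding the wall-clock time to 33 hours first.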
The cost advantage over RL is amplified by two factors. First, ES requires no hyperparameter sweeps — one configuration works for all models, while RL typically requires 3–5 sweeps per model. Second, ES's lower memory footprint enables using fewer or smaller GPUs. The paper estimates total RL costs at $500–$2,000 per model (including sweeps) versus $40–$50 for accelerated ES.
30.5.4 Accelerated Architecture
The accelerated version replaces HuggingFace's generate() with vLLM inference engines:
# Pseudocode reflecting: es_fine-tuning_countdown_accl.py
# Uses vLLM for high-throughput inference with continuous batching.
# generate_seeds, select_engine, perturb_vllm_weights, restore_vllm_weights,
# compute_reward, z_score_normalize, decomposed_update, and broadcast_weights
# are placeholders for repository logic, not vLLM APIs.
from vllm import LLM, SamplingParams


def accelerated_es_fine_tune(model_name, prompts, cuda_devices, num_engines,
                             population_size=30, num_iterations=1000):
    """Accelerated ES using vLLM engines for ~10x inference speed-up."""
    # Initialize vLLM engines, one per GPU
    # (pinning each engine to its device_id is omitted in this sketch)
    engines = []
    for device_id in cuda_devices[:num_engines]:
        engine = LLM(model=model_name,
                     tensor_parallel_size=1,
                     gpu_memory_utilization=0.9)
        engines.append(engine)
    sampling_params = SamplingParams(
        temperature=0.0,   # Greedy decoding: deterministic
        max_tokens=512
    )
    for iteration in range(num_iterations):
        seeds = generate_seeds(population_size)
        rewards = []
        # Distribute perturbations across engines
        for seed in seeds:
            engine = select_engine(engines)   # Round-robin or least-loaded
            perturb_vllm_weights(engine, seed, sigma=0.001)
            outputs = engine.generate(prompts, sampling_params)
            reward = compute_reward(outputs)
            rewards.append(reward)
            restore_vllm_weights(engine, seed, sigma=0.001)
        # Normalize and update (same decomposed strategy)
        normalized = z_score_normalize(rewards)
        decomposed_update(engines[0], seeds, normalized,
                          lr=5e-4, N=population_size)
        # Sync updated weights across engines
        broadcast_weights(engines)
The key speed-up comes from vLLM's continuous batching and PagedAttention. These optimizations target inference throughput and do nothing for the backward pass of gradient-based training, but they apply to every step of ES, which is purely inference-based.
30.6 Why Small Populations Work
The most theoretically surprising finding is that $N = 30$ suffices for spaces with billions of dimensions. The paper does not provide a rigorous theoretical explanation, but discusses several hypotheses. We examine each.
30.6.1 Effective Dimensionality
LLM parameter spaces are highly structured. Pre-training creates strong correlations between parameters — most directions in parameter space have minimal impact on model output. The effective dimensionality (the number of directions that meaningfully affect behavior) is likely orders of magnitude smaller than the nominal parameter count. If the effective dimensionality $d_{\text{eff}} \ll d$, then $N = 30$ may provide adequate coverage of the relevant subspace even if it is far too small for uniform coverage of $\mathbb{R}^d$.
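A toy experiment can illustrate, though not prove, this hypothesis. In the sketch below the reward depends on only `K = 5` of `D = 200` coordinates, and a population of 30 with Z-score normalization still climbs it. Every value and name here is an assumption chosen for a fast pure-Python run and is unrelated to the paper's settings.

```python
import random
import statistics

D, K = 200, 5            # nominal vs. effective dimensionality (toy values)
N, SIGMA, LR, T = 30, 0.1, 0.05, 100


def reward(theta):
    """Only the first K coordinates matter; the other D - K are inert."""
    return -sum((theta[i] - 1.0) ** 2 for i in range(K))


def es_step(mu, rng):
    """One ES iteration: sample N perturbations, z-score, update the center."""
    noises = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(N)]
    rewards = [reward([m + SIGMA * e for m, e in zip(mu, eps)])
               for eps in noises]
    mean, std = statistics.fmean(rewards), statistics.stdev(rewards)
    weights = [(r - mean) / (std + 1e-8) for r in rewards]
    return [m + LR / N * sum(w * eps[i] for w, eps in zip(weights, noises))
            for i, m in enumerate(mu)]


rng = random.Random(0)
mu = [0.0] * D
initial_reward = reward(mu)
for _ in range(T):
    mu = es_step(mu, rng)
final_reward = reward(mu)
print(initial_reward, final_reward)
```

The inert coordinates contribute nothing to the reward variance, so the normalized weights track only the five relevant directions. The remaining coordinates drift slightly as a random walk but do not affect the reward.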
30.6.2 Pre-Training as Initialization
ES is not searching from scratch — it fine-tunes from a pre-trained model that already possesses the right structure for language tasks. The pre-trained initialization places $\mu$ in a region of parameter space where small perturbations (radius $\sigma = 0.001$) produce semantically meaningful behavioral changes. This is fundamentally different from the random-initialization setting analyzed by Vemula et al. (2019).
30.6.3 Binary Reward Signal Strength
For the Countdown task, rewards are binary (correct/incorrect). With $N = 30$ perturbations and any non-trivial success rate, the Z-score normalization produces a clear signal: perturbations that led to correct answers receive positive weight, others receive negative weight. The binary partition is informative regardless of the dimensionality of the perturbation.
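As a worked instance of this argument (a sketch assuming the sample standard deviation, as in `rewards_t.std()` from the pseudocode in Section 30.3.4; $k$ and $p$ are notation introduced here): with $k$ correct answers among $N$ binary rewards and $p = k/N$, the mean is $\bar{R} = p$ and

$$s = \sqrt{\frac{N}{N-1}\, p(1-p)}, \qquad \tilde{R}_{\text{correct}} = \frac{1 - p}{s}, \qquad \tilde{R}_{\text{incorrect}} = -\frac{p}{s}.$$

For $N = 30$ and $k = 6$: $p = 0.2$ and $s \approx 0.407$, so each correct perturbation receives weight $\approx +1.97$ and each incorrect one $\approx -0.49$. When $k = 0$ or $k = N$, $s = 0$ and the iteration carries no signal, which is why a non-trivial success rate matters.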
30.6.4 Greedy Decoding as Amplifier
Without sampling noise, even small parameter changes produce detectable output differences. Stochastic decoding would mask the perturbation signal with sampling variance, potentially requiring larger populations to achieve the same signal-to-noise ratio. Greedy decoding makes each perturbation maximally informative.
These hypotheses remain to be formalized. The gap between the $O(d^2)$ sample-complexity scaling argued by Vemula et al. (2019) and the empirical success at $N = 30$ represents an open theoretical question for the field.
30.7 Comparative Analysis
30.7.1 ES vs. RL: Systematic Comparison
| Dimension | RL (PPO/GRPO/Dr.GRPO) | ES (This Paper) |
|---|---|---|
| Optimization space | Action space (token sequences) | Parameter space (model weights) |
| Gradient computation | Backpropagation required | Zeroth-order (no backprop) |
| Credit assignment | Token-level or outcome-level | Outcome-level only |
| GPU memory (8B model) | ~136–160 GB | ~18–22 GB |
| Hyperparameter sensitivity | Very high (per-model tuning) | Low (single config for all models) |
| Cross-run stability | Low (±5–10%) | High (±1–3%) |
| Reward hacking | Susceptible | Resistant (distributional optimization) |
| Cross-model robustness | Variable (fails on some small models) | Consistent across all tested models |
| Per-iteration cost | Lower (single forward + backward) | Higher (N=30 forward passes) |
| Total cost (including sweeps) | Higher (multiple sweeps needed) | Lower (single config) |
30.7.2 ES vs. Other Zeroth-Order Methods
MeZO (Malladi et al., 2023) is a prior zeroth-order method for LLM fine-tuning that uses a single random perturbation per step (rather than a population). The paper positions ES favorably against MeZO:
| Method | Memory (8B) | Quality | Mechanism |
|---|---|---|---|
| MeZO | ~20 GB | Below RL baselines | Single perturbation, two-point estimator |
| ES (this paper) | ~18 GB | Exceeds RL baselines | Population of 30, Z-score normalization |
The population mechanism and reward normalization appear to be the critical differentiators that make ES succeed where single-perturbation zeroth-order methods fall short.
30.7.3 Position in the Evolutionary AI Landscape
ES at Scale occupies a unique position among the systems surveyed in this book. It is the only system that directly evolves model parameters at billion scale. The comparison with code-evolution systems is instructive:
ES at Scale is positioned at the lowest abstraction level (continuous parameter vectors) and the highest scale (billions of parameters). This directness is both its strength — no information loss through code, prompt, or merge abstraction — and its constraint: exploration is local in parameter space, bounded by the noise radius $\sigma$.
30.8 Learning Dynamics and Convergence
30.8.1 Training Trajectories
The paper documents distinctive learning dynamics for ES compared to RL. ES exhibits slower but monotonically increasing accuracy curves without the plateaus, oscillations, or reward hacking collapses that characterize RL training. The predictability of ES learning curves enables better estimation of required compute budgets — a practical advantage for research labs managing GPU allocations.
30.8.2 Population Dynamics
Unlike population-based evolutionary systems (AlphaEvolve, OpenEvolve, GEPA) that maintain multiple diverse candidates, ES maintains a single model center $\mu$ with a transient Gaussian cloud around it. Individual perturbations are ephemeral — they exist only for evaluation and are discarded immediately after. Only the center $\mu$ persists across iterations. This is conceptually closer to stochastic gradient descent (where $\mu$ is the parameter vector and $\varepsilon$ provides stochastic exploration) than to a traditional evolutionary algorithm with a persistent population.
This single-center design trades diversity for memory efficiency. Maintaining 30 full copies of an 8B model would require ~480 GB — the transient perturbation approach requires ~18 GB.
30.8.3 Hyperparameter Sensitivity
The paper reports hyperparameter sensitivity analysis across several dimensions:
| Hyperparameter | ES Sensitivity | RL Sensitivity | Implication |
|---|---|---|---|
| Learning rate ($\alpha$) | Low | Very high | RL requires per-model tuning |
| Population / group size ($N$) | Low | Moderate | $N = 30$ works universally for ES |
| Noise scale ($\sigma$) | Moderate | N/A | ES-specific; $\sigma = 0.001$ is robust |
| KL penalty ($\beta$) | N/A | Very high | Wrong $\beta$ causes RL collapse or reward hacking |
The noise scale $\sigma$ is the most ES-specific and most sensitive hyperparameter. Because the perturbation is additive ($\mu + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$), $\sigma = 0.001$ means each parameter receives Gaussian noise with an absolute standard deviation of 0.001, regardless of that parameter's magnitude. This is small enough to maintain model coherence, yet large enough to produce detectable behavioral changes under greedy decoding.
30.9 Limitations and Open Questions
The paper acknowledges several limitations that bound the generality of its claims:
Scale ceiling unknown. The largest model tested is 8B parameters. Whether ES remains effective at 70B or 405B is an open question. The effective dimensionality argument (Section 30.6.1) suggests it might, but empirical validation is needed.
Convergence speed. ES is slower per iteration than RL for some tasks. Each iteration requires $N = 30$ full forward passes, whereas RL requires one forward pass plus one backward pass. The paper argues that total cost is competitive when accounting for RL's hyperparameter sweeps, but this depends on the task and model.
Fixed exploration radius. The noise scale $\sigma$ is fixed throughout training. Adaptive schemes like CMA-ES (which adapts the full covariance matrix) could improve performance but are computationally intractable at billion-parameter scale in their standard form. Low-rank covariance approximations might bridge this gap.
No theoretical guarantees. The empirical success at $N = 30$ with $d = 8 \times 10^9$ is not theoretically explained. The analysis of Vemula et al. (2019) predicts poor performance at this scale. The paper falsifies this prediction empirically but does not provide an alternative theoretical framework.
Limited to post-training. All experiments start from pre-trained models. Whether ES could scale to pre-training from scratch — where the initialization argument (Section 30.6.2) does not apply — remains untested.
No cross-task transfer. Each experiment fine-tunes for a single task from the pre-trained model. Sequential fine-tuning, multi-task rewards, and curriculum learning are not explored.
Noise variant impact unclear. The repository provides both correlated and i.i.d. noise variants. The paper states both achieve similar results, but the theoretical implications of correlated noise in billion-dimensional spaces are not analyzed.
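The "fixed exploration radius" limitation above turns on storage cost, which a short calculation makes concrete. The sketch below counts the values needed for a full covariance matrix versus a diagonal-plus-low-rank form, then samples from the latter. It is a generic low-rank Gaussian illustration, not a method from the paper and not any specific CMA-ES variant; the names `covariance_entries` and `sample_low_rank` and the rank of 4 are assumptions.

```python
import random

def covariance_entries(d, rank=None):
    """Stored values: a full d x d matrix, or a diagonal plus a d x rank factor."""
    if rank is None:
        return d * d
    return d + d * rank

def sample_low_rank(diag_std, factors, rng):
    """Draw eps ~ N(0, D^2 + U U^T), D = diag(diag_std), U's columns = `factors`."""
    d = len(diag_std)
    eps = [s * rng.gauss(0.0, 1.0) for s in diag_std]  # isotropic-style part
    for column in factors:                             # each column has length d
        coeff = rng.gauss(0.0, 1.0)
        for i in range(d):
            eps[i] += coeff * column[i]                # add variance along column
    return eps

d = 8_000_000_000                                      # 8B parameters
print("full covariance entries:", covariance_entries(d))
print("rank-4 entries:         ", covariance_entries(d, rank=4))

# Small demo: extra variance appears only along the single learned direction.
rng = random.Random(0)
draws = [sample_low_rank([0.1, 0.1, 0.1], [[1.0, 0.0, 0.0]], rng)
         for _ in range(5000)]
var_along = sum(e[0] ** 2 for e in draws) / len(draws)
var_other = sum(e[1] ** 2 for e in draws) / len(draws)
print(var_along, var_other)
```

A full covariance at $d = 8 \times 10^9$ would need $6.4 \times 10^{19}$ entries, while a rank-4 factorization needs $4 \times 10^{10}$, which is a handful of extra model-sized vectors. Whether such an approximation would actually help at this scale is untested.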
30.10 Implications for the Field
30.10.1 Neuroevolution Revived
This paper revives direct parameter-space optimization of neural networks, a research direction that was largely sidelined once gradient-based methods became dominant in the early 2010s. By demonstrating that ES works at billion-parameter scale, it reopens investigation into other evolutionary approaches to neural network optimization: CMA-ES variants, novelty search in parameter space, quality-diversity methods, and hybrid gradient-evolutionary approaches.
30.10.2 Post-Training Paradigm
The paper positions ES as a third post-training paradigm alongside SFT and RL:
| Paradigm | Gradient | Requires | Exploration |
|---|---|---|---|
| SFT (Supervised Fine-Tuning) | Yes (backprop) | Labeled data | None (deterministic) |
| RL (PPO, GRPO, DPO, etc.) | Yes (backprop) | Reward model or rule | Action space (token sampling) |
| ES (this paper) | No (zeroth-order) | Reward function only | Parameter space (Gaussian) |
30.10.3 Practical Applications
The paper suggests several applications where ES may be preferable to RL: (1) RLHF replacement, where resistance to reward hacking is particularly valuable for alignment; (2) code generation fine-tuning, where test-case passage provides natural binary outcome rewards; (3) distributed fine-tuning, where ES's embarrassingly parallel nature and scalar-only communication make it well suited to multi-node or multi-datacenter deployment; and (4) safety alignment, where resistance to reward function exploitation is a critical property.
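The scalar-only communication mentioned in application (3) can be illustrated with a small simulation. In this sketch two simulated workers each hold a replica of the center, evaluate disjoint sets of seeds, and exchange only `(seed, reward)` pairs; each then reconstructs the same update locally. The function names, the toy reward, and the hyperparameter values are assumptions for illustration, and a real deployment would replace the list concatenation with an actual network exchange.

```python
import random

def noise(seed, dim):
    """Every worker can rebuild the same direction from the shared seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def worker_evaluate(mu, seed, sigma, reward_fn):
    """Runs on one worker: perturb locally and return a single scalar."""
    eps = noise(seed, len(mu))
    return reward_fn([m + sigma * e for m, e in zip(mu, eps)])

def apply_update(mu, messages, alpha):
    """Runs identically on every worker once all (seed, reward) pairs arrive."""
    n = len(messages)
    rewards = [r for _, r in messages]
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    new_mu = list(mu)
    for seed, r in messages:
        z = (r - mean) / std
        for i, e in enumerate(noise(seed, len(mu))):
            new_mu[i] += alpha * z * e / n
    return new_mu

def toy_reward(theta):  # hypothetical stand-in for a task reward
    return -sum((x - 1.0) ** 2 for x in theta)

replica_a = [0.0] * 6
replica_b = [0.0] * 6
seeds = list(range(8))

# Each worker evaluates half of the seeds against its own replica.
msgs_a = [(s, worker_evaluate(replica_a, s, 0.1, toy_reward)) for s in seeds[:4]]
msgs_b = [(s, worker_evaluate(replica_b, s, 0.1, toy_reward)) for s in seeds[4:]]
messages = msgs_a + msgs_b  # the only data that crosses the "network"

replica_a = apply_update(replica_a, messages, alpha=0.1)
replica_b = apply_update(replica_b, messages, alpha=0.1)
print(replica_a == replica_b, len(messages))
```

Because each message is one integer and one float, the traffic per iteration is independent of model size, and the replicas stay in sync without any parameter or gradient transfer.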
30.10.4 Impact Assessment
| Dimension | Assessment | Notes |
|---|---|---|
| Scientific novelty | Very High | First full-parameter ES at billion scale |
| Practical utility | High | Cheaper, more stable, resistant to reward hacking |
| Reproducibility | High | Full code, public models, fixed hyperparameters |
| Generality | High | Works across model families, sizes, and tasks |
| Theoretical depth | Medium | Empirical strength, limited theoretical explanation |
| Community adoption | Growing | 340+ GitHub stars, ICML 2026 acceptance |
Source: Assessment categories and ratings are author analysis based on the evidence presented in the paper and repository metrics. Star counts are as reported in the source material (early 2026).
Summary
Key takeaway: Evolution Strategies can directly fine-tune billion-parameter LLMs with a population of just 30, outperforming reinforcement learning methods (PPO, GRPO, Dr.GRPO) on every tested model while using 7–8× less GPU memory and requiring no per-model hyperparameter tuning.
Main contribution: The paper overturns the widely held assumption that parameter-space exploration is intractable at modern model scales. Seven engineering innovations — seed-based noise, layer-level in-place perturbation, decomposed updates, Z-score normalization, greedy decoding, parallel evaluation, and learning rate digestion — collectively enable a minimal ES algorithm to operate in billion-dimensional spaces with remarkable efficiency.
What researchers should know: ES occupies a unique position in the evolutionary AI landscape as the only demonstrated method for direct, full-parameter evolutionary optimization of LLMs at scale. Its six systematic advantages over RL — long-horizon tolerance, small populations, cross-model robustness, reward hacking resistance, cross-run stability, and memory efficiency — make it a serious candidate for the post-training toolbox. The theoretical gap between the predicted $O(d^2)$ sample complexity and the observed $N = 30$ success remains the most important open question.