Evolution Strategies at Scale
First successful application of ES to full-parameter LLM fine-tuning at billion-parameter scale without dimensionality reduction
Organization: Cognizant AI Labs / University of Texas at Austin
Published: September 2025 (v1), February 2026 (v2)
Type: Research Paper (ICML 2026)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
arXiv: 2509.24372 (cs.LG, cs.AI, cs.NE)
Repository: github.com/VsonicV/es-fine-tuning-paper (340+ stars)
Venue: ICML 2026
Submission History:
- v1: September 29, 2025 (original submission)
- v2: February 6, 2026 (revised version with additional experiments)
DOI: 10.48550/arXiv.2509.24372
License: CC BY-NC-SA 4.0
Lineage: Builds on the intellectual foundation of OpenAI ES (Salimans et al., 2017) and Natural Evolution Strategies (Wierstra et al., 2008, 2014). Positioned as the first work to scale ES from millions of parameters (prior art) to billions of parameters (LLMs). The paper explicitly challenges the assumption — widely held since Vemula et al. (2019) — that parameter-space exploration is inherently unscalable.
Key Claim: ES is not merely a viable alternative to RL for LLM fine-tuning, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning.
2 Authors and Team
Author List
| Author | Affiliation | Role / Expertise |
|---|---|---|
| Xin Qiu | Cognizant AI Labs | Lead author, corresponding author. ES implementation and experimental design. |
| Yulu Gan | Cognizant AI Labs | ES implementation, experimental evaluation |
| Conor F. Hayes | Cognizant AI Labs | RL baseline implementation, comparative analysis |
| Qiyao Liang | Cognizant AI Labs | Experimental evaluation, benchmark design |
| Yinggan Xu | Cognizant AI Labs | Infrastructure, scaling experiments |
| Roberto Dailey | Cognizant AI Labs | Infrastructure, GPU parallelization |
| Elliot Meyerson | Cognizant AI Labs | Evolutionary computation expertise, research direction |
| Babak Hodjat | Cognizant AI Labs | Senior research leadership, evolutionary AI strategy |
| Risto Miikkulainen | Cognizant AI Labs / UT Austin | Senior author. Neuroevolution pioneer, NEAT co-creator |
Team Context
This paper comes from Cognizant AI Labs, the research arm of Cognizant Technology Solutions, which maintains one of the most established evolutionary computation research groups in industry. The team is led by Babak Hodjat (co-founder of Sentient Technologies, one of the largest AI startups focused on evolutionary computation) and Risto Miikkulainen (UT Austin professor, co-creator with Kenneth O. Stanley of NEAT, NeuroEvolution of Augmenting Topologies, one of the most cited neuroevolution algorithms).
Elliot Meyerson is notable for his work on evolutionary search with LLMs as mutation operators — his 2024 paper on "Language Model Crossover" is a key precursor to the LLM-as-evolutionary-operator paradigm used in systems like AlphaEvolve.
The team's collective expertise in neuroevolution gives this paper particular authority. When the group behind NEAT, which also scaled evolutionary optimization to production systems at Sentient Technologies, claims that ES scales to billion-parameter LLMs, the community pays attention.
Institutional Significance
Cognizant AI Labs occupies a unique position in the evolutionary AI landscape:
| Institution | Focus | Key Contributions |
|---|---|---|
| Google DeepMind | LLM-as-mutation-operator (code evolution) | AlphaEvolve, FunSearch, AlphaTensor |
| Sakana AI | Model merging via evolution | Evolutionary Model Merging |
| Cognizant AI Labs | Direct parameter-space evolution of LLMs | This paper (ES at Scale) |
| OpenAI (historical) | ES for RL policy optimization | OpenAI ES (Salimans et al., 2017) |
While DeepMind uses LLMs to evolve code (programs, algorithms), and Sakana uses evolution to merge LLMs, Cognizant's contribution is fundamentally different: they use evolution to directly optimize the parameters of LLMs. This is the most classical form of neuroevolution, applied at unprecedented scale.
3 Core Contribution
Key Novelty: For the first time, Evolution Strategies (ES) is successfully scaled to direct, full-parameter fine-tuning of LLMs with billions of parameters — without any dimensionality reduction (no LoRA, no final-layer-only, no action-space surrogates). The paper demonstrates that ES outperforms established RL methods (PPO, GRPO, Dr.GRPO) across multiple axes, overturning the widespread assumption that ES cannot scale to modern model sizes.
The Assumption Overturned
The conventional wisdom in the field, established by Vemula et al. (2019), held that parameter-space exploration complexity scales quadratically with parameter count (O(d²), where d is the dimensionality), making it intractable for models with billions of parameters. Prior ES applications had been limited to:
| Prior Work | Year | Parameters | Population Size |
|---|---|---|---|
| Salimans et al. (OpenAI ES) | 2017 | ~4M | 10,000+ |
| Zhang et al. | 2017 | ~3M | 10,000+ |
| Lehman et al. | 2018 | ~167K | Large |
| Lorenc & Neruda | 2025 | ~2.5M | Large |
| Toledano-López et al. | 2022 | 325 (last layer only) | Small |
| Jin et al. | 2024 | 1,600 (LoRA adapters only) | Small |
This paper:
| This Work | Year | Parameters | Population Size |
|---|---|---|---|
| ES at Scale | 2025–2026 | 0.5B – 8B | 30 |
The jump is dramatic: from millions to billions of parameters, and from populations of 10,000+ down to just 30. Each change alone was assumed to be fatal to ES performance; together they should have been catastrophic. Instead, ES outperforms RL.
Six Advantages of ES Over RL for LLM Fine-Tuning
The paper identifies six systematic advantages:
1. Long-horizon reward tolerance. ES needs only response-level (outcome) rewards, not token-level credit assignment. For reasoning tasks where only the final answer is graded, ES avoids the credit assignment problem entirely.
2. Small populations in high-dimensional spaces. A population of just 30 is sufficient to search in multi-billion-parameter spaces. Previous work assumed populations must be proportional to dimensionality.
3. Cross-model robustness. ES consistently fine-tunes all tested LLMs (Qwen-2.5 and Llama-3.x families, 0.5B–8B). RL methods fail on some models, particularly smaller ones.
4. Reward hacking resistance. ES optimizes a solution distribution (the Gaussian perturbation cloud), which is harder to hack than RL's single-solution optimization. RL tends to exploit reward function loopholes.
5. Cross-run stability. ES produces consistent results across multiple runs with the same hyperparameters. RL is often unstable, requiring expensive hyperparameter sweeps per model.
6. Memory efficiency. ES requires only inference (no backpropagation), eliminating the need for gradient storage, optimizer states, and activation caching, yielding significant GPU memory savings.
Paradigm Positioning
The paper positions ES not as a niche technique but as a new post-training paradigm:
All three post-training routes start from a pre-trained model:

| Property | SFT (Supervised Fine-Tuning) | RLHF / RL (PPO, GRPO, DPO, etc.) | ES Fine-Tuning (This Paper) |
|---|---|---|---|
| Optimization | Gradient-based | Gradient-based | No gradients, no backprop |
| Supervision | Requires labels | Requires reward model or rule | Requires reward function only |
| Reward granularity | None (deterministic supervised training) | Token-level or outcome rewards | Response-level rewards only |
| Exploration | None | Action-space exploration | Parameter-space exploration |
4 Supported Solutions
Fine-Tuning Tasks Evaluated
The paper evaluates ES on three categories of tasks:
1. Symbolic Reasoning: Countdown Task
The Countdown task (Gandhi et al., 2024; Pan et al., 2025) requires the model to combine given numbers using arithmetic operations to reach a target number.
| Feature | Detail |
|---|---|
| Task format | Given numbers [a, b, c, d], reach target T using +, −, ×, ÷ |
| Reward type | Binary outcome: correct (1.0) or incorrect (0.0) |
| Horizon | Long — full response generation before reward |
| Training set | 200 sampled problems |
| Test set | Held-out evaluation set |
| Models tested | 7 models across Qwen-2.5 and Llama-3.x families (0.5B–8B) |
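To make the binary outcome reward in the table concrete, here is an illustrative checker for the Countdown grading rule described above. The parser and the `countdown_reward` helper are assumptions for illustration, not the authors' code; the paper specifies only that reward is response-level and binary (1.0 for a correct combination, 0.0 otherwise).

```python
import ast
import operator

# Allowed arithmetic operators for the Countdown task: +, -, *, /
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate a restricted arithmetic AST (binary ops over numbers only)."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def _leaves(node):
    """Collect the numeric leaves so we can check number usage."""
    if isinstance(node, ast.BinOp):
        return _leaves(node.left) + _leaves(node.right)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return [node.value]
    raise ValueError("disallowed expression")

def countdown_reward(expression: str, numbers: list, target: int) -> float:
    """Binary outcome reward: 1.0 iff the expression uses exactly the
    given numbers and evaluates to the target; 0.0 otherwise."""
    try:
        tree = ast.parse(expression, mode="eval").body
        if sorted(_leaves(tree)) != sorted(numbers):
            return 0.0
        return 1.0 if abs(_eval(tree) - target) < 1e-6 else 0.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
```

For example, `countdown_reward("(4 + 2) * 6", [4, 2, 6], 36)` returns 1.0, while any malformed or incorrect response scores 0.0, which is exactly the long-horizon setting ES tolerates well.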
2. Behavioral Fine-Tuning: Conciseness
Fine-tuning LLMs to produce shorter, more concise responses to knowledge questions.
| Feature | Detail |
|---|---|
| Task format | Answer knowledge questions concisely |
| Reward type | Composite: correctness × conciseness penalty |
| Horizon | Full response generation |
| Dataset | 500 questions from knowledge benchmarks |
| Models tested | Qwen-2.5-7B-Instruct |
| Key observation | RL hacks the reward by degenerating to single-token answers; ES maintains coherent responses |
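A hedged sketch of what a composite reward of the shape in the table (correctness × conciseness penalty) could look like. The paper does not publish this exact formula; `target_len`, `scale`, and the exponential penalty are illustrative assumptions. The key structural point stands regardless: a wrong answer scores zero no matter how short it is, so a degenerate single-token response only pays off if it happens to remain correct.

```python
import math

def conciseness_reward(answer: str, is_correct: bool,
                       target_len: int = 20, scale: float = 40.0) -> float:
    """Hypothetical composite reward: correctness gated by a smooth
    length penalty. Wrong answers score 0 regardless of length."""
    if not is_correct:
        return 0.0
    n_tokens = len(answer.split())  # crude whitespace token count, for illustration
    # Penalty is 1.0 at or below target_len and decays exponentially beyond it.
    return math.exp(-max(0, n_tokens - target_len) / scale)
```

Under this shape, a correct one-word answer like "Paris" earns the full 1.0, while a correct but verbose 100-token answer earns roughly exp(-2) ≈ 0.14, which is the gradient RL exploits by collapsing length, sometimes past the point of coherence.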
3. Math Reasoning: GSM8K, MATH500, Minerva Math, OlympiadBench
Extended comparisons with SOTA RL methods:
| Benchmark | Type | Difficulty |
|---|---|---|
| GSM8K | Grade school math | Easy |
| MATH500 | Competition math | Medium–Hard |
| Minerva Math | Mathematical reasoning | Hard |
| OlympiadBench | Olympiad-level problems | Very Hard |
4. Puzzle Problem Solving
ES is applied to solve two puzzle problems that base LLMs struggle with:
| Puzzle | Description | ES Contribution |
|---|---|---|
| Number sequence puzzles | Find patterns in number sequences | ES fine-tuning enables discovery of solutions base models cannot find |
| Logic grid puzzles | Constraint satisfaction problems | ES-tuned models show improved systematic reasoning |
Solution Space Characterization
| Dimension | Value |
|---|---|
| Search space | Full parameter space of transformer LLMs (0.5B–8B parameters) |
| Solution representation | Continuous real-valued vectors (model weights) |
| Evaluation | Deterministic (greedy decoding) |
| Fitness function | Task-specific reward (binary correctness, composite scores) |
| Constraint handling | None explicit (reward function encodes constraints) |
5 LLM Integration
LLMs as Optimization Targets (Not Operators)
This paper uses LLMs fundamentally differently from systems like AlphaEvolve, FunSearch, or EvoPrompting. In those systems, the LLM is a tool that generates candidate solutions (code, programs). In this paper, the LLM is the optimization target — its parameters are the search space, and ES directly manipulates billions of floating-point values to improve task performance.
AlphaEvolve / FunSearch approach:
┌───────────────┐ ┌──────────────┐ ┌──────────────┐
│ LLM (frozen) │────►│ Generated │────►│ Evaluator │
│ generates code│ │ code/program │ │ scores code │
└───────────────┘ └──────────────┘ └──────────────┘
▲ │
└──────────── fitness feedback ────────────┘
This paper's approach (ES at Scale):
┌─────────────────────────────────────────────────────┐
│ LLM Parameters (θ) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Layer 1 │ │ Layer 2 │ │ Layer N │ ... │
│ │ weights │ │ weights │ │ weights │ (billions │
│ │ │ │ │ │ │ of params) │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ↑ ↑ ↑ │
│ ε₁ ~ N(0,I) ε₂ ~ N(0,I) εₙ ~ N(0,I) │
│ (perturbation) │
└─────────────────────────────────────────────────────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ Perturbed LLM│────►│ Reward R(θ+σε)│
│ generates │ │ (binary or │
│ responses │ │ composite) │
└──────────────┘ └──────────────┘
│
▼
θ ← θ + α · (1/N) · Σ Rₙ · εₙ (parameter update)
Models Evaluated
| Model Family | Model | Parameters | Architecture |
|---|---|---|---|
| Qwen-2.5 | Qwen-2.5-0.5B-Instruct | 0.5B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-1.5B-Instruct | 1.5B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-3B-Instruct | 3B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-7B-Instruct | 7B | Transformer decoder |
| Llama-3.2 | Llama-3.2-1B-Instruct | 1B | Transformer decoder |
| Llama-3.2 | Llama-3.2-3B-Instruct | 3B | Transformer decoder |
| Llama-3.1 | Llama-3.1-8B-Instruct | 8B | Transformer decoder |
Inference-Only Operation
A critical property of ES fine-tuning is that it requires only forward passes through the model:
| Operation | RL (PPO/GRPO) | ES (This Paper) |
|---|---|---|
| Forward pass | Yes | Yes |
| Backward pass (backprop) | Yes | No |
| Gradient computation | Yes | No |
| Optimizer state (Adam) | Yes | No |
| Activation caching | Yes | No |
| Reference model | Yes (for KL penalty) | No |
This has profound implications for GPU memory:
GPU Memory Layout — RL Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16) │ ~16 GB (8B) │
│ Gradient buffers │ ~16 GB │
│ Optimizer states (Adam: m, v) │ ~32 GB │
│ Activation cache (for backprop) │ ~8–32 GB │
│ Reference model (KL penalty) │ ~16 GB │
├────────────────────────────────────┼───────────────┤
│ TOTAL │ ~88–112 GB │
└────────────────────────────────────┴───────────────┘
GPU Memory Layout — ES Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16) │ ~16 GB (8B) │
│ Layer-sized noise tensor (temp) │ ~0.1–2 GB │
│ Random seeds (N integers) │ negligible │
├────────────────────────────────────┼───────────────┤
│ TOTAL │ ~16–18 GB │
└────────────────────────────────────┴───────────────┘
This ~5–6× memory reduction is a substantial practical advantage, enabling fine-tuning of larger models on smaller GPU clusters.
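The memory tables above can be reproduced with back-of-envelope arithmetic, assuming bf16 storage (2 bytes/param) for weights, gradients, and both Adam moment tensors, which is the assumption that matches the figures quoted:

```python
# Back-of-envelope reproduction of the memory layouts above for an
# 8B-parameter model. All tensors assumed bf16 (2 bytes per parameter),
# consistent with the quoted ~16 GB weights and ~32 GB Adam states.
params = 8e9
bf16 = 2  # bytes per parameter

weights   = params * bf16 / 1e9      # model weights            -> ~16 GB
grads     = params * bf16 / 1e9      # gradient buffers (RL)    -> ~16 GB
adam      = 2 * params * bf16 / 1e9  # Adam m and v (RL)        -> ~32 GB
reference = params * bf16 / 1e9      # frozen KL model (RL)     -> ~16 GB

rl_total = weights + grads + adam + reference  # + activations (~8-32 GB)
es_total = weights                             # + one layer-sized noise tensor

print(f"RL (before activations): ~{rl_total:.0f} GB")
print(f"ES:                      ~{es_total:.0f} GB")
```

Adding the 8–32 GB activation cache to the RL side recovers the ~88–112 GB total, against ~16–18 GB for ES.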
6 Key Results
Headline Result: Countdown Task (Table 1)
ES outperforms all RL baselines across all 7 tested models:
| Base Model | Params | Original | Best RL | ES | ES Δ vs Best RL |
|---|---|---|---|---|---|
| Qwen-2.5-0.5B-Instruct | 0.5B | 0.1% | 13.5% (Dr.GRPO) | 14.4% | +0.9 |
| Qwen-2.5-1.5B-Instruct | 1.5B | 0.7% | 31.0% (Dr.GRPO) | 37.3% | +6.3 |
| Qwen-2.5-3B-Instruct | 3B | 10.0% | 43.8% (Dr.GRPO) | 60.5% | +16.7 |
| Qwen-2.5-7B-Instruct | 7B | 31.2% | 57.5% (Dr.GRPO) | 66.8% | +9.3 |
| Llama-3.2-1B-Instruct | 1B | 0.4% | 14.9% (GRPO-v) | 16.8% | +1.9 |
| Llama-3.2-3B-Instruct | 3B | 3.2% | 47.8% (Dr.GRPO) | 51.6% | +3.8 |
| Llama-3.1-8B-Instruct | 8B | 8.1% | 51.3% (GRPO-z 30) | 61.2% | +9.9 |
Key observations:
- ES wins on every model. Not a single RL variant beats ES on any model size or family.
- Largest gains on mid-sized models. Qwen-2.5-3B shows the most dramatic improvement: 60.5% vs 43.8%, a +16.7 absolute improvement.
- Single hyperparameter set. ES uses the same hyperparameters (N=30, σ=0.001, α=5×10⁻⁴) for ALL models. RL requires per-model hyperparameter sweeps.
- Cross-family robustness. ES works on both Qwen and Llama families. RL performance varies significantly across families.
Reward Hacking Analysis (Conciseness Task)
The conciseness task reveals a qualitative difference between ES and RL:
| Method | Reward Score | Actual Behavior | Explanation |
|---|---|---|---|
| GRPO | Very high | Degenerates to single-token or very short incoherent answers | Hacks the conciseness reward by minimizing length at the expense of content |
| ES | High | Produces genuinely concise but coherent and correct answers | Optimizes the distribution, making extreme reward-hacking behaviors unlikely |
The paper explains: "ES optimizes a solution distribution (the perturbation cloud), which is more difficult to hack, while RL optimizes a single solution."
This is a fundamental insight. RL fine-tuning produces a single policy that can find and exploit reward function loopholes. ES produces a distribution of nearby policies — for any reward hack to persist, it must be robust to Gaussian perturbation of all parameters, which is much harder for degenerate behaviors to achieve.
Cross-Run Stability
ES shows dramatically lower variance across independent runs:
| Method | Mean Accuracy | Std Dev Across Runs | Interpretation |
|---|---|---|---|
| GRPO | ~40% | ±5–10% | High variance; some runs fail entirely |
| Dr.GRPO | ~45% | ±3–7% | Moderate variance; improved but not stable |
| ES | ~55% | ±1–3% | Low variance; consistent results |
This stability has practical cost implications. If RL requires 3–5 runs to find a good hyperparameter configuration plus 2–3 runs for reliability, and ES requires a single run, the total compute cost may be lower for ES despite its per-iteration expense.
Math Reasoning Benchmarks (Extended Results)
ES is compared against additional SOTA RL baselines on math reasoning:
| Benchmark | GRPO | Dr.GRPO | DAPO | ES | ES Rank |
|---|---|---|---|---|---|
| GSM8K | Competitive | Competitive | Competitive | Competitive | Top-2 |
| MATH500 | Competitive | Competitive | Competitive | Competitive | Top-2 |
| Minerva Math | — | — | — | Competitive | Top-3 |
| OlympiadBench | — | — | — | Competitive | Top-3 |
On these standard math benchmarks, ES performs comparably to the best RL methods, demonstrating that the advantages of ES (robustness, stability, no reward hacking) do not come at the cost of reduced performance on well-studied tasks.
Hyperparameter Sensitivity
| Hyperparameter | ES Sensitivity | RL Sensitivity | Implication |
|---|---|---|---|
| Learning rate (α) | Low | Very high | RL requires careful per-model tuning |
| Population/group size (N) | Low | Moderate | ES works with N=30; RL performance varies with group size |
| Noise scale (σ) | Moderate | N/A | ES-specific; σ=0.001 works across models |
| KL penalty (β) | N/A | Very high | RL-specific; wrong β causes training collapse or reward hacking |
The paper reports that a single ES configuration works across all 7 models, while RL requires separate hyperparameter sweeps for each model — a significant practical advantage.
7 Reproducibility
Code Availability
The complete source code is available at github.com/VsonicV/es-fine-tuning-paper:
| File | Purpose |
|---|---|
| `es_fine-tuning_conciseness.py` | ES fine-tuning for conciseness task (correlated noise) |
| `es_fine-tuning_conciseness_iid.py` | ES fine-tuning for conciseness task (i.i.d. noise) |
| `countdown/es_fine-tuning_countdown.py` | ES fine-tuning for Countdown task (correlated noise) |
| `countdown/es_fine-tuning_countdown_iid.py` | ES fine-tuning for Countdown task (i.i.d. noise) |
| `es_fine-tuning_countdown_accl.py` | Accelerated version with 10× speed-up (vLLM-based) |
| `requirement.txt` | Python dependencies |
Setup and Execution
# Environment setup
python -m venv es
source es/bin/activate
pip install -r requirement.txt
# Conciseness fine-tuning (2 GPUs)
accelerate launch \
--num_processes 2 \
--num_machines 1 \
--machine_rank 0 \
es_fine-tuning_conciseness.py \
--gpu_threads=1 \
--model_name=Qwen/Qwen2.5-7B-Instruct
# Countdown fine-tuning (4 GPUs)
accelerate launch \
--num_processes 4 \
--num_machines 1 \
--machine_rank 0 \
countdown/es_fine-tuning_countdown.py \
--data_sample 200 \
--model_name Qwen/Qwen2.5-3B-Instruct \
--gpu_threads 1
# Accelerated version with vLLM (4 GPUs)
python es_fine-tuning_countdown_accl.py \
--model_name Qwen/Qwen2.5-3B-Instruct \
--cuda_devices 0,1,2,3 \
--num_engines 4 \
--population_size 30 \
--num_iterations 1000
Reproducibility Assessment
| Criterion | Assessment | Notes |
|---|---|---|
| Code available | Yes | Full source code on GitHub |
| Data available | Yes | Standard public benchmarks (Countdown, GSM8K, MATH) |
| Models available | Yes | Public HuggingFace models (Qwen, Llama) |
| Fixed hyperparameters | Yes | N=30, σ=0.001, α=5×10⁻⁴ for all ES experiments |
| Random seed control | Partial | Seeds used for noise generation but not all sources of randomness documented |
| Hardware requirements | Moderate | 2–4 GPUs (80GB each) for most experiments |
| Accelerated version | Yes | 10× speed-up using vLLM for inference |
| RL baselines | Partial | RL implementation details referenced but separate hyperparameter sweeps needed |
Noise Variants
The repository provides two noise implementations:
| Variant | File | Description |
|---|---|---|
| Correlated noise | `es_fine-tuning_*.py` | Partially correlated noise across dimensions (original paper implementation) |
| i.i.d. noise | `es_fine-tuning_*_iid.py` | Independent noise in each parameter dimension |
The discussion at github.com/VsonicV/es-fine-tuning-paper/discussions/7 provides additional details on the difference. Both variants achieve similar results.
8 Compute and API Costs
Hardware Requirements
| Configuration | GPUs | GPU Memory | Purpose |
|---|---|---|---|
| Minimum (small models) | 1–2 × A100/H100 (80GB) | 80–160 GB total | Qwen-0.5B, Qwen-1.5B |
| Standard (medium models) | 4 × A100/H100 (80GB) | 320 GB total | Qwen-3B, Llama-3B |
| Full (large models) | 4–8 × H100 (80GB) | 320–640 GB total | Qwen-7B, Llama-8B |
Per-Iteration Cost Analysis
Each ES iteration involves N=30 perturbed model evaluations:
Per-iteration compute:
┌────────────────────────────────────────────────────────┐
│ 1. Generate N=30 noise seeds │ ~0 sec │
│ 2. For each of N=30 perturbations: │ │
│ a. Perturb model (layer-by-layer) │ ~2 sec │
│ b. Run inference on training batch │ ~10 sec │
│ c. Compute reward │ ~1 sec │
│ d. Restore model │ ~2 sec │
│ 3. Normalize rewards (z-score) │ ~0 sec │
│ 4. Aggregate update (layer × seed) │ ~5 sec │
├────────────────────────────────────────────────────────┤
│ Total per iteration (serial, 4 GPUs): │ ~2 min │
│ Total per iteration (accelerated, vLLM): │ ~12 sec │
└────────────────────────────────────────────────────────┘
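The ~2 min per-iteration figure follows directly from the step timings in the box above; a quick sanity check of the arithmetic (assuming the 30 evaluations split evenly across 4 GPUs):

```python
# Sanity check of the per-iteration wall-clock estimate above:
# 30 perturbation evaluations at ~15 s each (perturb + inference +
# reward + restore), spread over 4 GPUs, plus ~5 s for the update.
n_perturbations = 30
per_eval_s = 2 + 10 + 1 + 2  # perturb, inference, reward, restore
update_s = 5
gpus = 4

wall_s = n_perturbations * per_eval_s / gpus + update_s
print(f"~{wall_s / 60:.1f} min per iteration")
```

This lands at roughly two minutes, matching the serial figure; the vLLM-accelerated path compresses the dominant inference term to reach ~12 s.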
Total Training Cost Estimation
For a typical Countdown experiment (1000 iterations, 4 × H100):
| Version | Time per Iter | Total Time | GPU Hours | Est. Cloud Cost |
|---|---|---|---|---|
| Original | ~2 min | ~33 hours | ~132 H100-hrs | ~$400–$500 |
| Accelerated (10×) | ~12 sec | ~3.3 hours | ~13 H100-hrs | ~$40–$50 |
Cost Comparison with RL
| Method | GPU Memory | Training Time | Hyperparameter Tuning | Total Cost |
|---|---|---|---|---|
| PPO | ~100 GB (8B model) | ~8–24 hours | 3–5 sweeps needed | $500–$2,000 |
| GRPO | ~80 GB (8B model) | ~8–24 hours | 3–5 sweeps needed | $400–$1,500 |
| ES (original) | ~18 GB (8B model) | ~33 hours | 1 config for all models | $400–$500 |
| ES (accelerated) | ~18 GB (8B model) | ~3.3 hours | 1 config for all models | $40–$50 |
The accelerated ES version is dramatically cheaper than RL alternatives, primarily because:
1. No hyperparameter sweeps are needed (one configuration works for all models)
2. The lower memory footprint enables smaller/fewer GPUs
3. vLLM-based inference is highly optimized
4. There is no backpropagation overhead
Scaling Properties
| Parameter | Cost Impact | Scaling |
|---|---|---|
| Model size (d) | Linear | Larger models = longer inference |
| Population size (N) | Linear | More perturbations = more evaluations |
| Iterations (T) | Linear | More iterations = longer training |
| GPU count | Inverse linear | Parallelism reduces wall time |
| Training data | Sublinear | Batch size matters, not dataset size |
9 Architecture Solution
Algorithm Architecture
The ES fine-tuning system has a clean, modular architecture:
┌─────────────────────────────────────────────────────────────────┐
│ ES FINE-TUNING ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ OUTER LOOP (T iterations) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ INNER LOOP (N perturbations) │ │ │
│ │ │ │ │ │
│ │ │ For n = 1 to N (parallelizable): │ │ │
│ │ │ ┌─────────────────────────────────────────────┐ │ │ │
│ │ │ │ 1. Sample seed sₙ │ │ │ │
│ │ │ │ 2. Perturb θ in-place (layer by layer): │ │ │ │
│ │ │ │ For each layer ℓ: │ │ │ │
│ │ │ │ εₗ = generate_noise(sₙ, ℓ) │ │ │ │
│ │ │ │ θₗ += σ · εₗ │ │ │ │
│ │ │ │ 3. Generate responses (greedy decoding) │ │ │ │
│ │ │ │ 4. Compute reward Rₙ = R(responses) │ │ │ │
│ │ │ │ 5. Restore θ in-place (layer by layer): │ │ │ │
│ │ │ │ For each layer ℓ: │ │ │ │
│ │ │ │ εₗ = generate_noise(sₙ, ℓ) │ │ │ │
│ │ │ │ θₗ -= σ · εₗ │ │ │ │
│ │ │ └─────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Normalize: R̃ₙ = (Rₙ - mean(R)) / std(R) │ │
│ │ │ │
│ │ Update (decomposed, layer by layer, seed by seed): │ │
│ │ For each layer ℓ: │ │
│ │ For each seed sₙ: │ │
│ │ εₗ = generate_noise(sₙ, ℓ) │ │
│ │ θₗ += α · (1/N) · R̃ₙ · εₗ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ EVALUATION │ │
│ │ • Periodic evaluation on held-out test set │ │
│ │ • Checkpoint best-performing model │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Parallelization Strategy
The inner loop (N perturbations) is embarrassingly parallel — each perturbation evaluation is independent:
GPU 0 GPU 1 GPU 2 GPU 3
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Model copy │ │ Model copy │ │ Model copy │ │ Model copy │
│ │ │ │ │ │ │ │
│ Perturb s₁ │ │ Perturb s₂ │ │ Perturb s₃ │ │ Perturb s₄ │
│ Evaluate R₁ │ │ Evaluate R₂ │ │ Evaluate R₃ │ │ Evaluate R₄ │
│ Restore │ │ Restore │ │ Restore │ │ Restore │
│ │ │ │ │ │ │ │
│ Perturb s₅ │ │ Perturb s₆ │ │ Perturb s₇ │ │ Perturb s₈ │
│ Evaluate R₅ │ │ Evaluate R₆ │ │ Evaluate R₇ │ │ Evaluate R₈ │
│ Restore │ │ Restore │ │ Restore │ │ Restore │
│ ... │ │ ... │ │ ... │ │ ... │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │ │
└─────────────┬───────┴─────────────┬───────┘ │
│ │ │
└─────── All-gather rewards ────────────────────────┘
│
▼
Normalize + Aggregate Update
(decomposed across layers × seeds)
With gpu_threads > 1, each GPU can evaluate multiple perturbations concurrently using separate CUDA streams, further increasing parallelism.
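The seed-to-worker dispatch in the diagram can be sketched minimally as follows. `evaluate_with_seed` stands in for the full perturb → generate → score → restore cycle on one GPU; all names here are illustrative, not the repository's API.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_seeds(seeds, n_workers):
    """Round-robin split: worker k gets seeds[k], seeds[k + n_workers], ..."""
    return [seeds[k::n_workers] for k in range(n_workers)]

def run_iteration(seeds, n_workers, evaluate_with_seed):
    """Evaluate all perturbation seeds in parallel and gather rewards.

    Each evaluation is independent (embarrassingly parallel), so workers
    never need to communicate until the final all-gather.
    """
    shards = assign_seeds(seeds, n_workers)
    rewards = {}

    def worker(shard):
        for s in shard:
            rewards[s] = evaluate_with_seed(s)  # perturb, score, restore

    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        list(ex.map(worker, shards))  # force completion of all shards
    return [rewards[s] for s in seeds]  # "all-gather", in seed order
```

With N=30 and 4 workers, each worker evaluates 7 or 8 seeds in sequence, which is exactly the column-per-GPU picture above.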
Accelerated Architecture (vLLM)
The accelerated version replaces the standard HuggingFace inference with vLLM engines:
┌─────────────────────────────────────────────────────────────────┐
│ ACCELERATED ES ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ vLLM Inference Engines (per GPU) │ │
│ │ │ │
│ │ Engine 0 (GPU 0) Engine 1 (GPU 1) ... Engine K │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Continuous │ │ Continuous │ │ Continuous │ │ │
│ │ │ batching │ │ batching │ │ batching │ │ │
│ │ │ PagedAttention│ │ PagedAttention│ │ PagedAttn │ │ │
│ │ │ KV cache │ │ KV cache │ │ KV cache │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Weight Perturbation Manager │ │
│ │ │ │
│ │ • Perturbs vLLM model weights in-place │ │
│ │ • Regenerates KV cache after perturbation │ │
│ │ • Coordinates across engines │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ TensorBoard Logging │ │
│ │ │ │
│ │ • Training curves (reward, accuracy) │ │
│ │ • Per-iteration statistics │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The key insight: vLLM's continuous batching and PagedAttention make inference dramatically faster than standard HuggingFace generation, yielding the reported 10× speed-up.
10 Component Breakdown
Core Algorithm Components
| Component | Function | Implementation Detail |
|---|---|---|
| Noise Generator | Produces Gaussian perturbations for model parameters | Uses PyTorch RNG with stored seeds; regenerates identical noise from seeds alone |
| Layer-Level Perturbation | Perturbs and restores model weights in-place | Iterates over model layers; temporarily allocates one layer-sized tensor |
| Reward Evaluator | Computes task-specific reward for perturbed model | Greedy decoding → parse response → compute binary/composite reward |
| Reward Normalizer | Z-score normalization within iteration | R̃ = (R - mean(R)) / std(R) ensures consistent scale across iterations |
| Decomposed Updater | Applies aggregated parameter update | Triple-nested loop: layers × seeds × parameters; minimal peak memory |
| Parallelization Manager | Distributes perturbation evaluations across GPUs | Hugging Face Accelerate (original) or custom vLLM dispatcher (accelerated) |

What Is Deliberately Excluded
The paper is notable for what it removes from the standard ES toolbox:
| Standard ES Enhancement | Included? | Rationale |
|---|---|---|
| Rank transformation of rewards | No | Isolates core algorithm performance |
| Mirrored sampling (antithetic pairs) | No | Simplifies implementation |
| Weight decay | No | Avoids interference with analysis |
| Virtual batch normalization | No | Not applicable to LLM architecture |
| Adam optimizer for update | No | Uses simple SGD-style update |
| CMA (covariance adaptation) | No | Full covariance intractable at billion-parameter scale |
"This design choice isolates the core ES algorithm and demonstrates that strong performance can be achieved without auxiliary enhancements. In future work, each individual enhancement can be explored to further improve performance."
This minimalism is a strength of the paper — it demonstrates that raw ES (without engineering tricks) outperforms heavily tuned RL (with tricks and per-model hyperparameter sweeps).
Seven Implementation Innovations
The paper introduces seven implementation modifications that enable billion-parameter ES:
| # | Innovation | Problem Solved | Memory Impact |
|---|---|---|---|
| 1 | Noise retrieval with random seeds | Storing N noise vectors of d dimensions is prohibitive | Store N integers instead of N×d floats |
| 2 | Parallel evaluations | Sequential evaluation of N perturbations is slow | Assign seeds to GPUs; embarrassingly parallel |
| 3 | Layer-level in-place perturbation | Allocating a full noise tensor (billions of floats) is prohibitive | Only one layer-sized tensor in memory at a time |
| 4 | Reward normalization | Reward scales vary across iterations and tasks | Z-score ensures consistent gradient magnitudes |
| 5 | Greedy decoding | Stochastic decoding confounds parameter-space and action-space exploration | Deterministic evaluation isolates parameter-space effects |
| 6 | Decomposed parameter update | Aggregating updates requires materializing full-size tensors | Layer × seed decomposition minimizes peak memory |
| 7 | Learning rate digestion | σ in the update equation adds a redundant hyperparameter | Absorb 1/σ into α, simplifying tuning |
11 Core Mechanisms (Detailed)
Mechanism 1: Natural Evolution Strategies (Simplified)
The theoretical foundation is Natural Evolution Strategies (NES), which views the optimization as operating on the search distribution rather than individual solutions.
Standard NES formulation:
The objective is to maximize the expected reward under a parameterized search distribution $\pi_\psi$:

$$J(\psi) = \mathbb{E}_{\theta \sim \pi_\psi}[R(\theta)]$$

For a Gaussian search distribution $\pi_\psi = \mathcal{N}(\mu, \Sigma)$, the natural gradient update on $\mu$ (with fixed $\Sigma = \sigma^2 I$) reduces to:

$$\mu \leftarrow \mu + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \cdot \varepsilon_n$$

where $\varepsilon_n \sim \mathcal{N}(0, I)$ and $R_n = R(\mu + \sigma \cdot \varepsilon_n)$.
Simplifications in this paper:
| Standard NES | This Paper | Effect |
|---|---|---|
| Full covariance Σ | Fixed Σ = σ²I | Eliminates O(d²) covariance estimation |
| Natural gradient on Σ | No adaptation of Σ | σ is fixed hyperparameter |
| Large population (10,000+) | N = 30 | Dramatically reduces per-iteration cost |
| 1/σ in update equation | Absorbed into α | One fewer hyperparameter |
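The simplified update can be exercised end-to-end on a toy objective. Here a quadratic "reward" stands in for LLM evaluation; N=30 and the z-score reward normalization follow the paper, while d, σ, and α are toy values chosen for the small problem, not the paper's settings.

```python
import numpy as np

def reward(theta):
    """Stand-in for R(theta): a quadratic maximized at theta = 3."""
    return -np.sum((theta - 3.0) ** 2)

rng = np.random.default_rng(0)
d, N = 20, 30              # toy dimensionality; paper's population size
sigma, alpha = 0.1, 0.05   # toy noise scale and learning rate
mu = np.zeros(d)

for _ in range(800):
    eps = rng.standard_normal((N, d))                 # eps_n ~ N(0, I)
    R = np.array([reward(mu + sigma * e) for e in eps])
    R_tilde = (R - R.mean()) / (R.std() + 1e-8)       # z-score normalization
    mu += alpha * (R_tilde @ eps) / N                 # simplified NES update

print(np.round(np.abs(mu - 3.0).max(), 2))
```

Despite a population far smaller than the dimensionality would classically suggest, the mean converges to the optimum, which is the small-population phenomenon the paper demonstrates at billion-parameter scale.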
Mechanism 2: Memory-Efficient Perturbation via Seed-Based Noise
The most critical engineering innovation is the use of random seeds to represent perturbation noise implicitly:
Traditional approach (infeasible):
Store N noise vectors, each of dimension d
Memory: N × d × 4 bytes (float32)
For d = 8B, N = 30: 30 × 8×10⁹ × 4 = 960 GB ← IMPOSSIBLE
Seed-based approach (this paper):
Store N random seeds (integers)
Memory: N × 8 bytes (int64) = 240 bytes ← TRIVIAL
To apply the perturbation for seed sₙ (sketch):

rng = torch.Generator().manual_seed(s_n)
for param in model.parameters():
    eps = torch.randn(param.shape, generator=rng)
    param.data += sigma * eps
    del eps  # only one layer-sized tensor alive at a time

To restore:

rng = torch.Generator().manual_seed(s_n)  # SAME seed!
for param in model.parameters():
    eps = torch.randn(param.shape, generator=rng)
    param.data -= sigma * eps  # subtract the identical noise to restore
    del eps
The key insight is that torch.randn with a fixed seed produces identical noise each time. By storing only the seed, the system can regenerate the exact noise for:
1. Applying the perturbation (add σ·ε)
2. Restoring the original parameters (subtract σ·ε)
3. Computing the parameter update (weight by R̃ₙ)
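The round trip can be demonstrated concretely, with NumPy's seeded generator standing in for `torch.Generator` (layer shapes and σ are illustrative):

```python
import numpy as np

sigma, seed = 0.01, 1234
layers = [np.zeros((4, 4)), np.zeros(8)]      # toy per-layer parameters
originals = [l.copy() for l in layers]

# Apply: regenerate the noise from the seed, one layer at a time
rng = np.random.default_rng(seed)
for l in layers:
    l += sigma * rng.standard_normal(l.shape)

# Restore: the SAME seed replays the SAME noise stream
rng = np.random.default_rng(seed)
for l in layers:
    l -= sigma * rng.standard_normal(l.shape)

for l, orig in zip(layers, originals):
    assert np.allclose(l, orig)               # back to the original weights
```

Note that restoration is exact up to floating-point rounding; the 240 bytes of stored seeds replace what would otherwise be hundreds of gigabytes of noise tensors.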
Mechanism 3: Decomposed Parameter Update
The standard update equation requires materializing the full update vector:
Standard: Δθ = α · (1/N) · Σₙ R̃ₙ · εₙ
This requires materializing Σₙ R̃ₙ · εₙ, which is d-dimensional.
For d = 8B: 32 GB in float32.
Decomposed (this paper):
For each layer ℓ:
For each seed sₙ:
εₗ = generate_noise(sₙ, ℓ)
θₗ += α · (1/N) · R̃ₙ · εₗ
del εₗ
Peak memory: one layer-sized tensor (max ~2 GB for largest layers)
This decomposition exploits the linearity of addition: the order of summation doesn't matter, so we can accumulate the update layer-by-layer and seed-by-seed, never materializing the full d-dimensional update vector.
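This equivalence is easy to verify on toy shapes. The NumPy sketch below (illustrative sizes and rewards) mirrors the skip-ahead regeneration described above: `layer_noise` replays the seeded stream past earlier layers to recover exactly the noise for layer `i`:

```python
import numpy as np

N = 4                                        # toy population (paper uses N = 30)
shapes = [(3, 3), (5,)]                      # two toy "layers"
seeds = list(range(N))
rewards = np.array([1.0, -1.0, 0.5, -0.5])   # already z-scored
alpha = 0.1

def layer_noise(seed, shapes, i):
    """Regenerate the noise for layer i by replaying the seeded stream."""
    rng = np.random.default_rng(seed)
    for j in range(i):                       # advance past earlier layers
        rng.standard_normal(shapes[j])
    return rng.standard_normal(shapes[i])

# Reference: full update, all noise tensors weighted and summed at once
full = [alpha / N * sum(r * layer_noise(s, shapes, i) for s, r in zip(seeds, rewards))
        for i in range(len(shapes))]

# Decomposed: only one layer-sized tensor alive at any moment
layers = [np.zeros(s) for s in shapes]
for i, layer in enumerate(layers):
    for s, r in zip(seeds, rewards):
        eps = layer_noise(s, shapes, i)      # regenerate, use, discard
        layer += alpha * r * eps / N

for got, want in zip(layers, full):
    assert np.allclose(got, want)            # linearity: summation order doesn't matter
```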
Mechanism 4: Greedy Decoding for Deterministic Evaluation
The use of greedy decoding during evaluation is a deliberate methodological choice:
With stochastic decoding (temperature > 0):
Same model parameters → different responses each time
Source of variation: parameter perturbation + sampling randomness
Cannot attribute performance difference to parameter change alone
With greedy decoding (temperature = 0):
Same model parameters → identical response every time
Source of variation: parameter perturbation only
Clean attribution: any performance difference is due to the perturbation
This is analogous to controlled experiments in science — by eliminating one source of variation (decoding randomness), the system can attribute performance differences purely to the parameter perturbation, enabling a cleaner gradient estimate.
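The distinction can be illustrated with a toy next-token distribution (NumPy sketch; the logits and vocabulary are made up):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])       # toy next-token scores

def greedy(logits):
    return int(np.argmax(logits))             # temperature = 0: always the top token

def sample(logits, rng, temperature=1.0):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))  # temperature > 0: random draw

rng = np.random.default_rng(0)
greedy_tokens = {greedy(logits) for _ in range(100)}
sampled_tokens = {sample(logits, rng) for _ in range(100)}
assert len(greedy_tokens) == 1                # same parameters -> same output
assert len(sampled_tokens) > 1                # sampling adds its own variance
```

Under greedy decoding, any change in the output can only come from the parameter perturbation, which is exactly what the reward must attribute.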
Mechanism 5: Z-Score Reward Normalization
Reward normalization is critical for stable training:
Raw rewards (iteration t): [0.0, 0.0, 1.0, 0.0, ..., 1.0] (binary)
Mean: 0.2, Std: 0.4
Normalized: [-0.5, -0.5, 2.0, -0.5, ..., 2.0]
Effect on update:
Perturbations that led to correct answers (R̃ > 0) are reinforced
Perturbations that led to incorrect answers (R̃ < 0) are suppressed
The magnitude of reinforcement/suppression is proportional to
how far above/below average the reward is
Without normalization, the update magnitude would depend on the absolute reward scale, which varies across tasks and training stages. Z-score normalization ensures consistent gradient magnitudes, removing a potential source of training instability.
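The worked numbers above can be reproduced directly (the population standard deviation, NumPy's default, gives exactly 0.4 here):

```python
import numpy as np

# Binary rewards for 10 rollouts, 2 correct: mean 0.2, population std 0.4
rewards = np.array([0., 0., 1., 0., 0., 0., 0., 0., 0., 1.])
normalized = (rewards - rewards.mean()) / rewards.std()
print(normalized)  # incorrect answers -> -0.5, correct answers -> 2.0
```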
Mechanism 6: Why Small Populations Work
The most surprising finding is that N=30 suffices for billion-parameter spaces. The paper does not provide a full theoretical explanation, but the results suggest several hypotheses:
1. Effective dimensionality is much lower than nominal dimensionality. LLM parameter spaces are highly structured — most directions in parameter space have minimal impact on output. The 30 perturbations may explore the effective subspace efficiently.
2. The pre-trained model provides a strong initialization. ES is not searching from scratch — it is fine-tuning from a pre-trained model that already has the right structure. Small perturbations around this initialization are sufficient to discover improved parameters.
3. Response-level reward provides a strong signal. For binary rewards (correct/incorrect), even a small population can identify which perturbations are beneficial because the reward signal is clear and unambiguous.
4. Greedy decoding amplifies perturbation effects. Without sampling noise, even small parameter changes produce detectable output differences, making each perturbation informative.
12 Programming Language
Implementation Stack
| Component | Language | Framework | Notes |
|---|---|---|---|
| ES algorithm | Python 3.10+ | PyTorch | Core training loop, noise generation, update |
| Parallelization | Python | Hugging Face Accelerate | Multi-GPU distribution (original version) |
| Accelerated inference | Python | vLLM | High-throughput inference engine (accelerated version) |
| Reward computation | Python | Custom | Task-specific reward functions |
| Logging | Python | TensorBoard | Training curves and per-iteration stats (accelerated) |
| Model loading | Python | Hugging Face Transformers | Loading pre-trained LLMs |
Code Structure
es-fine-tuning-paper/
├── es_fine-tuning_conciseness.py # Conciseness task (correlated noise)
├── es_fine-tuning_conciseness_iid.py # Conciseness task (i.i.d. noise)
├── es_fine-tuning_countdown_accl.py # Countdown (accelerated, vLLM)
├── countdown/
│ ├── es_fine-tuning_countdown.py # Countdown task (correlated noise)
│ └── es_fine-tuning_countdown_iid.py # Countdown task (i.i.d. noise)
└── requirement.txt # Dependencies
Dependency Analysis
Key dependencies include:
| Package | Purpose | Version Constraint |
|---|---|---|
| `torch` | Tensor operations, noise generation, model manipulation | ≥ 2.0 |
| `transformers` | Model loading, tokenization | Recent |
| `accelerate` | Multi-GPU distribution | Recent |
| `vllm` | High-throughput inference (accelerated version) | 0.11.0 |
| `tensorboard` | Training visualization (accelerated version) | Any |
Code Complexity
The implementation is notably compact:
| File | Estimated LOC | Complexity |
|---|---|---|
| `es_fine-tuning_conciseness.py` | ~300–500 | Medium |
| `countdown/es_fine-tuning_countdown.py` | ~300–500 | Medium |
| `es_fine-tuning_countdown_accl.py` | ~500–800 | Medium–High (vLLM integration) |
| Total | ~1,100–1,800 | Medium |
This compactness is a strength — the entire ES fine-tuning pipeline fits in a single readable file, making the algorithm transparent and easy to modify.
Why Python?
Python is the obvious choice for several reasons:
1. PyTorch native. All tensor operations, GPU management, and model manipulation use PyTorch.
2. Hugging Face ecosystem. Model loading, tokenization, and the Accelerate library are Python-native.
3. vLLM. The accelerated version uses vLLM, which is a Python library.
4. Research community. Python is the lingua franca of ML research.
The per-iteration overhead of Python is negligible compared to the GPU computation time for inference and noise generation.
13 Memory Management
GPU Memory Analysis
The paper's most significant practical contribution is its memory efficiency. The analysis below compares ES and RL memory requirements for an 8B-parameter model in bf16 precision:
RL Fine-Tuning (PPO) Memory Breakdown:
| Component | Size (GB) | Notes |
|---|---|---|
| Model parameters (bf16) | 16 | 8B × 2 bytes |
| Gradient buffers (fp32) | 32 | 8B × 4 bytes (full precision gradients) |
| Adam optimizer states (m, v) | 64 | 2 × 8B × 4 bytes |
| Activation cache | 8–32 | Depends on batch size, sequence length |
| Reference model (KL penalty) | 16 | Full copy of base model |
| Total | 136–160 | Requires 2–4 × A100-80GB |
ES Fine-Tuning Memory Breakdown:
| Component | Size (GB) | Notes |
|---|---|---|
| Model parameters (bf16) | 16 | 8B × 2 bytes |
| Layer-sized noise tensor (temp) | 0.1–2 | Largest single layer; allocated/freed per layer |
| Random seeds | < 0.001 | 30 integers |
| Inference KV cache | 2–4 | For greedy decoding |
| Total | 18–22 | Fits on 1 × A100-80GB |
Memory reduction: ~7–8×
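The totals in both tables follow from simple per-component arithmetic. A back-of-envelope sketch (GB ≈ 10⁹ bytes; batch-dependent activations and KV cache are noted in comments rather than modeled):

```python
# Component sizes mirror the breakdown tables above, not measured values.
def rl_memory_gb(params_b):
    model = params_b * 2       # bf16 parameters (2 bytes each)
    grads = params_b * 4       # fp32 gradients
    adam = params_b * 8        # two fp32 moment buffers (m, v)
    reference = params_b * 2   # frozen bf16 copy for the KL penalty
    return model + grads + adam + reference

def es_memory_gb(params_b, largest_layer_gb=2.0):
    # model + one transient layer-sized noise tensor; seeds are negligible
    return params_b * 2 + largest_layer_gb

print(rl_memory_gb(8))  # 128 GB; + 8-32 GB activations -> the 136-160 GB in the table
print(es_memory_gb(8))  # 18 GB; + 2-4 GB KV cache -> the 18-22 GB range in the table
```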
Memory-Efficient Operations
Three operations dominate the memory management strategy:
1. In-Place Perturbation:
# Perturb model in-place (layer by layer)
rng = torch.Generator().manual_seed(seed)
for layer in model.layers:
    noise = torch.randn(layer.shape, generator=rng, device=device)
    layer.data.add_(noise, alpha=sigma)  # in-place!
    del noise  # free immediately
2. In-Place Restoration:
# Restore model in-place (regenerate the same noise)
rng = torch.Generator().manual_seed(seed)  # reset to the SAME seed
for layer in model.layers:
    noise = torch.randn(layer.shape, generator=rng, device=device)
    layer.data.sub_(noise, alpha=sigma)  # subtract = restore
    del noise
3. Decomposed Update:
# Update in-place (layer by layer, seed by seed)
for layer_idx, layer in enumerate(model.layers):
    for seed, reward in zip(seeds, normalized_rewards):
        rng = torch.Generator().manual_seed(seed)
        # advance the noise stream past earlier layers
        for skip in range(layer_idx):
            torch.randn(model.layers[skip].shape, generator=rng)
        noise = torch.randn(layer.shape, generator=rng, device=device)
        layer.data.add_(noise, alpha=lr * reward / N)  # in-place!
        del noise
Memory Scaling with Model Size
| Model | Parameters | RL Memory | ES Memory | Savings |
|---|---|---|---|---|
| 0.5B | 0.5B | ~12 GB | ~2 GB | 6× |
| 1.5B | 1.5B | ~25 GB | ~4 GB | 6× |
| 3B | 3B | ~48 GB | ~8 GB | 6× |
| 7B | 7B | ~110 GB | ~16 GB | 7× |
| 8B | 8B | ~140 GB | ~18 GB | 8× |
The memory savings scale better for larger models because RL's gradient and optimizer state overhead grows with parameter count, while ES's overhead remains constant (one layer-sized tensor).
Comparison with Memory-Efficient RL Alternatives
| Method | Memory (8B model) | Fine-Tuning Quality | Parameter Coverage |
|---|---|---|---|
| Full RL (PPO/GRPO) | ~140 GB | Baseline | Full parameters |
| LoRA RL | ~30–40 GB | Slightly reduced | Low-rank adapters only |
| QLoRA RL | ~20–30 GB | Reduced | Quantized + low-rank |
| MeZO (zeroth-order) | ~20 GB | Poor (below baselines) | Full parameters |
| ES (this paper) | ~18 GB | Exceeds full RL | Full parameters |
ES achieves the lowest memory footprint while maintaining the highest fine-tuning quality — a combination that no prior method achieved.
14 Continued Learning
Within-Run Learning Dynamics
The ES optimization trajectory exhibits distinctive learning dynamics compared to RL:
RL learning curve (typical):
Accuracy
│ ┌─── plateau / oscillation ───┐
│ / \____
│ / \── reward hacking begins
│ /
│ /
│ / rapid initial improvement
│ /
│ /
└─────────────────────────────────────── Iterations
ES learning curve (typical):
Accuracy
│ ┌──── continued gradual improvement
│ /
│ /
│ /
│ /
│ /
│ /
│ /
│ /
│ / steady, monotonic improvement
│ /
│ /
│ /
└─────────────────────────────────────── Iterations
ES exhibits slower but more steady improvement without the plateaus, oscillations, or reward hacking that characterize RL fine-tuning. The learning curve is more predictable, enabling better estimation of required compute budgets.
Population Dynamics
Unlike population-based evolutionary systems (AlphaEvolve, OpenEvolve), ES in this paper maintains a single model with a distribution around it:
Iteration t: Iteration t+1:
(center shifted toward high-reward direction)
· · · ·
· · · · · · · ·
· · θₜ · · ────► · · θₜ₊₁ · ·
· · · · · · · ·
· · · ·
The distribution (cloud) moves through parameter space.
Individual perturbations are transient — only the center persists.
This is fundamentally different from population-based approaches where multiple diverse solutions coexist. The trade-off:
| Property | ES (Single Center + Distribution) | Population-Based (AlphaEvolve) |
|---|---|---|
| Diversity | Low (Gaussian around one point) | High (multiple diverse solutions) |
| Memory | Very low | High (N full models) |
| Exploration | Local (radius σ around center) | Global (multiple starting points) |
| Risk of local optima | Higher | Lower |
| Implementation complexity | Very low | High |
Cross-Task Transfer
The paper does not investigate cross-task transfer — each experiment starts from a pre-trained model and fine-tunes for a single task. However, the results suggest several transfer possibilities:
- Sequential fine-tuning. An ES-tuned model for one task could serve as the starting point for ES fine-tuning on a second task.
- Multi-task reward. A composite reward function combining multiple task metrics could enable simultaneous multi-task fine-tuning.
- Curriculum learning. Starting with an easy task (high reward signal) and progressively adding harder tasks could improve sample efficiency.
None of these are explored in the paper, representing opportunities for future work.
Continued Improvement Beyond Reported Results
The paper's accelerated version (10× speed-up) suggests that the original results may not represent the limit of ES performance. With 10× faster iterations, significantly more iterations become practical, potentially yielding better final performance.
The paper also notes that standard ES enhancements (mirrored sampling, rank transformation, Adam optimizer) were deliberately excluded. Adding these could further improve performance, and the paper explicitly invites this future work.
15 Applications
Primary Application: LLM Post-Training
The paper positions ES as a general-purpose post-training paradigm for LLMs. Current applications demonstrated:
| Application | Task Type | Reward Structure | ES Advantage |
|---|---|---|---|
| Reasoning fine-tuning | Symbolic reasoning (Countdown) | Binary outcome | Long-horizon tolerance, cross-model robustness |
| Math reasoning | GSM8K, MATH500, etc. | Binary correctness | Competitive with SOTA RL, more stable |
| Behavioral tuning | Conciseness optimization | Composite (quality + length) | Reward hacking resistance |
| Puzzle solving | Number sequences, logic grids | Binary/custom | Novel solutions unreachable by base models |
Future Applications Suggested by the Paper
1. RLHF replacement. ES could replace PPO/GRPO in the RLHF pipeline, using human preference reward models but optimizing via ES rather than RL. The reward hacking resistance is particularly valuable for alignment.
2. Instruction following. Fine-tuning models to follow instructions more precisely, where correctness is binary (followed the instruction or didn't).
3. Code generation. Fine-tuning for code generation, where reward is based on passing test cases — a naturally long-horizon, binary-outcome task.
4. Safety alignment. ES's resistance to reward hacking makes it a candidate for safety-critical alignment tasks, where exploiting loopholes in the reward function is a major concern.
5. Distributed fine-tuning. ES's embarrassingly parallel nature makes it ideal for distributed fine-tuning across multiple machines or even multiple data centers. Only scalar rewards need to be communicated, not gradients.
Implications for the Evolutionary AI Field
This paper has several implications for the broader evolutionary AI landscape:
1. Neuroevolution is back. The paper revives direct parameter-space optimization of neural networks — a field that had been dormant since the early 2010s when gradient-based methods became dominant. By showing that ES works at billion-parameter scale, it reopens research directions that were considered closed.
2. Zeroth-order optimization is viable. The success of ES (a zeroth-order method) at scale challenges the assumption that gradient information is necessary for efficient optimization of LLMs. This opens the door to other zeroth-order methods (CMA-ES, random search, simulated annealing) being applied to LLMs.
3. Backpropagation is not always necessary. For post-training (as opposed to pre-training), backpropagation may not be the optimal optimization strategy. ES's ability to avoid backprop has practical benefits (memory, simplicity) without sacrificing quality.
4. Small populations suffice. The N=30 finding challenges the conventional wisdom in evolutionary computation that population size must scale with problem dimensionality. This has implications for all population-based optimization methods applied to high-dimensional problems.
Relevance to OmniEvolve
This paper is highly relevant to the OmniEvolve project from multiple angles:
| OmniEvolve Component | Relevance | Integration Potential |
|---|---|---|
| Search backends | ES could be a search backend for parameter-space optimization | Implement as ESSearchBackend with layered perturbation |
| Mutation operators | Gaussian perturbation is a principled mutation operator | Complement LLM-based mutations with ES-style perturbations |
| Evaluation | Greedy decoding + binary reward is a clean evaluation pattern | Adopt for tasks with binary correctness |
| Memory management | Seed-based noise storage is an efficient memory pattern | Apply to candidate storage in general |
| Benchmarks | Countdown task is a well-defined benchmark | Include in benchmark suite |
Limitations
The paper acknowledges several limitations:
- Scale ceiling unknown. The largest model tested is 8B parameters. Whether ES remains effective at 70B or 405B is an open question.
- Convergence speed. ES is slower per-iteration than RL for some tasks. The total compute may be higher, even if the per-run cost is lower (due to hyperparameter stability).
- Exploration radius. With fixed σ, the exploration radius is limited. Adaptive σ (as in CMA-ES) could improve performance but adds complexity.
- No theoretical guarantees. The paper provides empirical evidence but no convergence guarantees for ES in billion-parameter spaces. The theoretical analysis of Vemula et al. (2019) would predict poor performance, which is empirically falsified but not theoretically explained.
- Limited to post-training. ES is applied to fine-tuning from pre-trained models, not to pre-training from scratch. Whether ES could scale to pre-training is an open question.
Impact Assessment
| Dimension | Assessment |
|---|---|
| Scientific novelty | Very High — first successful full-parameter ES at billion scale |
| Practical utility | High — cheaper, more stable, no reward hacking |
| Reproducibility | High — full code, public models, fixed hyperparameters |
| Generality | High — works across model families, sizes, and tasks |
| Theoretical depth | Medium — empirical strength, limited theoretical explanation |
| Community adoption | Growing — 340+ stars, ICML acceptance, active discussions |
| Long-term impact | Potentially transformative — could reshape post-training paradigm |
Position in the Evolutionary AI Landscape
Parameter Space ─────────────────────────────────────► Action/Code Space
│ │
│ ES at Scale AlphaEvolve / FunSearch │
│ (this paper) (Google DeepMind) │
│ ┌───────────┐ ┌─────────────────────┐ │
│ │ Perturb │ │ LLM generates code │ │
│ │ model │ │ mutations; evaluator │ │
│ │ weights │ │ scores code quality │ │
│ │ directly │ │ │ │
│ └───────────┘ └─────────────────────┘ │
│ │
│ Sakana Model Merging EvoPrompting │
│ ┌───────────┐ ┌─────────────────────┐ │
│ │ Evolve │ │ Evolve prompts / │ │
│ │ merging │ │ prompt templates │ │
│ │ weights │ │ │ │
│ └───────────┘ └─────────────────────┘ │
│ │
Low-level ◄─────────────────────────────────────── High-level
(continuous) (discrete/symbolic)
ES at Scale occupies the lowest-level, most direct position in this landscape — it evolves the raw parameters of the model, without any abstraction layer (code, prompts, merging weights). This directness is both its strength (no information loss through abstraction) and its limitation (exploration is local in parameter space).