Evolution Strategies at Scale
First successful application of ES to full-parameter LLM fine-tuning at billion-parameter scale without dimensionality reduction
Organization: Cognizant AI Labs / University of Texas at Austin
Published: September 2025 (v1), February 2026 (v2)
Type: Research Paper (ICML 2026)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026
Table of Contents
- Full Title and Attribution
- Authors and Team
- Core Contribution
- Supported Solutions
- LLM Integration
- Key Results
- Reproducibility
- Compute and API Costs
- Architecture Solution
- Component Breakdown
- Core Mechanisms (Detailed)
- Programming Language
- Memory Management
- Continued Learning
- Applications
1 Full Title and Attribution
Full Title: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
arXiv: 2509.24372 (cs.LG, cs.AI, cs.NE)
Repository: github.com/VsonicV/es-fine-tuning-paper (340+ stars)
Venue: ICML 2026
Submission History:
- v1: September 29, 2025 (original submission)
- v2: February 6, 2026 (revised version with additional experiments)
DOI: 10.48550/arXiv.2509.24372
License: CC BY-NC-SA 4.0
Lineage: Builds on the intellectual foundation of OpenAI ES (Salimans et al., 2017) and Natural Evolution Strategies (Wierstra et al., 2008, 2014). Positioned as the first work to scale ES from millions of parameters (prior art) to billions of parameters (LLMs). The paper explicitly challenges the assumption — widely held since Vemula et al. (2019) — that parameter-space exploration is inherently unscalable.
Key Claim: ES is not merely a viable alternative to RL for LLM fine-tuning, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning.
2 Authors and Team
Author List
| Author | Affiliation | Role / Expertise |
|---|---|---|
| Xin Qiu | Cognizant AI Labs | Lead author, corresponding author. ES implementation and experimental design. |
| Yulu Gan | Cognizant AI Labs | ES implementation, experimental evaluation |
| Conor F. Hayes | Cognizant AI Labs | RL baseline implementation, comparative analysis |
| Qiyao Liang | Cognizant AI Labs | Experimental evaluation, benchmark design |
| Yinggan Xu | Cognizant AI Labs | Infrastructure, scaling experiments |
| Roberto Dailey | Cognizant AI Labs | Infrastructure, GPU parallelization |
| Elliot Meyerson | Cognizant AI Labs | Evolutionary computation expertise, research direction |
| Babak Hodjat | Cognizant AI Labs | Senior research leadership, evolutionary AI strategy |
| Risto Miikkulainen | Cognizant AI Labs / UT Austin | Senior author. Neuroevolution pioneer, NEAT co-creator |
Team Context
This paper comes from Cognizant AI Labs, the research arm of Cognizant Technology Solutions, which maintains one of the most established evolutionary computation research groups in industry. The team is led by Babak Hodjat (co-founder of Sentient Technologies, one of the largest AI startups focused on evolutionary computation) and Risto Miikkulainen (UT Austin professor, co-creator with Kenneth O. Stanley of NEAT, NeuroEvolution of Augmenting Topologies, one of the most cited neuroevolution algorithms).
Elliot Meyerson is notable for his work on evolutionary search with LLMs as mutation operators — his 2024 paper on "Language Model Crossover" is a key precursor to the LLM-as-evolutionary-operator paradigm used in systems like AlphaEvolve.
The team's collective expertise in neuroevolution gives this paper particular authority. When the group behind NEAT, which also scaled evolutionary optimization to production systems at Sentient Technologies, claims that ES scales to billion-parameter LLMs, the community pays attention.
Institutional Significance
Cognizant AI Labs occupies a unique position in the evolutionary AI landscape:
| Institution | Focus | Key Contributions |
|---|---|---|
| Google DeepMind | LLM-as-mutation-operator (code evolution) | AlphaEvolve, FunSearch, AlphaTensor |
| Sakana AI | Model merging via evolution | Evolutionary Model Merging |
| Cognizant AI Labs | Direct parameter-space evolution of LLMs | This paper (ES at Scale) |
| OpenAI (historical) | ES for RL policy optimization | OpenAI ES (Salimans et al., 2017) |
While DeepMind uses LLMs to evolve code (programs, algorithms), and Sakana uses evolution to merge LLMs, Cognizant's contribution is fundamentally different: they use evolution to directly optimize the parameters of LLMs. This is the most classical form of neuroevolution, applied at unprecedented scale.
3 Core Contribution
Key Novelty: For the first time, Evolution Strategies (ES) is successfully scaled to direct, full-parameter fine-tuning of LLMs with billions of parameters — without any dimensionality reduction (no LoRA, no final-layer-only, no action-space surrogates). The paper demonstrates that ES outperforms established RL methods (PPO, GRPO, Dr.GRPO) across multiple axes, overturning the widespread assumption that ES cannot scale to modern model sizes.
The Assumption Overturned
The conventional wisdom in the field, established by Vemula et al. (2019), held that parameter-space exploration complexity scales quadratically with parameter count (O(d²), where d is the dimensionality), making it intractable for models with billions of parameters. Prior ES applications had been limited to:
| Prior Work | Year | Parameters | Population Size |
|---|---|---|---|
| Salimans et al. (OpenAI ES) | 2017 | ~4M | 10,000+ |
| Zhang et al. | 2017 | ~3M | 10,000+ |
| Lehman et al. | 2018 | ~167K | Large |
| Lorenc & Neruda | 2025 | ~2.5M | Large |
| Toledano-López et al. | 2022 | 325 (last layer only) | Small |
| Jin et al. | 2024 | 1,600 (LoRA adapters only) | Small |
This paper:
| This Work | Year | Parameters | Population Size |
|---|---|---|---|
| ES at Scale | 2025–2026 | 0.5B – 8B | 30 |
The jump is dramatic: from millions to billions of parameters, and from populations of 10,000+ down to just 30. Each change alone was assumed to be fatal to ES performance; together they should have been catastrophic. Instead, ES outperforms RL.
Six Advantages of ES Over RL for LLM Fine-Tuning
The paper identifies six systematic advantages:
1. Long-horizon reward tolerance. ES needs only response-level (outcome) rewards, not token-level credit assignment. For reasoning tasks where only the final answer is graded, ES avoids the credit assignment problem entirely.
2. Small populations in high-dimensional spaces. A population of just 30 is sufficient to search in multi-billion-parameter spaces. Previous work assumed populations must be proportional to dimensionality.
3. Cross-model robustness. ES consistently fine-tunes all tested LLMs (Qwen-2.5 and Llama-3.x families, 0.5B–8B). RL methods fail on some models, particularly smaller ones.
4. Reward hacking resistance. ES optimizes a solution distribution (the Gaussian perturbation cloud), which is harder to hack than RL's single-solution optimization. RL tends to exploit reward function loopholes.
5. Cross-run stability. ES produces consistent results across multiple runs with the same hyperparameters. RL is often unstable, requiring expensive hyperparameter sweeps per model.
6. Memory efficiency. ES requires only inference (no backpropagation), eliminating the need for gradient storage, optimizer states, and activation caching, yielding significant GPU memory savings.
Paradigm Positioning
The paper positions ES not as a niche technique but as a new post-training paradigm:
All three post-training routes start from a pre-trained model:

| Property | SFT (Supervised Fine-Tuning) | RLHF / RL (PPO, GRPO, DPO, etc.) | ES Fine-Tuning (This Paper) |
|---|---|---|---|
| Optimization | Gradient-based | Gradient-based | No gradients, no backprop |
| Supervision | Requires labels | Requires reward model or rule | Requires reward function only |
| Reward granularity | None (deterministic supervised training) | Token-level or outcome rewards | Response-level rewards only |
| Exploration | None | Action-space exploration | Parameter-space exploration |
4 Supported Solutions
Fine-Tuning Tasks Evaluated
The paper evaluates ES on three categories of tasks:
1. Symbolic Reasoning: Countdown Task
The Countdown task (Gandhi et al., 2024; Pan et al., 2025) requires the model to combine given numbers using arithmetic operations to reach a target number.
| Feature | Detail |
|---|---|
| Task format | Given numbers [a, b, c, d], reach target T using +, −, ×, ÷ |
| Reward type | Binary outcome: correct (1.0) or incorrect (0.0) |
| Horizon | Long — full response generation before reward |
| Training set | 200 sampled problems |
| Test set | Held-out evaluation set |
| Models tested | 7 models across Qwen-2.5 and Llama-3.x families (0.5B–8B) |
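To make the binary outcome reward in the table concrete, here is an illustrative checker for the Countdown grading rule described above. The parser and the `countdown_reward` helper are assumptions for illustration, not the authors' code; the paper specifies only that reward is response-level and binary (1.0 for a correct combination, 0.0 otherwise).

```python
import ast
import operator

# Allowed arithmetic operators for the Countdown task: +, -, *, /
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate a restricted arithmetic AST (binary ops over numbers only)."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def _leaves(node):
    """Collect the numeric leaves so we can check number usage."""
    if isinstance(node, ast.BinOp):
        return _leaves(node.left) + _leaves(node.right)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return [node.value]
    raise ValueError("disallowed expression")

def countdown_reward(expression: str, numbers: list, target: int) -> float:
    """Binary outcome reward: 1.0 iff the expression uses exactly the
    given numbers and evaluates to the target; 0.0 otherwise."""
    try:
        tree = ast.parse(expression, mode="eval").body
        if sorted(_leaves(tree)) != sorted(numbers):
            return 0.0
        return 1.0 if abs(_eval(tree) - target) < 1e-6 else 0.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0
```

For example, `countdown_reward("(4 + 2) * 6", [4, 2, 6], 36)` returns 1.0, while any malformed or incorrect response scores 0.0, which is exactly the long-horizon setting ES tolerates well.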
2. Behavioral Fine-Tuning: Conciseness
Fine-tuning LLMs to produce shorter, more concise responses to knowledge questions.
| Feature | Detail |
|---|---|
| Task format | Answer knowledge questions concisely |
| Reward type | Composite: correctness × conciseness penalty |
| Horizon | Full response generation |
| Dataset | 500 questions from knowledge benchmarks |
| Models tested | Qwen-2.5-7B-Instruct |
| Key observation | RL hacks the reward by degenerating to single-token answers; ES maintains coherent responses |
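A hedged sketch of what a composite reward of the shape in the table (correctness × conciseness penalty) could look like. The paper does not publish this exact formula; `target_len`, `scale`, and the exponential penalty are illustrative assumptions. The key structural point stands regardless: a wrong answer scores zero no matter how short it is, so a degenerate single-token response only pays off if it happens to remain correct.

```python
import math

def conciseness_reward(answer: str, is_correct: bool,
                       target_len: int = 20, scale: float = 40.0) -> float:
    """Hypothetical composite reward: correctness gated by a smooth
    length penalty. Wrong answers score 0 regardless of length."""
    if not is_correct:
        return 0.0
    n_tokens = len(answer.split())  # crude whitespace token count, for illustration
    # Penalty is 1.0 at or below target_len and decays exponentially beyond it.
    return math.exp(-max(0, n_tokens - target_len) / scale)
```

Under this shape, a correct one-word answer like "Paris" earns the full 1.0, while a correct but verbose 100-token answer earns roughly exp(-2) ≈ 0.14, which is the gradient RL exploits by collapsing length, sometimes past the point of coherence.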
3. Math Reasoning: GSM8K, MATH500, Minerva Math, OlympiadBench
Extended comparisons with SOTA RL methods:
| Benchmark | Type | Difficulty |
|---|---|---|
| GSM8K | Grade school math | Easy |
| MATH500 | Competition math | Medium–Hard |
| Minerva Math | Mathematical reasoning | Hard |
| OlympiadBench | Olympiad-level problems | Very Hard |
4. Puzzle Problem Solving
ES is applied to solve two puzzle problems that base LLMs struggle with:
| Puzzle | Description | ES Contribution |
|---|---|---|
| Number sequence puzzles | Find patterns in number sequences | ES fine-tuning enables discovery of solutions base models cannot find |
| Logic grid puzzles | Constraint satisfaction problems | ES-tuned models show improved systematic reasoning |
Solution Space Characterization
| Dimension | Value |
|---|---|
| Search space | Full parameter space of transformer LLMs (0.5B–8B parameters) |
| Solution representation | Continuous real-valued vectors (model weights) |
| Evaluation | Deterministic (greedy decoding) |
| Fitness function | Task-specific reward (binary correctness, composite scores) |
| Constraint handling | None explicit (reward function encodes constraints) |
5 LLM Integration
LLMs as Optimization Targets (Not Operators)
This paper uses LLMs fundamentally differently from systems like AlphaEvolve, FunSearch, or EvoPrompting. In those systems, the LLM is a tool that generates candidate solutions (code, programs). In this paper, the LLM is the optimization target — its parameters are the search space, and ES directly manipulates billions of floating-point values to improve task performance.
AlphaEvolve / FunSearch approach:
┌───────────────┐ ┌──────────────┐ ┌──────────────┐
│ LLM (frozen) │────►│ Generated │────►│ Evaluator │
│ generates code│ │ code/program │ │ scores code │
└───────────────┘ └──────────────┘ └──────────────┘
▲ │
└──────────── fitness feedback ────────────┘
This paper's approach (ES at Scale):
┌─────────────────────────────────────────────────────┐
│ LLM Parameters (θ) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Layer 1 │ │ Layer 2 │ │ Layer N │ ... │
│ │ weights │ │ weights │ │ weights │ (billions │
│ │ │ │ │ │ │ of params) │
│ └──────────┘ └──────────┘ └──────────┘ │
│ ↑ ↑ ↑ │
│ ε₁ ~ N(0,I) ε₂ ~ N(0,I) εₙ ~ N(0,I) │
│ (perturbation) │
└─────────────────────────────────────────────────────┘
│
▼
┌──────────────┐ ┌──────────────┐
│ Perturbed LLM│────►│ Reward R(θ+σε)│
│ generates │ │ (binary or │
│ responses │ │ composite) │
└──────────────┘ └──────────────┘
│
▼
θ ← θ + α · (1/N) · Σ Rₙ · εₙ (parameter update)
Models Evaluated
| Model Family | Model | Parameters | Architecture |
|---|---|---|---|
| Qwen-2.5 | Qwen-2.5-0.5B-Instruct | 0.5B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-1.5B-Instruct | 1.5B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-3B-Instruct | 3B | Transformer decoder |
| Qwen-2.5 | Qwen-2.5-7B-Instruct | 7B | Transformer decoder |
| Llama-3.2 | Llama-3.2-1B-Instruct | 1B | Transformer decoder |
| Llama-3.2 | Llama-3.2-3B-Instruct | 3B | Transformer decoder |
| Llama-3.1 | Llama-3.1-8B-Instruct | 8B | Transformer decoder |
Inference-Only Operation
A critical property of ES fine-tuning is that it requires only forward passes through the model:
| Operation | RL (PPO/GRPO) | ES (This Paper) |
|---|---|---|
| Forward pass | Yes | Yes |
| Backward pass (backprop) | Yes | No |
| Gradient computation | Yes | No |
| Optimizer state (Adam) | Yes | No |
| Activation caching | Yes | No |
| Reference model | Yes (for KL penalty) | No |
This has profound implications for GPU memory:
GPU Memory Layout — RL Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16) │ ~16 GB (8B) │
│ Gradient buffers │ ~16 GB │
│ Optimizer states (Adam: m, v) │ ~32 GB │
│ Activation cache (for backprop) │ ~8–32 GB │
│ Reference model (KL penalty) │ ~16 GB │
├────────────────────────────────────┼───────────────┤
│ TOTAL │ ~88–112 GB │
└────────────────────────────────────┴───────────────┘
GPU Memory Layout — ES Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16) │ ~16 GB (8B) │
│ Layer-sized noise tensor (temp) │ ~0.1–2 GB │
│ Random seeds (N integers) │ negligible │
├────────────────────────────────────┼───────────────┤
│ TOTAL │ ~16–18 GB │
└────────────────────────────────────┴───────────────┘
This ~5–6× memory reduction is a substantial practical advantage, enabling fine-tuning of larger models on smaller GPU clusters.
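The memory tables above can be reproduced with back-of-envelope arithmetic, assuming bf16 storage (2 bytes/param) for weights, gradients, and both Adam moment tensors, which is the assumption that matches the figures quoted:

```python
# Back-of-envelope reproduction of the memory layouts above for an
# 8B-parameter model. All tensors assumed bf16 (2 bytes per parameter),
# consistent with the quoted ~16 GB weights and ~32 GB Adam states.
params = 8e9
bf16 = 2  # bytes per parameter

weights   = params * bf16 / 1e9      # model weights            -> ~16 GB
grads     = params * bf16 / 1e9      # gradient buffers (RL)    -> ~16 GB
adam      = 2 * params * bf16 / 1e9  # Adam m and v (RL)        -> ~32 GB
reference = params * bf16 / 1e9      # frozen KL model (RL)     -> ~16 GB

rl_total = weights + grads + adam + reference  # + activations (~8-32 GB)
es_total = weights                             # + one layer-sized noise tensor

print(f"RL (before activations): ~{rl_total:.0f} GB")
print(f"ES:                      ~{es_total:.0f} GB")
```

Adding the 8–32 GB activation cache to the RL side recovers the ~88–112 GB total, against ~16–18 GB for ES.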
6 Key Results
Headline Result: Countdown Task (Table 1)
ES outperforms all RL baselines across all 7 tested models:
| Base Model | Params | Original | Best RL | ES | ES Δ vs Best RL |
|---|---|---|---|---|---|
| Qwen-2.5-0.5B-Instruct | 0.5B | 0.1% | 13.5% (Dr.GRPO) | 14.4% | +0.9 |
| Qwen-2.5-1.5B-Instruct | 1.5B | 0.7% | 31.0% (Dr.GRPO) | 37.3% | +6.3 |
| Qwen-2.5-3B-Instruct | 3B | 10.0% | 43.8% (Dr.GRPO) | 60.5% | +16.7 |
| Qwen-2.5-7B-Instruct | 7B | 31.2% | 57.5% (Dr.GRPO) | 66.8% | +9.3 |
| Llama-3.2-1B-Instruct | 1B | 0.4% | 14.9% (GRPO-v) | 16.8% | +1.9 |
| Llama-3.2-3B-Instruct | 3B | 3.2% | 47.8% (Dr.GRPO) | 51.6% | +3.8 |
| Llama-3.1-8B-Instruct | 8B | 8.1% | 51.3% (GRPO-z 30) | 61.2% | +9.9 |
Key observations:
- ES wins on every model. Not a single RL variant beats ES on any model size or family.
- Largest gains on mid-sized models. Qwen-2.5-3B shows the most dramatic improvement: 60.5% vs 43.8%, a +16.7 absolute improvement.
- Single hyperparameter set. ES uses the same hyperparameters (N=30, σ=0.001, α=5×10⁻⁴) for ALL models. RL requires per-model hyperparameter sweeps.
- Cross-family robustness. ES works on both Qwen and Llama families. RL performance varies significantly across families.
Reward Hacking Analysis (Conciseness Task)
The conciseness task reveals a qualitative difference between ES and RL:
| Method | Reward Score | Actual Behavior | Explanation |
|---|---|---|---|
| GRPO | Very high | Degenerates to single-token or very short incoherent answers | Hacks the conciseness reward by minimizing length at the expense of content |
| ES | High | Produces genuinely concise but coherent and correct answers | Optimizes the distribution, making extreme reward-hacking behaviors unlikely |
The paper explains: "ES optimizes a solution distribution (the perturbation cloud), which is more difficult to hack, while RL optimizes a single solution."
This is a fundamental insight. RL fine-tuning produces a single policy that can find and exploit reward function loopholes. ES produces a distribution of nearby policies — for any reward hack to persist, it must be robust to Gaussian perturbation of all parameters, which is much harder for degenerate behaviors to achieve.
Cross-Run Stability
ES shows dramatically lower variance across independent runs:
| Method | Mean Accuracy | Std Dev Across Runs | Interpretation |
|---|---|---|---|
| GRPO | ~40% | ±5–10% | High variance; some runs fail entirely |
| Dr.GRPO | ~45% | ±3–7% | Moderate variance; improved but not stable |
| ES | ~55% | ±1–3% | Low variance; consistent results |
This stability has practical cost implications. If RL requires 3–5 runs to find a good hyperparameter configuration plus 2–3 runs for reliability, and ES requires a single run, the total compute cost may be lower for ES despite its per-iteration expense.
Math Reasoning Benchmarks (Extended Results)
ES is compared against additional SOTA RL baselines on math reasoning:
| Benchmark | GRPO | Dr.GRPO | DAPO | ES | ES Rank |
|---|---|---|---|---|---|
| GSM8K | Competitive | Competitive | Competitive | Competitive | Top-2 |
| MATH500 | Competitive | Competitive | Competitive | Competitive | Top-2 |
| Minerva Math | — | — | — | Competitive | Top-3 |
| OlympiadBench | — | — | — | Competitive | Top-3 |
On these standard math benchmarks, ES performs comparably to the best RL methods, demonstrating that the advantages of ES (robustness, stability, no reward hacking) do not come at the cost of reduced performance on well-studied tasks.
Hyperparameter Sensitivity
| Hyperparameter | ES Sensitivity | RL Sensitivity | Implication |
|---|---|---|---|
| Learning rate (α) | Low | Very high | RL requires careful per-model tuning |
| Population/group size (N) | Low | Moderate | ES works with N=30; RL performance varies with group size |
| Noise scale (σ) | Moderate | N/A | ES-specific; σ=0.001 works across models |
| KL penalty (β) | N/A | Very high | RL-specific; wrong β causes training collapse or reward hacking |
The paper reports that a single ES configuration works across all 7 models, while RL requires separate hyperparameter sweeps for each model — a significant practical advantage.
7 Reproducibility
Code Availability
The complete source code is available at github.com/VsonicV/es-fine-tuning-paper:
| File | Purpose |
|---|---|
| `es_fine-tuning_conciseness.py` | ES fine-tuning for conciseness task (correlated noise) |
| `es_fine-tuning_conciseness_iid.py` | ES fine-tuning for conciseness task (i.i.d. noise) |
| `countdown/es_fine-tuning_countdown.py` | ES fine-tuning for Countdown task (correlated noise) |
| `countdown/es_fine-tuning_countdown_iid.py` | ES fine-tuning for Countdown task (i.i.d. noise) |
| `es_fine-tuning_countdown_accl.py` | Accelerated version with 10× speed-up (vLLM-based) |
| `requirement.txt` | Python dependencies |
Setup and Execution
# Environment setup
python -m venv es
source es/bin/activate
pip install -r requirement.txt
# Conciseness fine-tuning (2 GPUs)
accelerate launch \
--num_processes 2 \
--num_machines 1 \
--machine_rank 0 \
es_fine-tuning_conciseness.py \
--gpu_threads=1 \
--model_name=Qwen/Qwen2.5-7B-Instruct
# Countdown fine-tuning (4 GPUs)
accelerate launch \
--num_processes 4 \
--num_machines 1 \
--machine_rank 0 \
countdown/es_fine-tuning_countdown.py \
--data_sample 200 \
--model_name Qwen/Qwen2.5-3B-Instruct \
--gpu_threads 1
# Accelerated version with vLLM (4 GPUs)
python es_fine-tuning_countdown_accl.py \
--model_name Qwen/Qwen2.5-3B-Instruct \
--cuda_devices 0,1,2,3 \
--num_engines 4 \
--population_size 30 \
--num_iterations 1000
Reproducibility Assessment
| Criterion | Assessment | Notes |
|---|---|---|
| Code available | Yes | Full source code on GitHub |
| Data available | Yes | Standard public benchmarks (Countdown, GSM8K, MATH) |
| Models available | Yes | Public HuggingFace models (Qwen, Llama) |
| Fixed hyperparameters | Yes | N=30, σ=0.001, α=5×10⁻⁴ for all ES experiments |
| Random seed control | Partial | Seeds used for noise generation but not all sources of randomness documented |
| Hardware requirements | Moderate | 2–4 GPUs (80GB each) for most experiments |
| Accelerated version | Yes | 10× speed-up using vLLM for inference |
| RL baselines | Partial | RL implementation details referenced but separate hyperparameter sweeps needed |
Noise Variants
The repository provides two noise implementations:
| Variant | File | Description |
|---|---|---|
| Correlated noise | `es_fine-tuning_*.py` | Partially correlated noise across dimensions (original paper implementation) |
| i.i.d. noise | `es_fine-tuning_*_iid.py` | Independent noise in each parameter dimension |
The discussion at github.com/VsonicV/es-fine-tuning-paper/discussions/7 provides additional details on the difference. Both variants achieve similar results.
8 Compute and API Costs
Hardware Requirements
| Configuration | GPUs | GPU Memory | Purpose |
|---|---|---|---|
| Minimum (small models) | 1–2 × A100/H100 (80GB) | 80–160 GB total | Qwen-0.5B, Qwen-1.5B |
| Standard (medium models) | 4 × A100/H100 (80GB) | 320 GB total | Qwen-3B, Llama-3B |
| Full (large models) | 4–8 × H100 (80GB) | 320–640 GB total | Qwen-7B, Llama-8B |
Per-Iteration Cost Analysis
Each ES iteration involves N=30 perturbed model evaluations:
Per-iteration compute:
┌────────────────────────────────────────────────────────┐
│ 1. Generate N=30 noise seeds │ ~0 sec │
│ 2. For each of N=30 perturbations: │ │
│ a. Perturb model (layer-by-layer) │ ~2 sec │
│ b. Run inference on training batch │ ~10 sec │
│ c. Compute reward │ ~1 sec │
│ d. Restore model │ ~2 sec │
│ 3. Normalize rewards (z-score) │ ~0 sec │
│ 4. Aggregate update (layer × seed) │ ~5 sec │
├────────────────────────────────────────────────────────┤
│ Total per iteration (serial, 4 GPUs): │ ~2 min │
│ Total per iteration (accelerated, vLLM): │ ~12 sec │
└────────────────────────────────────────────────────────┘
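The ~2 min per-iteration figure follows directly from the step timings in the box above; a quick sanity check of the arithmetic (assuming the 30 evaluations split evenly across 4 GPUs):

```python
# Sanity check of the per-iteration wall-clock estimate above:
# 30 perturbation evaluations at ~15 s each (perturb + inference +
# reward + restore), spread over 4 GPUs, plus ~5 s for the update.
n_perturbations = 30
per_eval_s = 2 + 10 + 1 + 2  # perturb, inference, reward, restore
update_s = 5
gpus = 4

wall_s = n_perturbations * per_eval_s / gpus + update_s
print(f"~{wall_s / 60:.1f} min per iteration")
```

This lands at roughly two minutes, matching the serial figure; the vLLM-accelerated path compresses the dominant inference term to reach ~12 s.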
Total Training Cost Estimation
For a typical Countdown experiment (1000 iterations, 4 × H100):
| Version | Time per Iter | Total Time | GPU Hours | Est. Cloud Cost |
|---|---|---|---|---|
| Original | ~2 min | ~33 hours | ~132 H100-hrs | ~$400–$500 |
| Accelerated (10×) | ~12 sec | ~3.3 hours | ~13 H100-hrs | ~$40–$50 |
Cost Comparison with RL
| Method | GPU Memory | Training Time | Hyperparameter Tuning | Total Cost |
|---|---|---|---|---|
| PPO | ~100 GB (8B model) | ~8–24 hours | 3–5 sweeps needed | $500–$2,000 |
| GRPO | ~80 GB (8B model) | ~8–24 hours | 3–5 sweeps needed | $400–$1,500 |
| ES (original) | ~18 GB (8B model) | ~33 hours | 1 config for all models | $400–$500 |
| ES (accelerated) | ~18 GB (8B model) | ~3.3 hours | 1 config for all models | $40–$50 |
The accelerated ES version is dramatically cheaper than RL alternatives, primarily because:
1. No hyperparameter sweeps are needed (one configuration works for all models)
2. The lower memory footprint enables smaller/fewer GPUs
3. vLLM-based inference is highly optimized
4. There is no backpropagation overhead
Scaling Properties
| Parameter | Cost Impact | Scaling |
|---|---|---|
| Model size (d) | Linear | Larger models = longer inference |
| Population size (N) | Linear | More perturbations = more evaluations |
| Iterations (T) | Linear | More iterations = longer training |
| GPU count | Inverse linear | Parallelism reduces wall time |
| Training data | Sublinear | Batch size matters, not dataset size |
9 Architecture Solution
Algorithm Architecture
The ES fine-tuning system has a clean, modular architecture:
┌─────────────────────────────────────────────────────────────────┐
│ ES FINE-TUNING ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ OUTER LOOP (T iterations) │ │
│ │ │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ INNER LOOP (N perturbations) │ │ │
│ │ │ │ │ │
│ │ │ For n = 1 to N (parallelizable): │ │ │
│ │ │ ┌─────────────────────────────────────────────┐ │ │ │
│ │ │ │ 1. Sample seed sₙ │ │ │ │
│ │ │ │ 2. Perturb θ in-place (layer by layer): │ │ │ │
│ │ │ │ For each layer ℓ: │ │ │ │
│ │ │ │ εₗ = generate_noise(sₙ, ℓ) │ │ │ │
│ │ │ │ θₗ += σ · εₗ │ │ │ │
│ │ │ │ 3. Generate responses (greedy decoding) │ │ │ │
│ │ │ │ 4. Compute reward Rₙ = R(responses) │ │ │ │
│ │ │ │ 5. Restore θ in-place (layer by layer): │ │ │ │
│ │ │ │ For each layer ℓ: │ │ │ │
│ │ │ │ εₗ = generate_noise(sₙ, ℓ) │ │ │ │
│ │ │ │ θₗ -= σ · εₗ │ │ │ │
│ │ │ └─────────────────────────────────────────────┘ │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Normalize: R̃ₙ = (Rₙ - mean(R)) / std(R) │ │
│ │ │ │
│ │ Update (decomposed, layer by layer, seed by seed): │ │
│ │ For each layer ℓ: │ │
│ │ For each seed sₙ: │ │
│ │ εₗ = generate_noise(sₙ, ℓ) │ │
│ │ θₗ += α · (1/N) · R̃ₙ · εₗ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ EVALUATION │ │
│ │ • Periodic evaluation on held-out test set │ │
│ │ • Checkpoint best-performing model │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Parallelization Strategy
The inner loop (N perturbations) is embarrassingly parallel — each perturbation evaluation is independent:
GPU 0 GPU 1 GPU 2 GPU 3
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Model copy │ │ Model copy │ │ Model copy │ │ Model copy │
│ │ │ │ │ │ │ │
│ Perturb s₁ │ │ Perturb s₂ │ │ Perturb s₃ │ │ Perturb s₄ │
│ Evaluate R₁ │ │ Evaluate R₂ │ │ Evaluate R₃ │ │ Evaluate R₄ │
│ Restore │ │ Restore │ │ Restore │ │ Restore │
│ │ │ │ │ │ │ │
│ Perturb s₅ │ │ Perturb s₆ │ │ Perturb s₇ │ │ Perturb s₈ │
│ Evaluate R₅ │ │ Evaluate R₆ │ │ Evaluate R₇ │ │ Evaluate R₈ │
│ Restore │ │ Restore │ │ Restore │ │ Restore │
│ ... │ │ ... │ │ ... │ │ ... │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │ │
└─────────────┬───────┴─────────────┬───────┘ │
│ │ │
└─────── All-gather rewards ────────────────────────┘
│
▼
Normalize + Aggregate Update
(decomposed across layers × seeds)
With gpu_threads > 1, each GPU can evaluate multiple perturbations concurrently using separate CUDA streams, further increasing parallelism.
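The seed-to-worker dispatch in the diagram can be sketched minimally as follows. `evaluate_with_seed` stands in for the full perturb → generate → score → restore cycle on one GPU; all names here are illustrative, not the repository's API.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_seeds(seeds, n_workers):
    """Round-robin split: worker k gets seeds[k], seeds[k + n_workers], ..."""
    return [seeds[k::n_workers] for k in range(n_workers)]

def run_iteration(seeds, n_workers, evaluate_with_seed):
    """Evaluate all perturbation seeds in parallel and gather rewards.

    Each evaluation is independent (embarrassingly parallel), so workers
    never need to communicate until the final all-gather.
    """
    shards = assign_seeds(seeds, n_workers)
    rewards = {}

    def worker(shard):
        for s in shard:
            rewards[s] = evaluate_with_seed(s)  # perturb, score, restore

    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        list(ex.map(worker, shards))  # force completion of all shards
    return [rewards[s] for s in seeds]  # "all-gather", in seed order
```

With N=30 and 4 workers, each worker evaluates 7 or 8 seeds in sequence, which is exactly the column-per-GPU picture above.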
Accelerated Architecture (vLLM)
The accelerated version replaces the standard HuggingFace inference with vLLM engines:
┌─────────────────────────────────────────────────────────────────┐
│ ACCELERATED ES ARCHITECTURE │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ vLLM Inference Engines (per GPU) │ │
│ │ │ │
│ │ Engine 0 (GPU 0) Engine 1 (GPU 1) ... Engine K │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ │
│ │ │ Continuous │ │ Continuous │ │ Continuous │ │ │
│ │ │ batching │ │ batching │ │ batching │ │ │
│ │ │ PagedAttention│ │ PagedAttention│ │ PagedAttn │ │ │
│ │ │ KV cache │ │ KV cache │ │ KV cache │ │ │
│ │ └──────────────┘ └──────────────┘ └────────────┘ │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Weight Perturbation Manager │ │
│ │ │ │
│ │ • Perturbs vLLM model weights in-place │ │
│ │ • Regenerates KV cache after perturbation │ │
│ │ • Coordinates across engines │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ TensorBoard Logging │ │
│ │ │ │
│ │ • Training curves (reward, accuracy) │ │
│ │ • Per-iteration statistics │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
The key insight: vLLM's continuous batching and PagedAttention make inference dramatically faster than standard HuggingFace generation, yielding the reported 10× speed-up.
10 Component Breakdown
Core Algorithm Components
| Component | Function | Implementation Detail |
|---|---|---|
| Noise Generator | Produces Gaussian perturbations for model parameters | Uses PyTorch RNG with stored seeds; regenerates identical noise from seeds alone |
| Layer-Level Perturbation | Perturbs and restores model weights in-place | Iterates over model layers; temporarily allocates one layer-sized tensor |
| Reward Evaluator | Computes task-specific reward for perturbed model | Greedy decoding → parse response → compute binary/composite reward |
| Reward Normalizer | Z-score normalization within iteration | R̃ = (R - mean(R)) / std(R) ensures consistent scale across iterations |
| Decomposed Updater | Applies aggregated parameter update | Triple-nested loop: layers × seeds × parameters; minimal peak memory |
| Parallelization Manager | Distributes perturbation evaluations across GPUs | Hugging Face Accelerate (original) or custom vLLM dispatcher (accelerated) |

What Is Deliberately Excluded
The paper is notable for what it removes from the standard ES toolbox:
| Standard ES Enhancement | Included? | Rationale |
|---|---|---|
| Rank transformation of rewards | No | Isolates core algorithm performance |
| Mirrored sampling (antithetic pairs) | No | Simplifies implementation |
| Weight decay | No | Avoids interference with analysis |
| Virtual batch normalization | No | Not applicable to LLM architecture |
| Adam optimizer for update | No | Uses simple SGD-style update |
| CMA (covariance adaptation) | No | Full covariance intractable at billion-parameter scale |
"This design choice isolates the core ES algorithm and demonstrates that strong performance can be achieved without auxiliary enhancements. In future work, each individual enhancement can be explored to further improve performance."
This minimalism is a strength of the paper — it demonstrates that raw ES (without engineering tricks) outperforms heavily tuned RL (with tricks and per-model hyperparameter sweeps).
Seven Implementation Innovations
The paper introduces seven implementation modifications that enable billion-parameter ES:
| # | Innovation | Problem Solved | Memory Impact |
|---|---|---|---|
| 1 | Noise retrieval with random seeds | Storing N noise vectors of d dimensions is prohibitive | Store N integers instead of N×d floats |
| 2 | Parallel evaluations | Sequential evaluation of N perturbations is slow | Assign seeds to GPUs; embarrassingly parallel |
| 3 | Layer-level in-place perturbation | Allocating a full noise tensor (billions of floats) is prohibitive | Only one layer-sized tensor in memory at a time |
| 4 | Reward normalization | Reward scales vary across iterations and tasks | Z-score ensures consistent gradient magnitudes |
| 5 | Greedy decoding | Stochastic decoding confounds parameter-space and action-space exploration | Deterministic evaluation isolates parameter-space effects |
| 6 | Decomposed parameter update | Aggregating updates requires materializing full-size tensors | Layer × seed decomposition minimizes peak memory |
| 7 | Learning rate digestion | σ in the update equation adds a redundant hyperparameter | Absorb 1/σ into α, simplifying tuning |
11 Core Mechanisms (Detailed)
Mechanism 1: Natural Evolution Strategies (Simplified)
The theoretical foundation is Natural Evolution Strategies (NES), which views the optimization as operating on the search distribution rather than individual solutions.
Standard NES formulation:
The objective is to maximize the expected reward under a parameterized search distribution $\pi_\psi$:

$$J(\psi) = \mathbb{E}_{\theta \sim \pi_\psi}[R(\theta)]$$

For a Gaussian search distribution $\pi_\psi = \mathcal{N}(\mu, \Sigma)$, the natural gradient update on $\mu$ (with fixed $\Sigma = \sigma^2 I$) reduces to:

$$\mu \leftarrow \mu + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \cdot \varepsilon_n$$

where $\varepsilon_n \sim \mathcal{N}(0, I)$ and $R_n = R(\mu + \sigma \cdot \varepsilon_n)$.
Simplifications in this paper:
| Standard NES | This Paper | Effect |
|---|---|---|
| Full covariance Σ | Fixed Σ = σ²I | Eliminates O(d²) covariance estimation |
| Natural gradient on Σ | No adaptation of Σ | σ is fixed hyperparameter |
| Large population (10,000+) | N = 30 | Dramatically reduces per-iteration cost |
| 1/σ in update equation | Absorbed into α | One fewer hyperparameter |
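The simplified update can be exercised end-to-end on a toy objective. Here a quadratic "reward" stands in for LLM evaluation; N=30 and the z-score reward normalization follow the paper, while d, σ, and α are toy values chosen for the small problem, not the paper's settings.

```python
import numpy as np

def reward(theta):
    """Stand-in for R(theta): a quadratic maximized at theta = 3."""
    return -np.sum((theta - 3.0) ** 2)

rng = np.random.default_rng(0)
d, N = 20, 30              # toy dimensionality; paper's population size
sigma, alpha = 0.1, 0.05   # toy noise scale and learning rate
mu = np.zeros(d)

for _ in range(800):
    eps = rng.standard_normal((N, d))                 # eps_n ~ N(0, I)
    R = np.array([reward(mu + sigma * e) for e in eps])
    R_tilde = (R - R.mean()) / (R.std() + 1e-8)       # z-score normalization
    mu += alpha * (R_tilde @ eps) / N                 # simplified NES update

print(np.round(np.abs(mu - 3.0).max(), 2))
```

Despite a population far smaller than the dimensionality would classically suggest, the mean converges to the optimum, which is the small-population phenomenon the paper demonstrates at billion-parameter scale.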
Mechanism 2: Memory-Efficient Perturbation via Seed-Based Noise
The most critical engineering innovation is the use of random seeds to represent perturbation noise implicitly:
Traditional approach (infeasible):
Store N noise vectors, each of dimension d
Memory: N × d × 4 bytes (float32)
For d = 8B, N = 30: 30 × 8×10⁹ × 4 = 960 GB ← IMPOSSIBLE
Seed-based approach (this paper):
Store N random seeds (integers)
Memory: N × 8 bytes (int64) = 240 bytes ← TRIVIAL
To apply the perturbation for seed sₙ (sketch):

rng = torch.Generator().manual_seed(s_n)
for param in model.parameters():
    eps = torch.randn(param.shape, generator=rng)
    param.data += sigma * eps
    del eps  # only one layer-sized tensor alive at a time

To restore:

rng = torch.Generator().manual_seed(s_n)  # SAME seed!
for param in model.parameters():
    eps = torch.randn(param.shape, generator=rng)
    param.data -= sigma * eps  # subtract the identical noise to restore
    del eps
The key insight is that torch.randn with a fixed seed produces identical noise each time. By storing only the seed, the system can regenerate the exact noise for:
1. Applying the perturbation (add σ·ε)
2. Restoring the original parameters (subtract σ·ε)
3. Computing the parameter update (weight by R̃ₙ)
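The round trip can be demonstrated concretely, with NumPy's seeded generator standing in for `torch.Generator` (layer shapes and σ are illustrative):

```python
import numpy as np

sigma, seed = 0.01, 1234
layers = [np.zeros((4, 4)), np.zeros(8)]      # toy per-layer parameters
originals = [l.copy() for l in layers]

# Apply: regenerate the noise from the seed, one layer at a time
rng = np.random.default_rng(seed)
for l in layers:
    l += sigma * rng.standard_normal(l.shape)

# Restore: the SAME seed replays the SAME noise stream
rng = np.random.default_rng(seed)
for l in layers:
    l -= sigma * rng.standard_normal(l.shape)

for l, orig in zip(layers, originals):
    assert np.allclose(l, orig)               # back to the original weights
```

Note that restoration is exact up to floating-point rounding; the 240 bytes of stored seeds replace what would otherwise be hundreds of gigabytes of noise tensors.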
Mechanism 3: Decomposed Parameter Update
The standard update equation requires materializing the full update vector:
Standard: Δθ = α · (1/N) · Σₙ R̃ₙ · εₙ
This requires materializing Σₙ R̃ₙ · εₙ, which is d-dimensional.
For d = 8B: 32 GB in float32.
Decomposed (this paper):
For each layer ℓ:
For each seed sₙ:
εₗ = generate_noise(sₙ, ℓ)
θₗ += α · (1/N) · R̃ₙ · εₗ
del εₗ
Peak memory: one layer-sized tensor (max ~2 GB for largest layers)
This decomposition exploits the linearity of addition: the order of summation doesn't matter, so we can accumulate the update layer-by-layer and seed-by-seed, never materializing the full d-dimensional update vector.
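This equivalence is easy to verify on toy shapes. The NumPy sketch below (illustrative sizes and rewards) mirrors the skip-ahead regeneration described above: `layer_noise` replays the seeded stream past earlier layers to recover exactly the noise for layer `i`:

```python
import numpy as np

N = 4                                        # toy population (paper uses N = 30)
shapes = [(3, 3), (5,)]                      # two toy "layers"
seeds = list(range(N))
rewards = np.array([1.0, -1.0, 0.5, -0.5])   # already z-scored
alpha = 0.1

def layer_noise(seed, shapes, i):
    """Regenerate the noise for layer i by replaying the seeded stream."""
    rng = np.random.default_rng(seed)
    for j in range(i):                       # advance past earlier layers
        rng.standard_normal(shapes[j])
    return rng.standard_normal(shapes[i])

# Reference: full update, all noise tensors weighted and summed at once
full = [alpha / N * sum(r * layer_noise(s, shapes, i) for s, r in zip(seeds, rewards))
        for i in range(len(shapes))]

# Decomposed: only one layer-sized tensor alive at any moment
layers = [np.zeros(s) for s in shapes]
for i, layer in enumerate(layers):
    for s, r in zip(seeds, rewards):
        eps = layer_noise(s, shapes, i)      # regenerate, use, discard
        layer += alpha * r * eps / N

for got, want in zip(layers, full):
    assert np.allclose(got, want)            # linearity: summation order doesn't matter
```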
Mechanism 4: Greedy Decoding for Deterministic Evaluation
The use of greedy decoding during evaluation is a deliberate methodological choice:
With stochastic decoding (temperature > 0):
Same model parameters → different responses each time
Source of variation: parameter perturbation + sampling randomness
Cannot attribute performance difference to parameter change alone
With greedy decoding (temperature = 0):
Same model parameters → identical response every time
Source of variation: parameter perturbation only
Clean attribution: any performance difference is due to the perturbation
This is analogous to controlled experiments in science — by eliminating one source of variation (decoding randomness), the system can attribute performance differences purely to the parameter perturbation, enabling a cleaner gradient estimate.
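The distinction can be illustrated with a toy next-token distribution (NumPy sketch; the logits and vocabulary are made up):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])       # toy next-token scores

def greedy(logits):
    return int(np.argmax(logits))             # temperature = 0: always the top token

def sample(logits, rng, temperature=1.0):
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))  # temperature > 0: random draw

rng = np.random.default_rng(0)
greedy_tokens = {greedy(logits) for _ in range(100)}
sampled_tokens = {sample(logits, rng) for _ in range(100)}
assert len(greedy_tokens) == 1                # same parameters -> same output
assert len(sampled_tokens) > 1                # sampling adds its own variance
```

Under greedy decoding, any change in the output can only come from the parameter perturbation, which is exactly what the reward must attribute.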
Mechanism 5: Z-Score Reward Normalization
Reward normalization is critical for stable training:
Raw rewards (iteration t): [0.0, 0.0, 1.0, 0.0, ..., 1.0] (binary)
Mean: 0.2, Std: 0.4
Normalized: [-0.5, -0.5, 2.0, -0.5, ..., 2.0]
Effect on update:
Perturbations that led to correct answers (R̃ > 0) are reinforced
Perturbations that led to incorrect answers (R̃ < 0) are suppressed
The magnitude of reinforcement/suppression is proportional to
how far above/below average the reward is
Without normalization, the update magnitude would depend on the absolute reward scale, which varies across tasks and training stages. Z-score normalization ensures consistent gradient magnitudes, removing a potential source of training instability.
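The worked numbers above can be reproduced directly (the population standard deviation, NumPy's default, gives exactly 0.4 here):

```python
import numpy as np

# Binary rewards for 10 rollouts, 2 correct: mean 0.2, population std 0.4
rewards = np.array([0., 0., 1., 0., 0., 0., 0., 0., 0., 1.])
normalized = (rewards - rewards.mean()) / rewards.std()
print(normalized)  # incorrect answers -> -0.5, correct answers -> 2.0
```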
Mechanism 6: Why Small Populations Work
The most surprising finding is that N=30 suffices for billion-parameter spaces. The paper does not provide a full theoretical explanation, but the results suggest several hypotheses:
1. Effective dimensionality is much lower than nominal dimensionality. LLM parameter spaces are highly structured — most directions in parameter space have minimal impact on output. The 30 perturbations may explore the effective subspace efficiently.
2. The pre-trained model provides a strong initialization. ES is not searching from scratch — it is fine-tuning from a pre-trained model that already has the right structure. Small perturbations around this initialization are sufficient to discover improved parameters.
3. Response-level reward provides a strong signal. For binary rewards (correct/incorrect), even a small population can identify which perturbations are beneficial because the reward signal is clear and unambiguous.
4. Greedy decoding amplifies perturbation effects. Without sampling noise, even small parameter changes produce detectable output differences, making each perturbation informative.
12 Programming Language
Implementation Stack
| Component | Language | Framework | Notes |
|---|---|---|---|
| ES algorithm | Python 3.10+ | PyTorch | Core training loop, noise generation, update |
| Parallelization | Python | Hugging Face Accelerate | Multi-GPU distribution (original version) |
| Accelerated inference | Python | vLLM | High-throughput inference engine (accelerated version) |
| Reward computation | Python | Custom | Task-specific reward functions |
| Logging | Python | TensorBoard | Training curves and per-iteration stats (accelerated) |
| Model loading | Python | Hugging Face Transformers | Loading pre-trained LLMs |
Code Structure
es-fine-tuning-paper/
├── es_fine-tuning_conciseness.py # Conciseness task (correlated noise)
├── es_fine-tuning_conciseness_iid.py # Conciseness task (i.i.d. noise)
├── es_fine-tuning_countdown_accl.py # Countdown (accelerated, vLLM)
├── countdown/
│ ├── es_fine-tuning_countdown.py # Countdown task (correlated noise)
│ └── es_fine-tuning_countdown_iid.py # Countdown task (i.i.d. noise)
└── requirement.txt # Dependencies
Dependency Analysis
Key dependencies include:
| Package | Purpose | Version Constraint |
|---|---|---|
| `torch` | Tensor operations, noise generation, model manipulation | ≥ 2.0 |
| `transformers` | Model loading, tokenization | Recent |
| `accelerate` | Multi-GPU distribution | Recent |
| `vllm` | High-throughput inference (accelerated version) | 0.11.0 |
| `tensorboard` | Training visualization (accelerated version) | Any |
Code Complexity
The implementation is notably compact:
| File | Estimated LOC | Complexity |
|---|---|---|
| `es_fine-tuning_conciseness.py` | ~300–500 | Medium |
| `countdown/es_fine-tuning_countdown.py` | ~300–500 | Medium |
| `es_fine-tuning_countdown_accl.py` | ~500–800 | Medium–High (vLLM integration) |
| Total | ~1,100–1,800 | Medium |
This compactness is a strength — the entire ES fine-tuning pipeline fits in a single readable file, making the algorithm transparent and easy to modify.
Why Python?
Python is the obvious choice for several reasons:
1. PyTorch native. All tensor operations, GPU management, and model manipulation use PyTorch.
2. Hugging Face ecosystem. Model loading, tokenization, and the Accelerate library are Python-native.
3. vLLM. The accelerated version uses vLLM, which is a Python library.
4. Research community. Python is the lingua franca of ML research.
The per-iteration overhead of Python is negligible compared to the GPU computation time for inference and noise generation.
13 Memory Management
GPU Memory Analysis
The paper's most significant practical contribution is its memory efficiency. The analysis below compares ES and RL memory requirements for an 8B-parameter model in bf16 precision:
RL Fine-Tuning (PPO) Memory Breakdown:
| Component | Size (GB) | Notes |
|---|---|---|
| Model parameters (bf16) | 16 | 8B × 2 bytes |
| Gradient buffers (fp32) | 32 | 8B × 4 bytes (full precision gradients) |
| Adam optimizer states (m, v) | 64 | 2 × 8B × 4 bytes |
| Activation cache | 8–32 | Depends on batch size, sequence length |
| Reference model (KL penalty) | 16 | Full copy of base model |
| Total | 136–160 | Requires 2–4 × A100-80GB |
ES Fine-Tuning Memory Breakdown:
| Component | Size (GB) | Notes |
|---|---|---|
| Model parameters (bf16) | 16 | 8B × 2 bytes |
| Layer-sized noise tensor (temp) | 0.1–2 | Largest single layer; allocated/freed per layer |
| Random seeds | < 0.001 | 30 integers |
| Inference KV cache | 2–4 | For greedy decoding |
| Total | 18–22 | Fits on 1 × A100-80GB |
Memory reduction: ~7–8×
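The totals in both tables follow from simple per-component arithmetic. A back-of-envelope sketch (GB ≈ 10⁹ bytes; batch-dependent activations and KV cache are noted in comments rather than modeled):

```python
# Component sizes mirror the breakdown tables above, not measured values.
def rl_memory_gb(params_b):
    model = params_b * 2       # bf16 parameters (2 bytes each)
    grads = params_b * 4       # fp32 gradients
    adam = params_b * 8        # two fp32 moment buffers (m, v)
    reference = params_b * 2   # frozen bf16 copy for the KL penalty
    return model + grads + adam + reference

def es_memory_gb(params_b, largest_layer_gb=2.0):
    # model + one transient layer-sized noise tensor; seeds are negligible
    return params_b * 2 + largest_layer_gb

print(rl_memory_gb(8))  # 128 GB; + 8-32 GB activations -> the 136-160 GB in the table
print(es_memory_gb(8))  # 18 GB; + 2-4 GB KV cache -> the 18-22 GB range in the table
```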
Memory-Efficient Operations
Three operations dominate the memory management strategy:
1. In-Place Perturbation:
# Perturb model in-place (layer by layer)
rng = torch.Generator().manual_seed(seed)
for layer in model.layers:
    noise = torch.randn(layer.shape, generator=rng, device=device)
    layer.data.add_(noise, alpha=sigma)  # in-place!
    del noise  # free immediately
2. In-Place Restoration:
# Restore model in-place (regenerate the same noise)
rng = torch.Generator().manual_seed(seed)  # reset to the SAME seed
for layer in model.layers:
    noise = torch.randn(layer.shape, generator=rng, device=device)
    layer.data.sub_(noise, alpha=sigma)  # subtract = restore
    del noise
3. Decomposed Update:
# Update in-place (layer by layer, seed by seed)
for layer_idx, layer in enumerate(model.layers):
    for seed, reward in zip(seeds, normalized_rewards):
        rng = torch.Generator().manual_seed(seed)
        # advance the noise stream past earlier layers
        for skip in range(layer_idx):
            torch.randn(model.layers[skip].shape, generator=rng)
        noise = torch.randn(layer.shape, generator=rng, device=device)
        layer.data.add_(noise, alpha=lr * reward / N)  # in-place!
        del noise
Memory Scaling with Model Size
| Model | Parameters | RL Memory | ES Memory | Savings |
|---|---|---|---|---|
| 0.5B | 0.5B | ~12 GB | ~2 GB | 6× |
| 1.5B | 1.5B | ~25 GB | ~4 GB | 6× |
| 3B | 3B | ~48 GB | ~8 GB | 6× |
| 7B | 7B | ~110 GB | ~16 GB | 7× |
| 8B | 8B | ~140 GB | ~18 GB | 8× |
The memory savings scale better for larger models because RL's gradient and optimizer state overhead grows with parameter count, while ES's overhead remains constant (one layer-sized tensor).
Comparison with Memory-Efficient RL Alternatives
| Method | Memory (8B model) | Fine-Tuning Quality | Parameter Coverage |
|---|---|---|---|
| Full RL (PPO/GRPO) | ~140 GB | Baseline | Full parameters |
| LoRA RL | ~30–40 GB | Slightly reduced | Low-rank adapters only |
| QLoRA RL | ~20–30 GB | Reduced | Quantized + low-rank |
| MeZO (zeroth-order) | ~20 GB | Poor (below baselines) | Full parameters |
| ES (this paper) | ~18 GB | Exceeds full RL | Full parameters |
ES achieves the lowest memory footprint while maintaining the highest fine-tuning quality — a combination that no prior method achieved.
14 Continued Learning
Within-Run Learning Dynamics
The ES optimization trajectory exhibits distinctive learning dynamics compared to RL:
RL learning curve (typical):
Accuracy
│ ┌─── plateau / oscillation ───┐
│ / \____
│ / \── reward hacking begins
│ /
│ /
│ / rapid initial improvement
│ /
│ /
└─────────────────────────────────────── Iterations
ES learning curve (typical):
Accuracy
│ ┌──── continued gradual improvement
│ /
│ /
│ /
│ /
│ /
│ /
│ /
│ /
│ / steady, monotonic improvement
│ /
│ /
│ /
└─────────────────────────────────────── Iterations
ES exhibits slower but more steady improvement without the plateaus, oscillations, or reward hacking that characterize RL fine-tuning. The learning curve is more predictable, enabling better estimation of required compute budgets.
Population Dynamics
Unlike population-based evolutionary systems (AlphaEvolve, OpenEvolve), ES in this paper maintains a single model with a distribution around it:
Iteration t: Iteration t+1:
(center shifted toward high-reward direction)
· · · ·
· · · · · · · ·
· · θₜ · · ────► · · θₜ₊₁ · ·
· · · · · · · ·
· · · ·
The distribution (cloud) moves through parameter space.
Individual perturbations are transient — only the center persists.
This is fundamentally different from population-based approaches where multiple diverse solutions coexist. The trade-off:
| Property | ES (Single Center + Distribution) | Population-Based (AlphaEvolve) |
|---|---|---|
| Diversity | Low (Gaussian around one point) | High (multiple diverse solutions) |
| Memory | Very low | High (N full models) |
| Exploration | Local (radius σ around center) | Global (multiple starting points) |
| Risk of local optima | Higher | Lower |
| Implementation complexity | Very low | High |
Cross-Task Transfer
The paper does not investigate cross-task transfer — each experiment starts from a pre-trained model and fine-tunes for a single task. However, the results suggest several transfer possibilities:
- Sequential fine-tuning. An ES-tuned model for one task could serve as the starting point for ES fine-tuning on a second task.
- Multi-task reward. A composite reward function combining multiple task metrics could enable simultaneous multi-task fine-tuning.
- Curriculum learning. Starting with an easy task (high reward signal) and progressively adding harder tasks could improve sample efficiency.
None of these are explored in the paper, representing opportunities for future work.
Continued Improvement Beyond Reported Results
The paper's accelerated version (10× speed-up) suggests that the original results may not represent the limit of ES performance. With 10× faster iterations, significantly more iterations become practical, potentially yielding better final performance.
The paper also notes that standard ES enhancements (mirrored sampling, rank transformation, Adam optimizer) were deliberately excluded. Adding these could further improve performance, and the paper explicitly invites this future work.
15 Applications
Primary Application: LLM Post-Training
The paper positions ES as a general-purpose post-training paradigm for LLMs. Current applications demonstrated:
| Application | Task Type | Reward Structure | ES Advantage |
|---|---|---|---|
| Reasoning fine-tuning | Symbolic reasoning (Countdown) | Binary outcome | Long-horizon tolerance, cross-model robustness |
| Math reasoning | GSM8K, MATH500, etc. | Binary correctness | Competitive with SOTA RL, more stable |
| Behavioral tuning | Conciseness optimization | Composite (quality + length) | Reward hacking resistance |
| Puzzle solving | Number sequences, logic grids | Binary/custom | Novel solutions unreachable by base models |
Future Applications Suggested by the Paper
1. RLHF replacement. ES could replace PPO/GRPO in the RLHF pipeline, using human preference reward models but optimizing via ES rather than RL. The reward hacking resistance is particularly valuable for alignment.
2. Instruction following. Fine-tuning models to follow instructions more precisely, where correctness is binary (followed the instruction or didn't).
3. Code generation. Fine-tuning for code generation, where reward is based on passing test cases — a naturally long-horizon, binary-outcome task.
4. Safety alignment. ES's resistance to reward hacking makes it a candidate for safety-critical alignment tasks, where exploiting loopholes in the reward function is a major concern.
5. Distributed fine-tuning. ES's embarrassingly parallel nature makes it ideal for distributed fine-tuning across multiple machines or even multiple data centers. Only scalar rewards need to be communicated, not gradients.
Implications for the Evolutionary AI Field
This paper has several implications for the broader evolutionary AI landscape:
1. Neuroevolution is back. The paper revives direct parameter-space optimization of neural networks — a field that had been dormant since the early 2010s when gradient-based methods became dominant. By showing that ES works at billion-parameter scale, it reopens research directions that were considered closed.
2. Zeroth-order optimization is viable. The success of ES (a zeroth-order method) at scale challenges the assumption that gradient information is necessary for efficient optimization of LLMs. This opens the door to other zeroth-order methods (CMA-ES, random search, simulated annealing) being applied to LLMs.
3. Backpropagation is not always necessary. For post-training (as opposed to pre-training), backpropagation may not be the optimal optimization strategy. ES's ability to avoid backprop has practical benefits (memory, simplicity) without sacrificing quality.
4. Small populations suffice. The N=30 finding challenges the conventional wisdom in evolutionary computation that population size must scale with problem dimensionality. This has implications for all population-based optimization methods applied to high-dimensional problems.
Relevance to OmniEvolve
This paper is highly relevant to the OmniEvolve project from multiple angles:
| OmniEvolve Component | Relevance | Integration Potential |
|---|---|---|
| Search backends | ES could be a search backend for parameter-space optimization | Implement as ESSearchBackend with layered perturbation |
| Mutation operators | Gaussian perturbation is a principled mutation operator | Complement LLM-based mutations with ES-style perturbations |
| Evaluation | Greedy decoding + binary reward is a clean evaluation pattern | Adopt for tasks with binary correctness |
| Memory management | Seed-based noise storage is an efficient memory pattern | Apply to candidate storage in general |
| Benchmarks | Countdown task is a well-defined benchmark | Include in benchmark suite |
Limitations
The paper acknowledges several limitations:
- Scale ceiling unknown. The largest model tested is 8B parameters. Whether ES remains effective at 70B or 405B is an open question.
- Convergence speed. ES is slower per-iteration than RL for some tasks. The total compute may be higher, even if the per-run cost is lower (due to hyperparameter stability).
- Exploration radius. With fixed σ, the exploration radius is limited. Adaptive σ (as in CMA-ES) could improve performance but adds complexity.
- No theoretical guarantees. The paper provides empirical evidence but no convergence guarantees for ES in billion-parameter spaces. The theoretical analysis of Vemula et al. (2019) would predict poor performance, which is empirically falsified but not theoretically explained.
- Limited to post-training. ES is applied to fine-tuning from pre-trained models, not to pre-training from scratch. Whether ES could scale to pre-training is an open question.
Impact Assessment
| Dimension | Assessment |
|---|---|
| Scientific novelty | Very High — first successful full-parameter ES at billion scale |
| Practical utility | High — cheaper, more stable, no reward hacking |
| Reproducibility | High — full code, public models, fixed hyperparameters |
| Generality | High — works across model families, sizes, and tasks |
| Theoretical depth | Medium — empirical strength, limited theoretical explanation |
| Community adoption | Growing — 340+ stars, ICML acceptance, active discussions |
| Long-term impact | Potentially transformative — could reshape post-training paradigm |
Position in the Evolutionary AI Landscape
Parameter Space ─────────────────────────────────────► Action/Code Space
│ │
│ ES at Scale AlphaEvolve / FunSearch │
│ (this paper) (Google DeepMind) │
│ ┌───────────┐ ┌─────────────────────┐ │
│ │ Perturb │ │ LLM generates code │ │
│ │ model │ │ mutations; evaluator │ │
│ │ weights │ │ scores code quality │ │
│ │ directly │ │ │ │
│ └───────────┘ └─────────────────────┘ │
│ │
│ Sakana Model Merging EvoPrompting │
│ ┌───────────┐ ┌─────────────────────┐ │
│ │ Evolve │ │ Evolve prompts / │ │
│ │ merging │ │ prompt templates │ │
│ │ weights │ │ │ │
│ └───────────┘ └─────────────────────┘ │
│ │
Low-level ◄─────────────────────────────────────── High-level
(continuous) (discrete/symbolic)
ES at Scale occupies the lowest-level, most direct position in this landscape — it evolves the raw parameters of the model, without any abstraction layer (code, prompts, merging weights). This directness is both its strength (no information loss through abstraction) and its limitation (exploration is local in parameter space).