
Evolution Strategies at Scale

First successful application of ES to full-parameter LLM fine-tuning at billion-parameter scale without dimensionality reduction

Organization: Cognizant AI Labs / University of Texas at Austin
Published: September 2025 (v1), February 2026 (v2)
Type: Research Paper (ICML 2026)
Report Type: PhD-Level Technical Analysis
Report Date: April 2026


1 Full Title and Attribution

Full Title: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

arXiv: 2509.24372 (cs.LG, cs.AI, cs.NE)

Repository: github.com/VsonicV/es-fine-tuning-paper (340+ stars)

Venue: ICML 2026

Submission History:

- v1: September 29, 2025 (original submission)
- v2: February 6, 2026 (revised version with additional experiments)

DOI: 10.48550/arXiv.2509.24372

License: CC BY-NC-SA 4.0

Lineage: Builds on the intellectual foundation of OpenAI ES (Salimans et al., 2017) and Natural Evolution Strategies (Wierstra et al., 2008, 2014). Positioned as the first work to scale ES from millions of parameters (prior art) to billions of parameters (LLMs). The paper explicitly challenges the assumption — widely held since Vemula et al. (2019) — that parameter-space exploration is inherently unscalable.

Key Claim: ES is not merely a viable alternative to RL for LLM fine-tuning, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning.

2 Authors and Team

Author List

Author Affiliation Role / Expertise
Xin Qiu Cognizant AI Labs Lead author, corresponding author. ES implementation and experimental design.
Yulu Gan Cognizant AI Labs ES implementation, experimental evaluation
Conor F. Hayes Cognizant AI Labs RL baseline implementation, comparative analysis
Qiyao Liang Cognizant AI Labs Experimental evaluation, benchmark design
Yinggan Xu Cognizant AI Labs Infrastructure, scaling experiments
Roberto Dailey Cognizant AI Labs Infrastructure, GPU parallelization
Elliot Meyerson Cognizant AI Labs Evolutionary computation expertise, research direction
Babak Hodjat Cognizant AI Labs Senior research leadership, evolutionary AI strategy
Risto Miikkulainen Cognizant AI Labs / UT Austin Senior author. Neuroevolution pioneer, NEAT inventor

Team Context

This paper comes from Cognizant AI Labs, the research arm of Cognizant Technology Solutions, which maintains one of the most established evolutionary computation research groups in industry. The team is led by Babak Hodjat (co-founder of Sentient Technologies, one of the largest AI startups focused on evolutionary computation) and Risto Miikkulainen (UT Austin professor, inventor of NEAT — NeuroEvolution of Augmenting Topologies, one of the most cited neuroevolution algorithms).

Elliot Meyerson is notable for his work on evolutionary search with LLMs as mutation operators — his 2024 paper on "Language Model Crossover" is a key precursor to the LLM-as-evolutionary-operator paradigm used in systems like AlphaEvolve.

The team's collective expertise in neuroevolution gives this paper particular authority. When the group that invented NEAT and scaled evolutionary optimization to production systems at Sentient Technologies claims that ES scales to billion-parameter LLMs, the community pays attention.

Institutional Significance

Cognizant AI Labs occupies a unique position in the evolutionary AI landscape:

Institution Focus Key Contributions
Google DeepMind LLM-as-mutation-operator (code evolution) AlphaEvolve, FunSearch, AlphaTensor
Sakana AI Model merging via evolution Evolutionary Model Merging
Cognizant AI Labs Direct parameter-space evolution of LLMs This paper (ES at Scale)
OpenAI (historical) ES for RL policy optimization OpenAI ES (Salimans et al., 2017)

While DeepMind uses LLMs to evolve code (programs, algorithms), and Sakana uses evolution to merge LLMs, Cognizant's contribution is fundamentally different: they use evolution to directly optimize the parameters of LLMs. This is the most classical form of neuroevolution, applied at unprecedented scale.

3 Core Contribution

Key Novelty: For the first time, Evolution Strategies (ES) is successfully scaled to direct, full-parameter fine-tuning of LLMs with billions of parameters — without any dimensionality reduction (no LoRA, no final-layer-only, no action-space surrogates). The paper demonstrates that ES outperforms established RL methods (PPO, GRPO, Dr.GRPO) across multiple axes, overturning the widespread assumption that ES cannot scale to modern model sizes.

The Assumption Overturned

The conventional wisdom in the field, established by Vemula et al. (2019), held that parameter-space exploration complexity scales quadratically with dimensionality (\(O(d^2)\), where \(d\) is the parameter count), making it intractable for models with billions of parameters. Prior ES applications had been limited to:

Prior Work Year Parameters Population Size
Salimans et al. (OpenAI ES) 2017 ~4M 10,000+
Zhang et al. 2017 ~3M 10,000+
Lehman et al. 2018 ~167K Large
Lorenc & Neruda 2025 ~2.5M Large
Toledano-López et al. 2022 325 (last layer only) Small
Jin et al. 2024 1,600 (LoRA adapters only) Small

This paper:

This Work Year Parameters Population Size
ES at Scale 2025–2026 0.5B – 8B 30

The jump is dramatic: from millions to billions of parameters, and from populations of 10,000+ to just 30. Both changes were assumed to be individually fatal to ES performance — together, they should have been catastrophic. Instead, ES outperforms RL.

Six Advantages of ES Over RL for LLM Fine-Tuning

The paper identifies six systematic advantages:

  1. Long-horizon reward tolerance. ES needs only response-level (outcome) rewards, not token-level credit assignment. For reasoning tasks where only the final answer is graded, ES avoids the credit assignment problem entirely.

  2. Small populations in high-dimensional spaces. A population of just 30 is sufficient to search in multi-billion-parameter spaces. Previous work assumed populations must be proportional to dimensionality.

  3. Cross-model robustness. ES consistently fine-tunes all tested LLMs (Qwen-2.5, Llama-3.x families, 0.5B–8B). RL methods fail on some models, particularly smaller ones.

  4. Reward hacking resistance. ES optimizes a solution distribution (the Gaussian perturbation cloud), which is harder to hack than RL's single-solution optimization. RL tends to exploit reward function loopholes.

  5. Cross-run stability. ES produces consistent results across multiple runs with the same hyperparameters. RL is often unstable, requiring expensive hyperparameter sweeps per model.

  6. Memory efficiency. ES requires only inference (no backpropagation), eliminating the need for gradient storage, optimizer states, and activation caching. Significant GPU memory savings.

Paradigm Positioning

The paper positions ES not as a niche technique but as a new post-training paradigm:

Pre-training ─────────────────────────────────────────►
    │                    │                    │
    ▼                    ▼                    ▼
 SFT (Supervised      RLHF / RL            ES Fine-Tuning
 Fine-Tuning)         (PPO, GRPO,           (This Paper)
                       DPO, etc.)
                                            • No gradients
 • Gradient-based     • Gradient-based      • No backprop
 • Requires labels    • Requires reward     • Requires reward
 • Deterministic        model or rule         function only
   training           • Token-level or      • Response-level
                        outcome rewards       rewards only
                      • Action-space        • Parameter-space
                        exploration           exploration

4 Supported Solutions

Fine-Tuning Tasks Evaluated

The paper evaluates ES on four categories of tasks:

1. Symbolic Reasoning: Countdown Task

The Countdown task (Gandhi et al., 2024; Pan et al., 2025) requires the model to combine given numbers using arithmetic operations to reach a target number.

Feature Detail
Task format Given numbers [a, b, c, d], reach target T using +, −, ×, ÷
Reward type Binary outcome: correct (1.0) or incorrect (0.0)
Horizon Long — full response generation before reward
Training set 200 sampled problems
Test set Held-out evaluation set
Models tested 7 models across Qwen-2.5 and Llama-3.x families (0.5B–8B)
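The binary outcome reward above can be sketched in code. This is a hypothetical illustration, not the paper's evaluator: the extraction of the final arithmetic expression from the model's response is task-specific and omitted, and `countdown_reward` is an assumed helper name. Python's `ast` module is used for safe arithmetic evaluation.

```python
import ast
import operator

# Hypothetical sketch of the Countdown outcome reward: 1.0 iff the expression
# uses exactly the given numbers and evaluates to the target, else 0.0.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate a parsed expression built from numbers and + - * / only."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Binary outcome reward for a candidate arithmetic expression."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
        if used != sorted(numbers):          # must use exactly the given numbers
            return 0.0
        return 1.0 if abs(_eval(tree.body) - target) < 1e-9 else 0.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(6 - 2) * (25 + 0)", [2, 6, 25, 0], 100))  # → 1.0
```

Note that the reward is purely response-level: no token-level credit assignment is needed, which is exactly the setting where ES's outcome-only optimization applies.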

2. Behavioral Fine-Tuning: Conciseness

Fine-tuning LLMs to produce shorter, more concise responses to knowledge questions.

Feature Detail
Task format Answer knowledge questions concisely
Reward type Composite: correctness × conciseness penalty
Horizon Full response generation
Dataset 500 questions from knowledge benchmarks
Models tested Qwen-2.5-7B-Instruct
Key observation RL hacks the reward by degenerating to single-token answers; ES maintains coherent responses
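A composite reward of the general shape described above (correctness gated by a length penalty) might look like the following sketch. The paper's exact formula is not reproduced here; the function name, threshold, and penalty shape are illustrative assumptions.

```python
# Hypothetical composite conciseness reward: correctness × length penalty.
# The threshold (max_tokens) and the penalty's decay shape are assumptions.
def conciseness_reward(response: str, answer: str, max_tokens: int = 50) -> float:
    correct = 1.0 if answer.lower() in response.lower() else 0.0
    n_tokens = len(response.split())
    # Penalty is 1.0 up to max_tokens, then decays with response length.
    penalty = min(1.0, max_tokens / max(n_tokens, 1))
    return correct * penalty
```

Under a reward of this shape, a policy can score highly by emitting minimal answers, which is the degenerate behavior the table above attributes to GRPO; ES-tuned models remain coherent while still shortening responses.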

3. Math Reasoning: GSM8K, MATH500, Minerva Math, OlympiadBench

Extended comparisons with SOTA RL methods:

Benchmark Type Difficulty
GSM8K Grade school math Easy
MATH500 Competition math Medium–Hard
Minerva Math Mathematical reasoning Hard
OlympiadBench Olympiad-level problems Very Hard

4. Puzzle Problem Solving

ES is applied to solve two puzzle problems that base LLMs struggle with:

Puzzle Description ES Contribution
Number sequence puzzles Find patterns in number sequences ES fine-tuning enables discovery of solutions base models cannot find
Logic grid puzzles Constraint satisfaction problems ES-tuned models show improved systematic reasoning

Solution Space Characterization

Dimension Value
Search space Full parameter space of transformer LLMs (0.5B–8B parameters)
Solution representation Continuous real-valued vectors (model weights)
Evaluation Deterministic (greedy decoding)
Fitness function Task-specific reward (binary correctness, composite scores)
Constraint handling None explicit (reward function encodes constraints)

5 LLM Integration

LLMs as Optimization Targets (Not Operators)

This paper uses LLMs fundamentally differently from systems like AlphaEvolve, FunSearch, or EvoPrompting. In those systems, the LLM is a tool that generates candidate solutions (code, programs). In this paper, the LLM is the optimization target — its parameters are the search space, and ES directly manipulates billions of floating-point values to improve task performance.

AlphaEvolve / FunSearch approach:
┌───────────────┐     ┌──────────────┐     ┌──────────────┐
│ LLM (frozen)  │────►│ Generated    │────►│ Evaluator    │
│ generates code│     │ code/program │     │ scores code  │
└───────────────┘     └──────────────┘     └──────────────┘
        ▲                                          │
        └──────────── fitness feedback ────────────┘

This paper's approach (ES at Scale):
┌─────────────────────────────────────────────────────┐
│              LLM Parameters (θ)                     │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐            │
│  │ Layer 1  │ │ Layer 2  │ │ Layer N  │  ...        │
│  │ weights  │ │ weights  │ │ weights  │  (billions  │
│  │          │ │          │ │          │   of params) │
│  └──────────┘ └──────────┘ └──────────┘            │
│       ↑              ↑              ↑               │
│  ε₁ ~ N(0,I)   ε₂ ~ N(0,I)   εₙ ~ N(0,I)         │
│  (perturbation)                                     │
└─────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────┐     ┌──────────────┐
│ Perturbed LLM│────►│ Reward R(θ+σε)│
│ generates    │     │ (binary or   │
│ responses    │     │  composite)  │
└──────────────┘     └──────────────┘
         │
         ▼
θ ← θ + α · (1/N) · Σ Rₙ · εₙ   (parameter update)
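The parameter update shown above can be sketched on a toy objective. This is a minimal NumPy illustration of the update rule (with the z-score reward normalization the paper applies), not the paper's code; the quadratic reward and all constants are assumptions.

```python
import numpy as np

# Toy ES: maximize R(θ) = -||θ - θ*||² using only reward evaluations,
# via θ ← θ + α·(1/N)·Σ R̃ₙ·εₙ. Illustration only, not the paper's code.
rng = np.random.default_rng(0)
d, N, sigma, alpha = 1000, 30, 0.1, 0.05
theta_star = rng.standard_normal(d)      # unknown optimum
theta = np.zeros(d)                      # starting parameters

def reward(t):
    return -np.sum((t - theta_star) ** 2)

start = reward(theta)
for _ in range(300):
    eps = rng.standard_normal((N, d))                 # εₙ ~ N(0, I)
    R = np.array([reward(theta + sigma * e) for e in eps])
    R = (R - R.mean()) / (R.std() + 1e-8)             # z-score normalization
    theta += alpha * (R @ eps) / N                    # the update above

print(reward(theta) > start)  # → True: reward improves without any gradients
```

Only forward evaluations of `reward` are ever used; no gradient of the objective is computed anywhere, which is the defining property of the approach.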

Models Evaluated

Model Family Model Parameters Architecture
Qwen-2.5 Qwen-2.5-0.5B-Instruct 0.5B Transformer decoder
Qwen-2.5 Qwen-2.5-1.5B-Instruct 1.5B Transformer decoder
Qwen-2.5 Qwen-2.5-3B-Instruct 3B Transformer decoder
Qwen-2.5 Qwen-2.5-7B-Instruct 7B Transformer decoder
Llama-3.2 Llama-3.2-1B-Instruct 1B Transformer decoder
Llama-3.2 Llama-3.2-3B-Instruct 3B Transformer decoder
Llama-3.1 Llama-3.1-8B-Instruct 8B Transformer decoder

Inference-Only Operation

A critical property of ES fine-tuning is that it requires only forward passes through the model:

Operation RL (PPO/GRPO) ES (This Paper)
Forward pass Yes Yes
Backward pass (backprop) Yes No
Gradient computation Yes No
Optimizer state (Adam) Yes No
Activation caching Yes No
Reference model Yes (for KL penalty) No

This has profound implications for GPU memory:

GPU Memory Layout — RL Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16)               │  ~16 GB (8B)  │
│ Gradient buffers                   │  ~16 GB       │
│ Optimizer states (Adam: m, v)      │  ~32 GB       │
│ Activation cache (for backprop)    │  ~8–32 GB     │
│ Reference model (KL penalty)       │  ~16 GB       │
├────────────────────────────────────┼───────────────┤
│ TOTAL                              │  ~88–112 GB   │
└────────────────────────────────────┴───────────────┘

GPU Memory Layout — ES Fine-Tuning:
┌────────────────────────────────────────────────────┐
│ Model weights (bf16)               │  ~16 GB (8B)  │
│ Layer-sized noise tensor (temp)    │  ~0.1–2 GB    │
│ Random seeds (N integers)          │  negligible   │
├────────────────────────────────────┼───────────────┤
│ TOTAL                              │  ~16–18 GB    │
└────────────────────────────────────┴───────────────┘

This ~5–6× memory reduction is a substantial practical advantage, enabling fine-tuning of larger models on smaller GPU clusters.
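The per-parameter accounting implied by the two layouts above can be made explicit. This is an illustrative back-of-the-envelope calculator, not a profiler; it uses bf16 weights (2 bytes/parameter) and the byte counts tallied in the tables, and excludes the activation cache.

```python
def memory_gb(n_params: float, method: str) -> float:
    """Rough GPU-memory estimate in GB, using the per-parameter byte counts
    implied by the layouts above (bf16 weights; activation and KV caches
    excluded). Illustrative accounting only."""
    weights = 2 * n_params / 1e9              # bf16: 2 bytes per parameter
    if method == "es":
        return weights                        # plus one small layer-sized noise tensor
    # RL additionally holds gradients, Adam states, and a frozen reference model.
    grads = 2 * n_params / 1e9
    adam = 4 * n_params / 1e9                 # optimizer states m and v, as tallied above
    ref = 2 * n_params / 1e9
    return weights + grads + adam + ref

print(memory_gb(8e9, "rl"), memory_gb(8e9, "es"))  # → 80.0 16.0
```

Adding the 8–32 GB activation cache to the RL figure recovers the ~88–112 GB total in the table, against ~16–18 GB for ES.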

6 Key Results

Headline Result: Countdown Task (Table 1)

ES outperforms all RL baselines across all 7 tested models:

Base Model Params Original Best RL ES ES Δ vs Best RL
Qwen-2.5-0.5B-Instruct 0.5B 0.1% 13.5% (Dr.GRPO) 14.4% +0.9
Qwen-2.5-1.5B-Instruct 1.5B 0.7% 31.0% (Dr.GRPO) 37.3% +6.3
Qwen-2.5-3B-Instruct 3B 10.0% 43.8% (Dr.GRPO) 60.5% +16.7
Qwen-2.5-7B-Instruct 7B 31.2% 57.5% (Dr.GRPO) 66.8% +9.3
Llama-3.2-1B-Instruct 1B 0.4% 14.9% (GRPO-v) 16.8% +1.9
Llama-3.2-3B-Instruct 3B 3.2% 47.8% (Dr.GRPO) 51.6% +3.8
Llama-3.1-8B-Instruct 8B 8.1% 51.3% (GRPO-z 30) 61.2% +9.9

Key observations:

  1. ES wins on every model. Not a single RL variant beats ES on any model size or family.
  2. Largest gains on mid-sized models. Qwen-2.5-3B shows the most dramatic improvement: 60.5% vs 43.8%, a +16.7 absolute improvement.
  3. Single hyperparameter set. ES uses the same hyperparameters (N=30, σ=0.001, α=5×10⁻⁴) for ALL models. RL requires per-model hyperparameter sweeps.
  4. Cross-family robustness. ES works on both Qwen and Llama families. RL performance varies significantly across families.

Reward Hacking Analysis (Conciseness Task)

The conciseness task reveals a qualitative difference between ES and RL:

Method Reward Score Actual Behavior Explanation
GRPO Very high Degenerates to single-token or very short incoherent answers Hacks the conciseness reward by minimizing length at the expense of content
ES High Produces genuinely concise but coherent and correct answers Optimizes the distribution, making extreme reward-hacking behaviors unlikely

The paper explains: "ES optimizes a solution distribution (the perturbation cloud), which is more difficult to hack, while RL optimizes a single solution."

This is a fundamental insight. RL fine-tuning produces a single policy that can find and exploit reward function loopholes. ES produces a distribution of nearby policies — for any reward hack to persist, it must be robust to Gaussian perturbation of all parameters, which is much harder for degenerate behaviors to achieve.

Cross-Run Stability

ES shows dramatically lower variance across independent runs:

Method Mean Accuracy Std Dev Across Runs Interpretation
GRPO ~40% ±5–10% High variance; some runs fail entirely
Dr.GRPO ~45% ±3–7% Moderate variance; improved but not stable
ES ~55% ±1–3% Low variance; consistent results

This stability has practical cost implications. If RL requires 3–5 runs to find a good hyperparameter configuration plus 2–3 runs for reliability, and ES requires a single run, the total compute cost may be lower for ES despite its per-iteration expense.

Math Reasoning Benchmarks (Extended Results)

ES is compared against additional SOTA RL baselines on math reasoning:

Benchmark GRPO Dr.GRPO DAPO ES ES Rank
GSM8K Competitive Competitive Competitive Competitive Top-2
MATH500 Competitive Competitive Competitive Competitive Top-2
Minerva Math Competitive Top-3
OlympiadBench Competitive Top-3

On these standard math benchmarks, ES performs comparably to the best RL methods, demonstrating that the advantages of ES (robustness, stability, no reward hacking) do not come at the cost of reduced performance on well-studied tasks.

Hyperparameter Sensitivity

Hyperparameter ES Sensitivity RL Sensitivity Implication
Learning rate (α) Low Very high RL requires careful per-model tuning
Population/group size (N) Low Moderate ES works with N=30; RL performance varies with group size
Noise scale (σ) Moderate N/A ES-specific; σ=0.001 works across models
KL penalty (β) N/A Very high RL-specific; wrong β causes training collapse or reward hacking

The paper reports that a single ES configuration works across all 7 models, while RL requires separate hyperparameter sweeps for each model — a significant practical advantage.

7 Reproducibility

Code Availability

The complete source code is available at github.com/VsonicV/es-fine-tuning-paper:

File Purpose
es_fine-tuning_conciseness.py ES fine-tuning for conciseness task (correlated noise)
es_fine-tuning_conciseness_iid.py ES fine-tuning for conciseness task (i.i.d. noise)
countdown/es_fine-tuning_countdown.py ES fine-tuning for Countdown task (correlated noise)
countdown/es_fine-tuning_countdown_iid.py ES fine-tuning for Countdown task (i.i.d. noise)
es_fine-tuning_countdown_accl.py Accelerated version with 10× speed-up (vLLM-based)
requirement.txt Python dependencies

Setup and Execution

# Environment setup
python -m venv es
source es/bin/activate
pip install -r requirement.txt

# Conciseness fine-tuning (2 GPUs)
accelerate launch \
    --num_processes 2 \
    --num_machines 1 \
    --machine_rank 0 \
    es_fine-tuning_conciseness.py \
    --gpu_threads=1 \
    --model_name=Qwen/Qwen2.5-7B-Instruct

# Countdown fine-tuning (4 GPUs)
accelerate launch \
    --num_processes 4 \
    --num_machines 1 \
    --machine_rank 0 \
    countdown/es_fine-tuning_countdown.py \
    --data_sample 200 \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --gpu_threads 1

# Accelerated version with vLLM (4 GPUs)
python es_fine-tuning_countdown_accl.py \
    --model_name Qwen/Qwen2.5-3B-Instruct \
    --cuda_devices 0,1,2,3 \
    --num_engines 4 \
    --population_size 30 \
    --num_iterations 1000

Reproducibility Assessment

Criterion Assessment Notes
Code available Yes Full source code on GitHub
Data available Yes Standard public benchmarks (Countdown, GSM8K, MATH)
Models available Yes Public HuggingFace models (Qwen, Llama)
Fixed hyperparameters Yes N=30, σ=0.001, α=5×10⁻⁴ for all ES experiments
Random seed control Partial Seeds used for noise generation but not all sources of randomness documented
Hardware requirements Moderate 2–4 GPUs (80GB each) for most experiments
Accelerated version Yes 10× speed-up using vLLM for inference
RL baselines Partial RL implementation details referenced but separate hyperparameter sweeps needed

Noise Variants

The repository provides two noise implementations:

Variant File Description
Correlated noise es_fine-tuning_*.py Partially correlated noise across dimensions (original paper implementation)
i.i.d. noise es_fine-tuning_*_iid.py Independent noise in each parameter dimension

The discussion at github.com/VsonicV/es-fine-tuning-paper/discussions/7 provides additional details on the difference. Both variants achieve similar results.

8 Compute and API Costs

Hardware Requirements

Configuration GPUs GPU Memory Purpose
Minimum (small models) 1–2 × A100/H100 (80GB) 80–160 GB total Qwen-0.5B, Qwen-1.5B
Standard (medium models) 4 × A100/H100 (80GB) 320 GB total Qwen-3B, Llama-3B
Full (large models) 4–8 × H100 (80GB) 320–640 GB total Qwen-7B, Llama-8B

Per-Iteration Cost Analysis

Each ES iteration involves N=30 perturbed model evaluations:

Per-iteration compute:
┌────────────────────────────────────────────────────────┐
│ 1. Generate N=30 noise seeds                │ ~0 sec  │
│ 2. For each of N=30 perturbations:          │         │
│    a. Perturb model (layer-by-layer)        │ ~2 sec  │
│    b. Run inference on training batch       │ ~10 sec │
│    c. Compute reward                        │ ~1 sec  │
│    d. Restore model                         │ ~2 sec  │
│ 3. Normalize rewards (z-score)              │ ~0 sec  │
│ 4. Aggregate update (layer × seed)          │ ~5 sec  │
├────────────────────────────────────────────────────────┤
│ Total per iteration (serial, 4 GPUs):       │ ~2 min  │
│ Total per iteration (accelerated, vLLM):    │ ~12 sec │
└────────────────────────────────────────────────────────┘

Total Training Cost Estimation

For a typical Countdown experiment (1000 iterations, 4 × H100):

Version Time per Iter Total Time GPU Hours Est. Cloud Cost
Original ~2 min ~33 hours ~132 H100-hrs ~$400–$500
Accelerated (10×) ~12 sec ~3.3 hours ~13 H100-hrs ~$40–$50
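The table's figures follow from simple arithmetic. The sketch below reproduces them, assuming a cloud rate of roughly $3 per H100-hour (an assumed price, not from the paper).

```python
# Sanity check of the cost table above: 1000 iterations on 4 GPUs,
# assuming ~$3 per H100-hour (assumed cloud rate).
def training_cost(iters: int, sec_per_iter: float, gpus: int,
                  usd_per_gpu_hr: float = 3.0):
    wall_hours = iters * sec_per_iter / 3600
    gpu_hours = wall_hours * gpus
    return wall_hours, gpu_hours, gpu_hours * usd_per_gpu_hr

print(training_cost(1000, 120, 4))  # original: ~33 h wall, ~133 GPU-hrs, ~$400
print(training_cost(1000, 12, 4))   # accelerated: ~3.3 h, ~13 GPU-hrs, ~$40
```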

Cost Comparison with RL

Method GPU Memory Training Time Hyperparameter Tuning Total Cost
PPO ~100 GB (8B model) ~8–24 hours 3–5 sweeps needed $500–$2,000
GRPO ~80 GB (8B model) ~8–24 hours 3–5 sweeps needed $400–$1,500
ES (original) ~18 GB (8B model) ~33 hours 1 config for all models $400–$500
ES (accelerated) ~18 GB (8B model) ~3.3 hours 1 config for all models $40–$50

The accelerated ES version is dramatically cheaper than RL alternatives, primarily because:

  1. No hyperparameter sweeps needed (one config works for all models)
  2. Lower memory footprint enables smaller/fewer GPUs
  3. vLLM-based inference is highly optimized
  4. No backpropagation overhead

Scaling Properties

Parameter Cost Impact Scaling
Model size (d) Linear Larger models = longer inference
Population size (N) Linear More perturbations = more evaluations
Iterations (T) Linear More iterations = longer training
GPU count Inverse linear Parallelism reduces wall time
Training data Sublinear Batch size matters, not dataset size

9 Architecture Solution

Algorithm Architecture

The ES fine-tuning system has a clean, modular architecture:

┌─────────────────────────────────────────────────────────────────┐
│                    ES FINE-TUNING ARCHITECTURE                  │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    OUTER LOOP (T iterations)             │   │
│  │                                                          │   │
│  │  ┌──────────────────────────────────────────────────┐    │   │
│  │  │           INNER LOOP (N perturbations)           │    │   │
│  │  │                                                  │    │   │
│  │  │  For n = 1 to N (parallelizable):                │    │   │
│  │  │  ┌─────────────────────────────────────────────┐ │    │   │
│  │  │  │  1. Sample seed sₙ                          │ │    │   │
│  │  │  │  2. Perturb θ in-place (layer by layer):    │ │    │   │
│  │  │  │     For each layer ℓ:                       │ │    │   │
│  │  │  │       εₗ = generate_noise(sₙ, ℓ)           │ │    │   │
│  │  │  │       θₗ += σ · εₗ                          │ │    │   │
│  │  │  │  3. Generate responses (greedy decoding)    │ │    │   │
│  │  │  │  4. Compute reward Rₙ = R(responses)        │ │    │   │
│  │  │  │  5. Restore θ in-place (layer by layer):    │ │    │   │
│  │  │  │     For each layer ℓ:                       │ │    │   │
│  │  │  │       εₗ = generate_noise(sₙ, ℓ)           │ │    │   │
│  │  │  │       θₗ -= σ · εₗ                          │ │    │   │
│  │  │  └─────────────────────────────────────────────┘ │    │   │
│  │  └──────────────────────────────────────────────────┘    │   │
│  │                                                          │   │
│  │  Normalize: R̃ₙ = (Rₙ - mean(R)) / std(R)               │   │
│  │                                                          │   │
│  │  Update (decomposed, layer by layer, seed by seed):      │   │
│  │  For each layer ℓ:                                       │   │
│  │    For each seed sₙ:                                     │   │
│  │      εₗ = generate_noise(sₙ, ℓ)                         │   │
│  │      θₗ += α · (1/N) · R̃ₙ · εₗ                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    EVALUATION                            │   │
│  │  • Periodic evaluation on held-out test set              │   │
│  │  • Checkpoint best-performing model                      │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
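The outer/inner loop above can be sketched as a short runnable program, with small NumPy arrays standing in for transformer layers. Noise is regenerated from each stored seed at perturb, restore, and update time, so no noise tensor is ever kept; this mirrors the structure of the algorithm, not the paper's actual implementation, and the toy reward is an assumption.

```python
import numpy as np

# Seed-based ES loop: perturb in place, evaluate, restore in place, then
# apply the decomposed (layer × seed) update. Structure mirrors the diagram
# above; toy model and reward are stand-ins.
np.random.seed(0)
layers = [np.zeros(8), np.zeros(5)]            # toy "model parameters" θ
target = [np.ones(8), np.full(5, -1.0)]        # optimum of the toy reward
N, sigma, alpha = 30, 0.1, 0.05                # population, noise scale, LR

def reward():
    # Stand-in for task reward R(θ): higher is better, maximum 0 at target.
    return -sum(np.sum((l - t) ** 2) for l, t in zip(layers, target))

def noise(seed):
    # Same seed → identical per-layer noise, regenerated on demand.
    g = np.random.default_rng(seed)
    return [g.standard_normal(l.shape) for l in layers]

for _ in range(200):                           # outer loop (T iterations)
    seeds = np.random.randint(0, 2**31, size=N)
    R = np.empty(N)
    for n, s in enumerate(seeds):              # inner loop (parallelizable)
        for l, e in zip(layers, noise(s)):
            l += sigma * e                     # perturb in place, layer by layer
        R[n] = reward()
        for l, e in zip(layers, noise(s)):
            l -= sigma * e                     # restore in place
    R = (R - R.mean()) / (R.std() + 1e-8)      # z-score normalization
    for n, s in enumerate(seeds):              # decomposed update: layer × seed
        for l, e in zip(layers, noise(s)):
            l += alpha * R[n] * e / N

print(reward())  # substantially improved from the initial reward of -13
```

Peak extra memory at any moment is one layer-sized noise tensor plus N integers, which is the property that makes the approach viable at billion-parameter scale.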

Parallelization Strategy

The inner loop (N perturbations) is embarrassingly parallel — each perturbation evaluation is independent:

GPU 0                  GPU 1                  GPU 2                  GPU 3
┌──────────────┐      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ Model copy   │      │ Model copy   │      │ Model copy   │      │ Model copy   │
│              │      │              │      │              │      │              │
│ Perturb s₁   │      │ Perturb s₂   │      │ Perturb s₃   │      │ Perturb s₄   │
│ Evaluate R₁  │      │ Evaluate R₂  │      │ Evaluate R₃  │      │ Evaluate R₄  │
│ Restore      │      │ Restore      │      │ Restore      │      │ Restore      │
│              │      │              │      │              │      │              │
│ Perturb s₅   │      │ Perturb s₆   │      │ Perturb s₇   │      │ Perturb s₈   │
│ Evaluate R₅  │      │ Evaluate R₆  │      │ Evaluate R₇  │      │ Evaluate R₈  │
│ Restore      │      │ Restore      │      │ Restore      │      │ Restore      │
│    ...       │      │    ...       │      │    ...       │      │    ...       │
└──────┬───────┘      └──────┬───────┘      └──────┬───────┘      └──────┬───────┘
       │                     │                     │                     │
       └─────────────┬───────┴─────────────┬───────┘                     │
                     │                     │                             │
                     └─────── All-gather rewards ────────────────────────┘
                                    │
                                    ▼
                          Normalize + Aggregate Update
                          (decomposed across layers × seeds)

With gpu_threads > 1, each GPU can evaluate multiple perturbations concurrently using separate CUDA streams, further increasing parallelism.

Accelerated Architecture (vLLM)

The accelerated version replaces the standard HuggingFace inference with vLLM engines:

┌─────────────────────────────────────────────────────────────────┐
│                 ACCELERATED ES ARCHITECTURE                     │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              vLLM Inference Engines (per GPU)            │   │
│  │                                                          │   │
│  │  Engine 0 (GPU 0)    Engine 1 (GPU 1)    ...  Engine K   │   │
│  │  ┌───────────────┐   ┌───────────────┐   ┌────────────┐  │   │
│  │  │ Continuous    │   │ Continuous    │   │ Continuous │  │   │
│  │  │ batching      │   │ batching      │   │ batching   │  │   │
│  │  │ PagedAttention│   │ PagedAttention│   │ PagedAttn  │  │   │
│  │  │ KV cache      │   │ KV cache      │   │ KV cache   │  │   │
│  │  └───────────────┘   └───────────────┘   └────────────┘  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Weight Perturbation Manager                 │   │
│  │                                                          │   │
│  │  • Perturbs vLLM model weights in-place                 │   │
│  │  • Regenerates KV cache after perturbation              │   │
│  │  • Coordinates across engines                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              TensorBoard Logging                         │   │
│  │                                                          │   │
│  │  • Training curves (reward, accuracy)                   │   │
│  │  • Per-iteration statistics                             │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

The key insight: vLLM's continuous batching and PagedAttention make inference dramatically faster than standard HuggingFace generation, yielding the reported 10× speed-up.

10 Component Breakdown

Core Algorithm Components

Component Function Implementation Detail
Noise Generator Produces Gaussian perturbations for model parameters Uses PyTorch RNG with stored seeds; regenerates identical noise from seeds alone
Layer-Level Perturbation Perturbs and restores model weights in-place Iterates over model layers; temporarily allocates one layer-sized tensor
Reward Evaluator Computes task-specific reward for perturbed model Greedy decoding → parse response → compute binary/composite reward
Reward Normalizer Z-score normalization within iteration R̃ = (R - mean(R)) / std(R) ensures consistent scale across iterations
Decomposed Updater Applies aggregated parameter update Triple-nested loop: layers × seeds × parameters; minimal peak memory
Parallelization Manager Distributes perturbation evaluations across GPUs Hugging Face Accelerate (original) or custom vLLM dispatcher (accelerated)

What Is Deliberately Excluded

The paper is notable for what it removes from the standard ES toolbox:

Standard ES Enhancement Included? Rationale
Rank transformation of rewards No Isolates core algorithm performance
Mirrored sampling (antithetic pairs) No Simplifies implementation
Weight decay No Avoids interference with analysis
Virtual batch normalization No Not applicable to LLM architecture
Adam optimizer for update No Uses simple SGD-style update
CMA (covariance adaptation) No Full covariance intractable at billion-parameter scale

"This design choice isolates the core ES algorithm and demonstrates that strong performance can be achieved without auxiliary enhancements. In future work, each individual enhancement can be explored to further improve performance."

This minimalism is a strength of the paper — it demonstrates that raw ES (without engineering tricks) outperforms heavily tuned RL (with tricks and per-model hyperparameter sweeps).

Seven Implementation Innovations

The paper introduces seven implementation modifications that enable billion-parameter ES:

# Innovation Problem Solved Memory Impact
1 Noise retrieval with random seeds Storing N noise vectors of d dimensions is prohibitive Store N integers instead of N×d floats
2 Parallel evaluations Sequential evaluation of N perturbations is slow Assign seeds to GPUs; embarrassingly parallel
3 Layer-level in-place perturbation Allocating a full noise tensor (billions of floats) is prohibitive Only one layer-sized tensor in memory at a time
4 Reward normalization Reward scales vary across iterations and tasks Z-score ensures consistent gradient magnitudes
5 Greedy decoding Stochastic decoding confounds parameter-space and action-space exploration Deterministic evaluation isolates parameter-space effects
6 Decomposed parameter update Aggregating updates requires materializing full-size tensors Layer × seed decomposition minimizes peak memory
7 Learning rate digestion σ in the update equation adds a redundant hyperparameter Absorb 1/σ into α, simplifying tuning

11 Core Mechanisms (Detailed)

Mechanism 1: Natural Evolution Strategies (Simplified)

The theoretical foundation is Natural Evolution Strategies (NES), which views the optimization as operating on the search distribution rather than individual solutions.

Standard NES formulation:

The objective is to maximize the expected reward under a parameterized search distribution (\pi_\psi):

[J(\psi) = \mathbb{E}_{\theta \sim \pi_\psi}[R(\theta)]]

For a Gaussian search distribution (\pi_\psi = \mathcal{N}(\mu, \Sigma)), the natural gradient update on (\mu) (with fixed (\Sigma = \sigma^2 I)) reduces to:

[\mu \leftarrow \mu + \alpha \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \cdot \varepsilon_n]

where (\varepsilon_n \sim \mathcal{N}(0, I)) and (R_n = R(\mu + \sigma \cdot \varepsilon_n)).

Simplifications in this paper:

Standard NES This Paper Effect
Full covariance Σ Fixed Σ = σ²I Eliminates O(d²) covariance estimation
Natural gradient on Σ No adaptation of Σ σ is fixed hyperparameter
Large population (10,000+) N = 30 Dramatically reduces per-iteration cost
1/σ in update equation Absorbed into α One fewer hyperparameter
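The last simplification ("learning rate digestion") is one line of algebra on the standard NES update, which carries a 1/σ factor from the score function of the Gaussian search distribution:

```latex
\mu \;\leftarrow\; \mu + \alpha \cdot \frac{1}{N\sigma} \sum_{n=1}^{N} R_n \,\varepsilon_n
\;=\; \mu + \alpha' \cdot \frac{1}{N} \sum_{n=1}^{N} R_n \,\varepsilon_n,
\qquad \alpha' := \frac{\alpha}{\sigma}
```

so tuning the single step size α' (written simply as α in the update equation above) replaces tuning α and the 1/σ factor separately.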

Mechanism 2: Memory-Efficient Perturbation via Seed-Based Noise

The most critical engineering innovation is the use of random seeds to represent perturbation noise implicitly:

Traditional approach (infeasible):
  Store N noise vectors, each of dimension d
  Memory: N × d × 4 bytes (float32)
  For d = 8B, N = 30: 30 × 8×10⁹ × 4 = 960 GB  ← IMPOSSIBLE

Seed-based approach (this paper):
  Store N random seeds (integers)
  Memory: N × 8 bytes (int64) = 240 bytes  ← TRIVIAL

  To apply perturbation for seed sₙ:
    rng = torch.Generator().manual_seed(sₙ)
    for each layer ℓ:
      εₗ = torch.randn(layer_ℓ.shape, generator=rng)
      layer_ℓ.data += σ · εₗ
      del εₗ  # only one layer-sized tensor at a time

  To restore:
    rng = torch.Generator().manual_seed(sₙ)  # SAME seed!
    for each layer ℓ:
      εₗ = torch.randn(layer_ℓ.shape, generator=rng)
      layer_ℓ.data -= σ · εₗ  # subtract to restore
      del εₗ

The key insight is that torch.randn with a fixed seed produces identical noise each time. By storing only the seed, the system can regenerate the exact noise for:

  1. Applying the perturbation (add σ·ε)
  2. Restoring the original parameters (subtract σ·ε)
  3. Computing the parameter update (weighted by R̃ₙ)
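The seed trick is easy to demonstrate end to end. A minimal runnable sketch, using NumPy arrays as stand-ins for layer tensors (the paper's implementation uses torch.Generator and torch.randn, but the mechanism is the same; SIGMA and the layer shapes are illustrative):

```python
import numpy as np

SIGMA = 0.01  # perturbation scale (illustrative value)

def perturb_(layers, seed, sigma=SIGMA, sign=+1.0):
    # Regenerate the seeded noise stream; add (sign=+1) or subtract (sign=-1) it.
    rng = np.random.default_rng(seed)
    for layer in layers:
        eps = rng.standard_normal(layer.shape)  # regenerated, never stored
        layer += sign * sigma * eps             # in-place; one layer-sized tensor alive
        del eps

# Toy "model": two layer-sized arrays
layers = [np.ones((4, 4)), np.ones(8)]
original = [l.copy() for l in layers]

perturb_(layers, seed=1234)             # apply perturbation
perturb_(layers, seed=1234, sign=-1.0)  # same seed regenerates identical noise -> restore

assert all(np.allclose(l, o) for l, o in zip(layers, original))
```

Only the integer seed persists between the two calls; the noise itself is recomputed on demand.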

Mechanism 3: Decomposed Parameter Update

The standard update equation requires materializing the full update vector:

Standard: Δθ = α · (1/N) · Σₙ R̃ₙ · εₙ

This requires materializing Σₙ R̃ₙ · εₙ, which is d-dimensional.
For d = 8B: 32 GB in float32.

Decomposed (this paper):
  For each layer ℓ:
    For each seed sₙ:
      εₗ = generate_noise(sₙ, ℓ)
      θₗ += α · (1/N) · R̃ₙ · εₗ
      del εₗ

Peak memory: one layer-sized tensor (max ~2 GB for largest layers)

This decomposition exploits the linearity of addition: the order of summation doesn't matter, so we can accumulate the update layer-by-layer and seed-by-seed, never materializing the full d-dimensional update vector.
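The linearity argument can be checked numerically. A toy verification, with hypothetical layer shapes and rewards, that seed-by-seed, layer-by-layer accumulation reproduces the materialized full update:

```python
import numpy as np

N, sigma, lr = 4, 0.01, 0.1
shapes = [(3, 3), (5,)]                     # toy "layers" (illustrative)
seeds = list(range(N))
rewards = np.array([1.2, -0.4, 0.1, -0.9])  # already normalized (illustrative)

def noise(seed):
    # Regenerate one seed's full noise stream, layer by layer.
    rng = np.random.default_rng(seed)
    return [rng.standard_normal(s) for s in shapes]

# Full update: materialize sum_n R_n * eps_n for every layer at once.
full = [sum(rewards[n] * noise(seeds[n])[i] for n in range(N))
        for i in range(len(shapes))]
full = [lr / N * f for f in full]

# Decomposed update: accumulate layer-by-layer, seed-by-seed.
decomposed = [np.zeros(s) for s in shapes]
for i in range(len(shapes)):
    for n in range(N):
        decomposed[i] += lr / N * rewards[n] * noise(seeds[n])[i]

assert all(np.allclose(f, d) for f, d in zip(full, decomposed))
```

The two orderings agree exactly (up to floating-point rounding), which is what licenses the memory-saving decomposition.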

Mechanism 4: Greedy Decoding for Deterministic Evaluation

The use of greedy decoding during evaluation is a deliberate methodological choice:

With stochastic decoding (temperature > 0):
  Same model parameters → different responses each time
  Source of variation: parameter perturbation + sampling randomness
  Cannot attribute performance difference to parameter change alone

With greedy decoding (temperature = 0):
  Same model parameters → identical response every time
  Source of variation: parameter perturbation only
  Clean attribution: any performance difference is due to the perturbation

This is analogous to controlled experiments in science — by eliminating one source of variation (decoding randomness), the system can attribute performance differences purely to the parameter perturbation, enabling a cleaner gradient estimate.
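The distinction can be sketched with a toy decoder. Here toy_logits is a hypothetical stand-in for a real forward pass (not the paper's code); the point is only that temperature 0 removes the sampling randomness:

```python
import numpy as np

VOCAB = 16  # toy vocabulary size (illustrative)

def toy_logits(token):
    # Stand-in for a forward pass: logits are a fixed function of the last token.
    return np.random.default_rng(token).standard_normal(VOCAB)

def decode(start, steps, temperature, sample_seed=None):
    rng = np.random.default_rng(sample_seed)
    tok, out = start, []
    for _ in range(steps):
        logits = toy_logits(tok)
        if temperature == 0:  # greedy: deterministic argmax
            tok = int(np.argmax(logits))
        else:                 # stochastic: sample from the softmax
            p = np.exp(logits / temperature)
            tok = int(rng.choice(VOCAB, p=p / p.sum()))
        out.append(tok)
    return out

# Greedy decoding is reproducible: same parameters -> identical response every run.
assert decode(0, 10, temperature=0) == decode(0, 10, temperature=0)
```

With temperature > 0, the output additionally depends on sample_seed, so a reward difference could no longer be attributed to the parameter perturbation alone.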

Mechanism 5: Z-Score Reward Normalization

Reward normalization is critical for stable training:

Raw rewards (iteration t): [0.0, 0.0, 1.0, 0.0, ..., 1.0]  (binary)
Mean: 0.2, Std: 0.4

Normalized: [-0.5, -0.5, 2.0, -0.5, ..., 2.0]

Effect on update:
  Perturbations that led to correct answers (R̃ > 0) are reinforced
  Perturbations that led to incorrect answers (R̃ < 0) are suppressed
  The magnitude of reinforcement/suppression is proportional to
  how far above/below average the reward is

Without normalization, the update magnitude would depend on the absolute reward scale, which varies across tasks and training stages. Z-score normalization ensures consistent gradient magnitudes, removing a potential source of training instability.
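The normalization step is a few lines. A minimal sketch (the eps guard against an all-identical reward batch is an added assumption, not something the paper specifies):

```python
import numpy as np

def z_score(rewards, eps=1e-8):
    # Center and scale rewards; eps guards the degenerate all-equal batch.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

raw = np.array([0.0, 0.0, 1.0, 0.0, 1.0])  # binary rewards (illustrative batch)
norm = z_score(raw)

assert abs(norm.mean()) < 1e-9                                  # centered at zero
assert (norm[raw == 1.0] > 0).all() and (norm[raw == 0.0] < 0).all()
```

Perturbations that scored above the batch mean receive positive weight in the update, those below receive negative weight, regardless of the task's raw reward scale.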

Mechanism 6: Why Small Populations Work

The most surprising finding is that N=30 suffices for billion-parameter spaces. The paper does not provide a full theoretical explanation, but the results suggest several hypotheses:

  1. Effective dimensionality is much lower than nominal dimensionality. LLM parameter spaces are highly structured — most directions in parameter space have minimal impact on output. The 30 perturbations may explore the effective subspace efficiently.

  2. The pre-trained model provides a strong initialization. ES is not searching from scratch — it is fine-tuning from a pre-trained model that already has the right structure. Small perturbations around this initialization are sufficient to discover improved parameters.

  3. Response-level reward provides a strong signal. For binary rewards (correct/incorrect), even a small population can identify which perturbations are beneficial because the reward signal is clear and unambiguous.

  4. Greedy decoding amplifies perturbation effects. Without sampling noise, even small parameter changes produce detectable output differences, making each perturbation informative.

12 Programming Language

Implementation Stack

Component Language Framework Notes
ES algorithm Python 3.10+ PyTorch Core training loop, noise generation, update
Parallelization Python Hugging Face Accelerate Multi-GPU distribution (original version)
Accelerated inference Python vLLM High-throughput inference engine (accelerated version)
Reward computation Python Custom Task-specific reward functions
Logging Python TensorBoard Training curves and per-iteration stats (accelerated)
Model loading Python Hugging Face Transformers Loading pre-trained LLMs

Code Structure

es-fine-tuning-paper/
├── es_fine-tuning_conciseness.py         # Conciseness task (correlated noise)
├── es_fine-tuning_conciseness_iid.py     # Conciseness task (i.i.d. noise)
├── es_fine-tuning_countdown_accl.py      # Countdown (accelerated, vLLM)
├── countdown/
│   ├── es_fine-tuning_countdown.py       # Countdown task (correlated noise)
│   └── es_fine-tuning_countdown_iid.py   # Countdown task (i.i.d. noise)
└── requirement.txt                        # Dependencies

Dependency Analysis

Key dependencies include:

Package Purpose Version Constraint
torch Tensor operations, noise generation, model manipulation ≥ 2.0
transformers Model loading, tokenization Recent
accelerate Multi-GPU distribution Recent
vllm High-throughput inference (accelerated version) 0.11.0
tensorboard Training visualization (accelerated version) Any

Code Complexity

The implementation is notably compact:

File Estimated LOC Complexity
es_fine-tuning_conciseness.py ~300–500 Medium
countdown/es_fine-tuning_countdown.py ~300–500 Medium
es_fine-tuning_countdown_accl.py ~500–800 Medium–High (vLLM integration)
Total ~1,100–1,800 Medium

This compactness is a strength — the entire ES fine-tuning pipeline fits in a single readable file, making the algorithm transparent and easy to modify.

Why Python?

Python is the obvious choice for several reasons:

  1. PyTorch native. All tensor operations, GPU management, and model manipulation use PyTorch.
  2. Hugging Face ecosystem. Model loading, tokenization, and the Accelerate library are Python-native.
  3. vLLM. The accelerated version uses vLLM, which is a Python library.
  4. Research community. Python is the lingua franca of ML research.

The per-iteration overhead of Python is negligible compared to the GPU computation time for inference and noise generation.

13 Memory Management

GPU Memory Analysis

The paper's most significant practical contribution is its memory efficiency. The analysis below compares ES and RL memory requirements for an 8B-parameter model in bf16 precision:

RL Fine-Tuning (PPO) Memory Breakdown:

Component Size (GB) Notes
Model parameters (bf16) 16 8B × 2 bytes
Gradient buffers (fp32) 32 8B × 4 bytes (full precision gradients)
Adam optimizer states (m, v) 64 2 × 8B × 4 bytes
Activation cache 8–32 Depends on batch size, sequence length
Reference model (KL penalty) 16 Full copy of base model
Total 136–160 Requires 2–4 × A100-80GB

ES Fine-Tuning Memory Breakdown:

Component Size (GB) Notes
Model parameters (bf16) 16 8B × 2 bytes
Layer-sized noise tensor (temp) 0.1–2 Largest single layer; allocated/freed per layer
Random seeds < 0.001 30 integers
Inference KV cache 2–4 For greedy decoding
Total 18–22 Fits on 1 × A100-80GB

Memory reduction: ~7–8×

Memory-Efficient Operations

Three operations dominate the memory management strategy:

1. In-Place Perturbation:

# Perturb model in-place (one layer-sized parameter tensor at a time)
rng = torch.Generator(device=device).manual_seed(seed)
for param in model.parameters():
    noise = torch.randn(param.shape, generator=rng, device=device)
    param.data.add_(noise, alpha=sigma)  # in-place!
    del noise  # free immediately

2. In-Place Restoration:

# Restore model in-place (regenerate same noise)
rng = torch.Generator(device=device).manual_seed(seed)  # SAME seed as perturbation
for param in model.parameters():
    noise = torch.randn(param.shape, generator=rng, device=device)
    param.data.sub_(noise, alpha=sigma)  # subtract = restore
    del noise

3. Decomposed Update:

# Update in-place (layer by layer, seed by seed)
rng = torch.Generator(device=device)
params = list(model.parameters())
for layer_idx, param in enumerate(params):
    for seed, reward in zip(seeds, normalized_rewards):
        rng.manual_seed(seed)
        # advance the generator past the noise of all earlier layers
        for skip in range(layer_idx):
            torch.randn(params[skip].shape, generator=rng, device=device)
        noise = torch.randn(param.shape, generator=rng, device=device)
        param.data.add_(noise, alpha=lr * reward / N)  # in-place!
        del noise

Memory Scaling with Model Size

Parameters RL Memory ES Memory Savings
0.5B ~12 GB ~2 GB ~6×
1.5B ~25 GB ~4 GB ~6×
3B ~48 GB ~8 GB ~6×
7B ~110 GB ~16 GB ~7×
8B ~140 GB ~18 GB ~8×

The memory savings scale better for larger models because RL's gradient and optimizer state overhead grows with parameter count, while ES's overhead remains constant (one layer-sized tensor).
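The two breakdowns above can be folded into a rough estimator. The constants below are back-of-envelope assumptions taken from the tables (bf16 weights, fp32 gradients and Adam moments, a frozen reference model, fixed caches), not measurements:

```python
def memory_gb(params_b, method):
    """Rough GPU memory estimate (GB) for a model with params_b billion
    parameters, following the ES vs. PPO breakdowns above."""
    weights = params_b * 2              # bf16 weights: 2 bytes/param
    if method == "es":
        # weights + one layer-sized noise tensor (~2 GB) + KV cache (~3 GB)
        return weights + 2 + 3
    if method == "rl":
        grads = params_b * 4            # fp32 gradients
        adam = params_b * 8             # two fp32 Adam moment buffers
        ref = params_b * 2              # frozen reference model (KL penalty)
        activations = 16                # assumed mid-range activation cache
        return weights + grads + adam + ref + activations
    raise ValueError(method)

# For an 8B model this reproduces the ~7x gap reported above.
assert 6 < memory_gb(8, "rl") / memory_gb(8, "es") < 8
```

The ratio grows with model size because every RL term scales with the parameter count, while the ES overhead beyond the weights is effectively constant.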

Comparison with Memory-Efficient RL Alternatives

Method Memory (8B model) Fine-Tuning Quality Parameter Coverage
Full RL (PPO/GRPO) ~140 GB Baseline Full parameters
LoRA RL ~30–40 GB Slightly reduced Low-rank adapters only
QLoRA RL ~20–30 GB Reduced Quantized + low-rank
MeZO (zeroth-order) ~20 GB Poor (below baselines) Full parameters
ES (this paper) ~18 GB Exceeds full RL Full parameters

ES achieves the lowest memory footprint while maintaining the highest fine-tuning quality — a combination that no prior method achieved.

14 Continued Learning

Within-Run Learning Dynamics

The ES optimization trajectory exhibits distinctive learning dynamics compared to RL:

RL learning curve (typical):

Accuracy
│        ┌─── plateau / oscillation ───┐
│       /                               \____
│      /                                     \── reward hacking begins
│     /
│    /
│   /  rapid initial improvement
│  /
│ /
└─────────────────────────────────────── Iterations

ES learning curve (typical):

Accuracy
│                         ┌──── continued gradual improvement
│                        /
│                       /
│                      /
│                    /
│                  /
│               /
│            /
│         /
│      /    steady, monotonic improvement
│   /
│  /
│ /
└─────────────────────────────────────── Iterations

ES exhibits slower but more steady improvement without the plateaus, oscillations, or reward hacking that characterize RL fine-tuning. The learning curve is more predictable, enabling better estimation of required compute budgets.

Population Dynamics

Unlike population-based evolutionary systems (AlphaEvolve, OpenEvolve), ES in this paper maintains a single model with a distribution around it:

Iteration t:                  Iteration t+1:
                              (center shifted toward high-reward direction)

  ·   ·                         ·   ·
 · · · ·                       · · · ·
· · θₜ · ·    ────►           · · θₜ₊₁ · ·
 · · · ·                       · · · ·
  ·   ·                         ·   ·

The distribution (cloud) moves through parameter space.
Individual perturbations are transient — only the center persists.

This is fundamentally different from population-based approaches where multiple diverse solutions coexist. The trade-off:

Property ES (Single Center + Distribution) Population-Based (AlphaEvolve)
Diversity Low (Gaussian around one point) High (multiple diverse solutions)
Memory Very low High (N full models)
Exploration Local (radius σ around center) Global (multiple starting points)
Risk of local optima Higher Lower
Implementation complexity Very low High

Cross-Task Transfer

The paper does not investigate cross-task transfer — each experiment starts from a pre-trained model and fine-tunes for a single task. However, the results suggest several transfer possibilities:

  1. Sequential fine-tuning. An ES-tuned model for one task could serve as the starting point for ES fine-tuning on a second task.
  2. Multi-task reward. A composite reward function combining multiple task metrics could enable simultaneous multi-task fine-tuning.
  3. Curriculum learning. Starting with an easy task (high reward signal) and progressively adding harder tasks could improve sample efficiency.

None of these are explored in the paper, representing opportunities for future work.

Continued Improvement Beyond Reported Results

The paper's accelerated version (10× speed-up) suggests that the original results may not represent the limit of ES performance. With 10× faster iterations, significantly more iterations become practical, potentially yielding better final performance.

The paper also notes that standard ES enhancements (mirrored sampling, rank transformation, Adam optimizer) were deliberately excluded. Adding these could further improve performance, and the paper explicitly invites this future work.

15 Applications

Primary Application: LLM Post-Training

The paper positions ES as a general-purpose post-training paradigm for LLMs. Current applications demonstrated:

Application Task Type Reward Structure ES Advantage
Reasoning fine-tuning Symbolic reasoning (Countdown) Binary outcome Long-horizon tolerance, cross-model robustness
Math reasoning GSM8K, MATH500, etc. Binary correctness Competitive with SOTA RL, more stable
Behavioral tuning Conciseness optimization Composite (quality + length) Reward hacking resistance
Puzzle solving Number sequences, logic grids Binary/custom Novel solutions unreachable by base models

Future Applications Suggested by the Paper

  1. RLHF replacement. ES could replace PPO/GRPO in the RLHF pipeline, using human preference reward models but optimizing via ES rather than RL. The reward hacking resistance is particularly valuable for alignment.

  2. Instruction following. Fine-tuning models to follow instructions more precisely, where correctness is binary (followed instruction or didn't).

  3. Code generation. Fine-tuning for code generation tasks where reward is based on test case passage — a naturally long-horizon, binary-outcome task.

  4. Safety alignment. ES's resistance to reward hacking makes it a candidate for safety-critical alignment tasks, where exploiting loopholes in the reward function is a major concern.

  5. Distributed fine-tuning. ES's embarrassingly parallel nature makes it ideal for distributed fine-tuning across multiple machines or even multiple data centers. Only scalar rewards need to be communicated, not gradients.

Implications for the Evolutionary AI Field

This paper has several implications for the broader evolutionary AI landscape:

1. Neuroevolution is back. The paper revives direct parameter-space optimization of neural networks — a field that had been dormant since the early 2010s when gradient-based methods became dominant. By showing that ES works at billion-parameter scale, it reopens research directions that were considered closed.

2. Zeroth-order optimization is viable. The success of ES (a zeroth-order method) at scale challenges the assumption that gradient information is necessary for efficient optimization of LLMs. This opens the door to other zeroth-order methods (CMA-ES, random search, simulated annealing) being applied to LLMs.

3. Backpropagation is not always necessary. For post-training (as opposed to pre-training), backpropagation may not be the optimal optimization strategy. ES's ability to avoid backprop has practical benefits (memory, simplicity) without sacrificing quality.

4. Small populations suffice. The N=30 finding challenges the conventional wisdom in evolutionary computation that population size must scale with problem dimensionality. This has implications for all population-based optimization methods applied to high-dimensional problems.

Relevance to OmniEvolve

This paper is highly relevant to the OmniEvolve project from multiple angles:

OmniEvolve Component Relevance Integration Potential
Search backends ES could be a search backend for parameter-space optimization Implement as ESSearchBackend with layered perturbation
Mutation operators Gaussian perturbation is a principled mutation operator Complement LLM-based mutations with ES-style perturbations
Evaluation Greedy decoding + binary reward is a clean evaluation pattern Adopt for tasks with binary correctness
Memory management Seed-based noise storage is an efficient memory pattern Apply to candidate storage in general
Benchmarks Countdown task is a well-defined benchmark Include in benchmark suite

Limitations

The paper acknowledges several limitations:

  1. Scale ceiling unknown. The largest model tested is 8B parameters. Whether ES remains effective at 70B or 405B is an open question.
  2. Convergence speed. ES is slower per-iteration than RL for some tasks. The total compute may be higher, even if the per-run cost is lower (due to hyperparameter stability).
  3. Exploration radius. With fixed σ, the exploration radius is limited. Adaptive σ (as in CMA-ES) could improve performance but adds complexity.
  4. No theoretical guarantees. The paper provides empirical evidence but no convergence guarantees for ES in billion-parameter spaces. The theoretical analysis of Vemula et al. (2019) would predict poor performance, which is empirically falsified but not theoretically explained.
  5. Limited to post-training. ES is applied to fine-tuning from pre-trained models, not to pre-training from scratch. Whether ES could scale to pre-training is an open question.

Impact Assessment

Dimension Assessment
Scientific novelty Very High — first successful full-parameter ES at billion scale
Practical utility High — cheaper, more stable, no reward hacking
Reproducibility High — full code, public models, fixed hyperparameters
Generality High — works across model families, sizes, and tasks
Theoretical depth Medium — empirical strength, limited theoretical explanation
Community adoption Growing — 340+ stars, ICML acceptance, active discussions
Long-term impact Potentially transformative — could reshape post-training paradigm

Position in the Evolutionary AI Landscape

Parameter Space ─────────────────────────────────────► Action/Code Space
    │                                                        │
    │  ES at Scale              AlphaEvolve / FunSearch       │
    │  (this paper)             (Google DeepMind)             │
    │  ┌───────────┐            ┌─────────────────────┐      │
    │  │ Perturb   │            │ LLM generates code  │      │
    │  │ model     │            │ mutations; evaluator│      │
    │  │ weights   │            │ scores code quality │      │
    │  │ directly  │            │                     │      │
    │  └───────────┘            └─────────────────────┘      │
    │                                                        │
    │  Sakana Model Merging     EvoPrompting                  │
    │  ┌───────────┐            ┌─────────────────────┐      │
    │  │ Evolve    │            │ Evolve prompts /    │      │
    │  │ merging   │            │ prompt templates    │      │
    │  │ weights   │            │                     │      │
    │  └───────────┘            └─────────────────────┘      │
    │                                                        │
Low-level ◄─────────────────────────────────────── High-level
(continuous)                                      (discrete/symbolic)

ES at Scale occupies the lowest-level, most direct position in this landscape — it evolves the raw parameters of the model, without any abstraction layer (code, prompts, merging weights). This directness is both its strength (no information loss through abstraction) and its limitation (exploration is local in parameter space).