Introduced: 2024-12
Score: 8.0/10 (Draft)
Chapter 8

LLM4AD: A Unified Platform for LLM-Based Automatic Algorithm Design

Part P02: General-Purpose Evolutionary Frameworks

8.1 Overview & Motivation

The rapid proliferation of LLM-powered evolutionary systems between 2024 and 2026 produced a fragmentation problem: each published method—EoH, FunSearch, ReEvo, LLaMEA, and others—shipped its own codebase, evaluation harness, LLM integration layer, and logging infrastructure. Researchers wanting to fairly compare these methods faced a significant engineering burden: reimplementing or adapting each system independently, reconciling incompatible APIs, and ensuring that performance differences reflected algorithmic merit rather than implementation artifacts.

LLM4AD (Large Language Models for Automatic Algorithm Design), developed by researchers at City University of Hong Kong and Southern University of Science and Technology, directly addresses this fragmentation. Released under the MIT license and supporting Python 3.9–3.12, LLM4AD integrates seven search methods into a single platform with a shared evaluation engine, unified LLM abstraction layer, and consistent logging infrastructure. The repository is publicly available at github.com/Optima-CityU/llm4ad.

Key Contribution

LLM4AD is the first open-source platform to unify multiple published LLM-based algorithm design methods (EoH, MEoH, FunSearch, ReEvo, MCTS-AHD, (1+1)-EPS, LLaMEA) under a common evaluation and LLM abstraction layer, enabling controlled cross-method comparison on 12+ combinatorial optimization and scientific discovery tasks. Its principal contribution is infrastructure for fair comparison rather than a novel search algorithm.

Repository version and verification scope. This chapter describes LLM4AD as documented in the repository README, published papers (Liu et al., ICML 2024; Liu et al., AAAI 2025), and PyPI package listing available through early 2026. The repository does not use semantic versioning tags or publish a changelog; consequently, no claims in this chapter are pinned to a specific commit or PyPI release. All descriptions of the platform’s public API, class names, constructor parameters, and module paths are drawn from the repository README documentation, not from source-code inspection or execution-validated testing. Where the chapter presents code examples, these are reproduced from README documentation unless otherwise noted. Internal implementation details—class hierarchies, method dispatch, serialization formats, and evaluation engine internals—have not been verified against source files. Readers intending to use the platform should consult the repository at the time of use and verify interfaces against the installed package version.

Throughout this chapter, we use the following evidence-level markers:
  • [README] — reproduced from repository README or documented API reference
  • [paper] — from the original publication of the search method
  • [inferred] — inferred from documentation patterns but not directly stated or code-verified

8.1.1 The Fair Comparison Problem

When different methods use different LLM backends, different evaluation budgets, and different scoring implementations, observed performance gaps are confounded by implementation choices. A method that appears superior may simply benefit from a stronger LLM, more evaluations, or a more favorable scoring function. LLM4AD's central design goal is to eliminate these confounds by providing shared infrastructure where all methods can be evaluated under identical conditions—same LLM, same hardware, same scoring code.

Note, however, that “identical conditions” applies to the infrastructure layer, not to evaluation budgets or LLM call patterns. As discussed in Section 8.6, the seven integrated methods have fundamentally different computational profiles: FunSearch combines an island-based program database with best-shot prompting and requires thousands of evaluations, while (1+1)-EPS uses a single-solution greedy strategy that converges within hundreds. The repository-reported benchmark tables use each method’s natural operating budget rather than a fixed common budget, which makes direct score-based ranking problematic. LLM4AD enables same-infrastructure comparison, but methodologically sound cross-method benchmarking still requires budget normalization, multiple trials, and statistical reporting, none of which appear in the repository’s published documentation.
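To make the budget concern concrete, one option is to compare methods at a common evaluation budget rather than at each method’s natural budget. The sketch below assumes each run is logged as a list of scores in evaluation order, with higher scores being better. The function names, the log format, and the example numbers are illustrative assumptions and are not part of LLM4AD.

```python
from statistics import mean, stdev

def best_at_budget(scores, budget):
    """Best score seen within the first `budget` evaluations of one run."""
    if budget < 1 or len(scores) < budget:
        raise ValueError("run is shorter than the requested budget")
    return max(scores[:budget])

def summarize(runs, budget):
    """Mean and sample standard deviation of best-at-budget across trials."""
    bests = [best_at_budget(run, budget) for run in runs]
    spread = stdev(bests) if len(bests) > 1 else 0.0
    return mean(bests), spread

# Hypothetical logs: three trials per method, scores in evaluation order.
method_a = [[0.2, 0.5, 0.4, 0.9], [0.1, 0.3, 0.6, 0.7], [0.4, 0.4, 0.5, 0.8]]
method_b = [[0.6, 0.6], [0.5, 0.7], [0.3, 0.6]]

common_budget = 2  # the largest budget that every run reaches
summary_a = summarize(method_a, common_budget)
summary_b = summarize(method_b, common_budget)
print(f"A: mean={summary_a[0]:.3f} sd={summary_a[1]:.3f}")
print(f"B: mean={summary_b[0]:.3f} sd={summary_b[1]:.3f}")
```

Truncating every run to the same budget discards information from longer runs, so a fuller analysis would report several budgets rather than one.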

8.1.2 Design Philosophy

The platform is organized around four principles stated in its repository README [README]:

  • Modular & Extensible: Search methods, tasks, and LLM backends are described as pluggable modules. Adding a new search method requires implementing a single interface; adding a new task requires an evaluation function and problem specification.
  • Reproducible: All methods share the same evaluation engine, described as ensuring identical process isolation, timing, and scoring. Checkpoint/resume support is documented for continuing interrupted runs.
  • Production-Ready: Multiprocessing with configurable worker counts, per-evaluation timeouts, and integration with Weights & Biases (W&B) and TensorBoard for experiment tracking are documented.
  • Accessible: A web-based GUI mode is documented for interactive experimentation without writing code.

[README] These principles are stated in documentation. The extent to which each is fully realized in the implementation (e.g., whether checkpoint/resume works reliably across all seven methods, or whether the GUI exposes all documented features) has not been independently tested or source-verified.

8.2 Architecture

LLM4AD’s architecture follows a layered design with four primary components, as described in the repository documentation: a user interface layer (CLI, Python API, and GUI), an orchestrator that coordinates search methods with tasks and LLM backends, a shared evaluation engine with process isolation, and a logging/checkpointing subsystem. The figure below summarizes these components and their relationships.

Figure: LLM4AD high-level architecture [README-documented]. Diagram reconstructed from repository README documentation at github.com/Optima-CityU/llm4ad.

  • User interface layer: CLI, Python API, GUI (Web), YAML / programmatic config
  • LLM4AD orchestrator, Search Methods (7): EoH (ICML 2024), MEoH (AAAI 2025), FunSearch (Nature 2024), ReEvo (NeurIPS 2024), MCTS-AHD / EPS / LLaMEA
  • LLM4AD orchestrator, Task Registry (12+): Bin Packing, TSP, Knapsack; Facility Location, QAP; Scheduling, BO Acquisition; RL Environments; Circle Packing, Cap Sets
  • LLM4AD orchestrator, LLM Backend Manager: OpenAI (GPT-4o, o1, o3); Google (Gemini Pro/Flash); DeepSeek (V3, R1); Anthropic (Claude); Local: vLLM, Ollama
  • Shared evaluation engine: Multiprocessing Pool, Timeout Guard, Content-Hash Cache, Score Collection, Process Isolation
  • Logging, visualization & checkpointing: W&B Integration, TensorBoard, CSV Export, Checkpoints, GUI Dashboard

All search methods share the same evaluation engine, ensuring identical timing, scoring, and process isolation.

8.2.1 Orchestrator and Documented API

The repository README describes a central orchestrator class that accepts a search method, task, and LLM backend. The following code example is reproduced directly from the README’s “Quick Start: Bin Packing with EoH” section [README]:

# Reproduced from repository README — "Quick Start: Bin Packing with EoH"
# Source: github.com/Optima-CityU/llm4ad README, "Quick Start" section
# STATUS: README-documented example, NOT execution-validated
from llm4ad import LLM4AD, EoH, BinPacking
from llm4ad.llm import OpenAIBackend

llm = OpenAIBackend(model="gpt-4o", api_key="sk-...")

task = BinPacking(
    instance="benchmark/bpp_500_items",
    metric="waste_ratio",
    direction="minimize",
)

method = EoH(
    population_size=20,
    num_generations=50,
    crossover_rate=0.3,
    mutation_rate=0.7,
    elite_size=3,
)

runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=4,
    timeout_per_eval=30,
    log_to="wandb",
    project_name="llm4ad-bpp",
)

result = runner.run()
print(f"Best waste ratio: {result.best_score:.4f}")
print(f"Best algorithm:\n{result.best_code}")
API verification status: README-documented only. This code is reproduced verbatim from the repository README. Whether the exact import paths (from llm4ad import LLM4AD, EoH, BinPacking), constructor parameter names (population_size, crossover_rate, etc.), and return-object attributes (result.best_score, result.best_code) match the installed package at any given commit has not been verified against source code. The README does not specify whether the API is considered stable, and no __all__ export surface or formal versioning policy has been identified. The EoH constructor in the README’s quick-start section shows five parameters; a separate README section (see §8.3.1) shows additional parameters (thought_mutation_temp, code_mutation_temp, tournament_size) that may or may not be present in all versions.

The documentation also describes a resume() class method for checkpoint recovery [README] and a GUI launch mode via llm4ad gui --port 8080 or the Python function llm4ad.gui.launch(port=8080) [README]. Whether these entry points exist in the installed package, function as described, or have additional undocumented requirements has not been independently tested.

8.2.2 Documented Plugin Architecture

The platform uses a registry pattern for its three extension points, as described in the repository documentation [README]. Based on the documented API reference sections:

  • Search methods implement a common interface. The README lists seven built-in methods importable from llm4ad.methods: EoH, MEoH, FunSearch, ReEvo, MCTSAHD, EPS, and LLaMEA [README].
  • Tasks are defined with an evaluation function and problem specification. The documented Task base class includes fields for name, instance, metric, direction, function_signature, imports, and timeout. The documented interface provides evaluate(code: str) -> EvalResult and get_prompt_context() -> str methods [README].
  • LLM backends implement an LLMBackend interface with generate(prompt: str, **kwargs) -> LLMResponse and get_model_info() -> dict methods [README].

This design means that adding a new search method, task, or backend is documented as requiring exactly one interface implementation without modifying platform core code. However, the exact internal class hierarchy, abstract base classes, registration mechanism, and whether all documented interfaces are enforced at runtime have not been verified against source code—the description above is based entirely on the documented API reference in the README.
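The registry pattern described above can be illustrated with a small self-contained sketch. The class and method names Task, LLMBackend, EvalResult, LLMResponse, evaluate, get_prompt_context, generate, and get_model_info follow the README-documented interface. Everything else here is a hypothetical stand-in: the dataclass fields, the BACKENDS dictionary, the register_backend decorator, and EchoBackend. None of it is taken from the LLM4AD source.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional, Type

@dataclass
class EvalResult:
    score: Optional[float]        # hypothetical field: None marks a failed evaluation
    error: Optional[str] = None   # hypothetical field

@dataclass
class LLMResponse:
    text: str                     # hypothetical field

class Task(ABC):
    @abstractmethod
    def evaluate(self, code: str) -> EvalResult: ...

    @abstractmethod
    def get_prompt_context(self) -> str: ...

class LLMBackend(ABC):
    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> LLMResponse: ...

    @abstractmethod
    def get_model_info(self) -> dict: ...

# Hypothetical registry: maps a short name to a backend class.
BACKENDS: Dict[str, Type[LLMBackend]] = {}

def register_backend(name: str):
    """Class decorator that records a backend under `name`."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator

@register_backend("echo")
class EchoBackend(LLMBackend):
    """Toy backend that returns the prompt unchanged (useful for wiring tests)."""
    def generate(self, prompt: str, **kwargs) -> LLMResponse:
        return LLMResponse(text=prompt)

    def get_model_info(self) -> dict:
        return {"name": "echo"}

backend = BACKENDS["echo"]()
print(backend.generate("def priority(item, bins): ...").text)
```

Using abstract base classes means an incomplete backend fails at instantiation time rather than in the middle of a search run. Whether LLM4AD enforces its interfaces this way is not documented.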

8.2.3 Evaluation Engine

A critical architectural decision is that all search methods share the same evaluation engine. Based on the repository documentation [README], this engine provides:

  • Process isolation: Each candidate evaluation runs in a separate process via Python’s multiprocessing module. This is documented as preventing infinite loops or memory leaks in generated code from affecting the main orchestrator. This is process-level isolation, not container or VM sandboxing—generated code runs with the same OS-level permissions as the parent process (see Section 8.8.4).
  • Timeout enforcement: Each evaluation has a configurable timeout. The README examples use 30–120 seconds depending on task complexity [README].
  • Content-hash caching: Evaluation results are documented as cached by a hash of the generated code. The hashing algorithm and cache eviction policy are not specified [inferred from README architecture description].
  • Configurable parallelism: The max_workers parameter controls the number of concurrent evaluation processes, with documented example values of 4–16 [README].

The shared evaluation engine is what makes cross-method comparison meaningful at the infrastructure level: when EoH and ReEvo are compared on the same bin-packing task, they use identical scoring code, identical timeout settings, and identical process isolation. The difference in observed scores can therefore be attributed to the search method and its budget rather than evaluation infrastructure—assuming the evaluation engine’s implementation matches its documentation.

8.3 Integrated Search Methods

LLM4AD integrates seven search methods published at top venues between 2024 and 2025. This section describes each method in two parts: first, the core mechanism as described in the original publication, and second, how the method is exposed within LLM4AD. This separation is critical because LLM4AD reimplements each method against its own interfaces; differences between the LLM4AD version and the original codebase may exist in prompt construction, selection logic, hyperparameter defaults, or code parsing.

8.3.0 Method Implementation Fidelity

The following table summarizes the provenance, known deviations, and verification status of each integrated method. The “Verified aspect” column identifies at least one concrete implementation detail that could be checked against both the original paper/repo and LLM4AD’s documentation. The “Status” column indicates whether that detail appears to be faithfully reproduced (same), intentionally or observably different (modified), or not determinable from public materials (unknown).

Method | Source Paper | Original Repo | Verified Aspect | Status | Evidence
EoH | Liu et al., ICML 2024 | github.com/FeiLiu36/EoH | Operator set (e1, e2, c1, c2) | Same (conceptual) | README lists four operators matching paper §3 [README, paper]. LLM4AD adds separate thought_mutation_temp/code_mutation_temp parameters not in original paper (modified).
MEoH | Liu et al., AAAI 2025 | Same group as EoH | Pareto selection + crowding distance | Same (conceptual) | README documents non-dominated sorting and crowding distance consistent with AAAI 2025 paper [README, paper]. Whether objective-aware mutation matches paper’s formulation is unknown.
FunSearch | Romera-Paredes et al., Nature 2024 | github.com/google-deepmind/funsearch | LLM backend | Modified | Original uses PaLM 2; LLM4AD substitutes configurable backend [README, paper]. Island structure conceptually preserved, but LLM4AD adds migration_interval parameter not prominent in original paper (modified). Score-tiered database internals: unknown.
ReEvo | Ye et al., NeurIPS 2024 | github.com/ai4co/reevo | Reflection loop structure | Same (conceptual) | README documents generate–evaluate–reflect–refine cycle matching paper §3 [README, paper]. Whether short_term_memory/long_term_memory parameters map to paper’s memory architecture: unknown.
MCTS-AHD | ICML 2025 | Not identified | Backpropagation strategy | Unknown | LLM4AD exposes backprop_strategy="max" [README]; whether this matches the original paper’s recommendation cannot be verified without access to the original implementation.
(1+1)-EPS | PPSN 2024 | Not identified | Core (1+1) greedy acceptance | Same (conceptual) | Single-solution hill climbing with greedy acceptance is simple enough that faithful reimplementation is likely [README, paper]. Restart mechanism details: unknown.
LLaMEA | van Stein & Bäck, IEEE TEVC 2025 | github.com/nikivanstein/LLaMEA | Population strategy | Modified | Original paper uses a (1+1) strategy [paper]; LLM4AD exposes population_size=10 [README], suggesting a population-based extension. mutation_strategy="component_swap" is not described in the original paper (modified).
Interpreting this table. “Same (conceptual)” means the high-level algorithmic mechanism documented in LLM4AD’s README matches the original paper’s description, but the actual implementation has not been compared at the code level. Prompt templates, hyperparameter defaults, parsing logic, and error handling may differ in ways that affect search behavior. “Modified” indicates a documented or observable deviation. “Unknown” means the public materials are insufficient to determine agreement.

This fidelity gap is the single most important caveat for interpreting LLM4AD’s cross-method benchmark results. If the reimplemented version of, say, FunSearch behaves differently from DeepMind’s original due to differences in prompt engineering, sampling strategy, or island management, then LLM4AD’s “FunSearch” results reflect the platform’s reimplementation rather than the published method. No method in LLM4AD’s documentation includes a fidelity comparison against the original codebase.

8.3.1 EoH — Evolution of Heuristics (ICML 2024)

Original Method (Liu et al., ICML 2024)

EoH’s distinguishing feature is thought-code co-evolution: each individual in the population is a pair $(t_i, c_i)$ where $t_i$ is a natural-language description of an algorithmic idea (the “thought”) and $c_i$ is its executable Python implementation [paper]. Variation can act on either representation: in thought mutation the LLM first rewrites the thought and then regenerates the code to match it, whereas code mutation keeps the thought fixed and edits only the code. This dual representation allows the LLM to reason at the level of algorithmic concepts while grounding that reasoning in executable code.

EoH defines four evolutionary operators [paper, §3]:

Operator | Type | Input | Output
e1 | Thought mutation | Parent thought $t_i$ + task description $\tau$ | New thought $t'$ + new code $c'$
e2 | Code mutation | Parent thought $t_i$ + parent code $c_i$ | Same thought $t_i$ + modified code $c'$
c1 | Thought crossover | Two parent thoughts $t_i, t_j$ | Merged thought $t'$ + new code $c'$
c2 | Code crossover | Two parent pairs $(t_i, c_i), (t_j, c_j)$ | Combined thought $t'$ + combined code $c'$

Selection uses tournament selection with elitism [paper]. Given a population $P = \{(t_1, c_1), \ldots, (t_N, c_N)\}$ with fitness values $f_i = \text{evaluate}(c_i)$, the top-$k$ individuals (the “elite”) are preserved directly into the next generation, while the remaining slots are filled by tournament selection of size $s$.
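The selection scheme just described can be sketched as follows. This is a minimal illustration, assuming higher fitness is better and that tournaments are drawn without replacement. The function name and return shape are hypothetical and do not reflect the EoH or LLM4AD code.

```python
import random

def select_next_generation(population, fitness, elite_size, tournament_size, rng):
    """Return (elites, parents).

    elites:  the top `elite_size` individuals, copied unchanged.
    parents: one tournament winner per remaining slot; these are the
             individuals that variation operators would then act on.
    """
    n = len(population)
    ranked = sorted(range(n), key=lambda i: fitness[i], reverse=True)
    elites = [population[i] for i in ranked[:elite_size]]

    parents = []
    for _ in range(n - elite_size):
        contenders = rng.sample(range(n), tournament_size)  # without replacement
        winner = max(contenders, key=lambda i: fitness[i])
        parents.append(population[winner])
    return elites, parents

rng = random.Random(0)
population = ["h1", "h2", "h3", "h4", "h5", "h6"]
fitness = [0.61, 0.90, 0.45, 0.72, 0.88, 0.30]
elites, parents = select_next_generation(population, fitness, 2, 3, rng)
print("elites:", elites)
print("parents:", parents)
```

For a minimization task such as waste ratio, the same sketch applies after negating the scores.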

Tournament selection probability. Consider a tournament of size $s$ drawn uniformly at random without replacement from a population of $N$ individuals. Let individuals be ranked by fitness, with rank $r = 1$ denoting the best. The individual with rank $r$ wins the tournament if and only if (i) it is included in the tournament and (ii) no individual with a better rank (i.e., rank $< r$) is also included. The number of size-$s$ subsets satisfying both conditions is $\binom{N - r}{s - 1}$: we must include the rank-$r$ individual and choose the remaining $s - 1$ members from the $N - r$ individuals with rank $> r$. Dividing by the total number of possible tournaments gives:

$$P(\text{rank } r \text{ wins}) = \frac{\binom{N-r}{s-1}}{\binom{N}{s}}$$

where $N$ is the population size, $s$ is the tournament size, and the convention $\binom{n}{k} = 0$ for $k > n$ applies. We can verify two boundary cases:

  • Best individual ($r = 1$): $P = \binom{N-1}{s-1} / \binom{N}{s} = s/N$, consistent with the intuition that the best individual wins whenever it appears in the tournament, which occurs with probability $s/N$.
  • Worst individual ($r = N$, with $s \geq 2$): $P = \binom{0}{s-1} / \binom{N}{s} = 0$, since the worst-ranked individual cannot win any tournament that includes at least one other individual.

The probabilities sum to 1, which can be verified via the hockey-stick identity: $\sum_{r=1}^{N} \binom{N-r}{s-1} = \sum_{j=0}^{N-1}\binom{j}{s-1} = \binom{N}{s}$. Larger tournament sizes increase selection pressure by concentrating probability mass on higher-ranked individuals.
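The closed form above is easy to check numerically. The short script below uses exact rational arithmetic to evaluate the win probability for every rank, with $N = 20$ and $s = 3$ chosen to match the README’s EoH example values; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def win_probability(rank, pop_size, tour_size):
    """P(rank wins) = C(N - r, s - 1) / C(N, s); math.comb returns 0 when k > n."""
    return Fraction(comb(pop_size - rank, tour_size - 1), comb(pop_size, tour_size))

N, s = 20, 3
probs = [win_probability(r, N, s) for r in range(1, N + 1)]

print("best  :", probs[0])     # equals s / N
print("worst :", probs[-1])    # zero whenever s >= 2
print("total :", sum(probs))   # hockey-stick identity gives exactly 1
```

Because the values are exact fractions, the sum-to-one check is not subject to floating-point rounding.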

LLM4AD Exposure

The LLM4AD README shows EoH instantiated with the following parameters [README]:

# Reproduced from LLM4AD README — EoH configuration example
# STATUS: README-documented, not execution-validated
method = EoH(
    population_size=20,
    num_generations=50,
    crossover_rate=0.3,
    mutation_rate=0.7,
    elite_size=3,
    thought_mutation_temp=0.9,
    code_mutation_temp=0.7,
    tournament_size=3,
)

[README] Whether these parameter names exactly match the constructor signature at the code level has not been verified. The separation of thought_mutation_temp and code_mutation_temp is a platform-specific exposure not present in the original EoH paper, which does not discuss separate temperature parameters for thought vs. code mutation. The quick-start section of the README shows a simpler five-parameter variant; this extended version appears in a separate configuration reference section.

8.3.2 MEoH — Multi-Objective EoH (AAAI 2025)

Original Method (Liu et al., AAAI 2025)

MEoH extends EoH to multi-objective optimization by replacing scalar fitness with a vector of objectives and using Pareto-based selection [paper]. Each candidate is evaluated on $m$ objectives $\mathbf{f}(c_i) = (f_1(c_i), \ldots, f_m(c_i))$. Selection uses non-dominated sorting: candidates are partitioned into Pareto fronts $F_1, F_2, \ldots$ where $F_1$ contains all non-dominated solutions, $F_2$ contains solutions dominated only by $F_1$, and so on.
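The front partition described above can be sketched as repeated peeling of non-dominated points. This is a simple quadratic-time illustration that assumes all objectives are maximized. It is not the fast non-dominated sort from NSGA-II and is not taken from the MEoH or LLM4AD code; the function names are illustrative.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Partition point indices into fronts F1, F2, ... (all objectives maximized)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing still remaining dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

points = [(1, 5), (2, 4), (3, 3), (1, 1), (2, 2), (0, 0)]
fronts = non_dominated_sort(points)
print(fronts)
```

Here the first three points trade off against one another and form $F_1$, while $(2, 2)$ is dominated only by members of $F_1$ and so forms $F_2$.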

Within each front, crowding distance is used to maintain diversity. For a solution $i$ in front $F_k$, the crowding distance is defined as [standard NSGA-II formulation, Deb et al., 2002]:

$$d_{\text{crowd}}(i) = \sum_{j=1}^{m} \frac{f_j(i_{+}) - f_j(i_{-})}{f_j^{\max} - f_j^{\min}}$$

where $i_{+}$ and $i_{-}$ are the nearest neighbors of solution $i$ along objective $j$ (when solutions within the same front are sorted by $f_j$), $f_j^{\max}$ and $f_j^{\min}$ are the maximum and minimum values of objective $j$ in the current population, and $m$ is the number of objectives. Boundary solutions (those lacking a neighbor on one side along at least one objective) are assigned infinite crowding distance so that the extremes of the front are preserved. When objective $j$ has zero range (i.e., $f_j^{\max} = f_j^{\min}$), the ratio is undefined, and the usual convention is to set that objective’s contribution to zero.
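A direct transcription of the crowding-distance formula is shown below as a minimal sketch. For self-containment it normalizes by the objective range within the front passed in, which is a simplifying assumption; supplying population-wide bounds would be a small change. The function name is illustrative and the code is not taken from MEoH or LLM4AD.

```python
import math

def crowding_distance(front):
    """front: list of objective tuples from one Pareto front.

    Returns one distance per solution, in the same order as `front`.
    """
    n = len(front)
    if n == 0:
        return []
    m = len(front[0])
    dist = [0.0] * n
    for j in range(m):
        order = sorted(range(n), key=lambda i: front[i][j])
        f_min, f_max = front[order[0]][j], front[order[-1]][j]
        if f_max == f_min:
            continue  # zero range: this objective contributes nothing
        # Boundary solutions along this objective are always kept.
        dist[order[0]] = dist[order[-1]] = math.inf
        for pos in range(1, n - 1):
            i = order[pos]
            gap = front[order[pos + 1]][j] - front[order[pos - 1]][j]
            dist[i] += gap / (f_max - f_min)
    return dist

front = [(0, 10), (1, 6), (4, 3), (10, 0)]
distances = crowding_distance(front)
print(distances)
```

Interior solutions with larger distances sit in sparser regions of the front and are preferred when a front must be truncated.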

Note on objective direction: when objectives have mixed directions (some maximized, some minimized), the standard approach is to negate minimization objectives so that all objectives are maximized before computing dominance, crowding distance, and hypervolume. Whether LLM4AD’s implementation of MEoH performs this transformation internally when the user specifies directions=["maximize", "minimize"] is not documented.

Progress is tracked via the hypervolume indicator relative to a user-specified reference point $\mathbf{r} \in \mathbb{R}^m$ [paper]:

$$\text{HV}(P, \mathbf{r}) = \lambda\!\left(\bigcup_{i \in P_{\text{nd}}} [\mathbf{f}(c_i), \mathbf{r}]\right)$$

where $P_{\text{nd}} \subseteq P$ is the set of non-dominated solutions, $\lambda(\cdot)$ denotes the Lebesgue measure (volume), and $[\mathbf{a}, \mathbf{b}]$ denotes the axis-aligned hyperrectangle with corners $\mathbf{a}$ and $\mathbf{b}$. The reference point $\mathbf{r}$ must be dominated by all solutions in $P_{\text{nd}}$ for the hypervolume to be well-defined and positive. A key MEoH-specific feature is objective-aware mutation: the LLM prompt includes information about which objectives are lagging, directing the mutation toward improving underperforming dimensions [paper].
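For two objectives the hypervolume reduces to the area of a staircase, which can be computed with a single sweep. The sketch below assumes both objectives are maximized, with any minimized objective negated beforehand as described in the note on objective direction. The function name is illustrative, and higher-dimensional hypervolume needs a more involved algorithm than this one.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` relative to `ref`, both objectives maximized."""
    # Points that do not strictly dominate the reference contribute no area.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from largest f1 to smallest; each point that improves on the best
    # f2 seen so far adds a rectangular strip.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 > best_f2:
            area += (f1 - ref[0]) * (f2 - best_f2)
            best_f2 = f2
    return area

# Three trade-off points plus one dominated point, reference at the origin.
hv = hypervolume_2d([(1, 5), (2, 4), (3, 3), (1, 1)], ref=(0, 0))

# Mixed directions: maximize quality, minimize runtime. Negating runtime turns
# the documented reference point [0.0, 1000.0] into (0.0, -1000.0).
hv_mixed = hypervolume_2d([(0.8, -200.0)], ref=(0.0, -1000.0))
print(hv, hv_mixed)
```

Dominated points such as $(1, 1)$ add nothing, which is why the definition can be restricted to $P_{\text{nd}}$ without changing the value.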

LLM4AD Exposure

# Reproduced from LLM4AD README — MEoH configuration example
# STATUS: README-documented, not execution-validated
method = MEoH(
    population_size=30,
    num_generations=100,
    objectives=["quality", "runtime"],
    directions=["maximize", "minimize"],
    crossover_rate=0.3,
    mutation_rate=0.7,
    reference_point=[0.0, 1000.0],
    objective_focus_strategy="lagging",
)

[README] Whether the objective_focus_strategy="lagging" parameter directly corresponds to the objective-aware mutation described in the AAAI 2025 paper, or represents a platform-specific simplification, is not documented.

8.3.3 FunSearch (Nature 2024)

Original Method (Romera-Paredes et al., Nature 2024)

FunSearch, originally developed by DeepMind and published in Nature, combines best-shot prompting with an island model [paper]. Rather than applying explicit crossover and mutation operators, FunSearch repeatedly prompts the LLM with high-scoring programs, generates many candidates in parallel, and stores those that pass evaluation. The key architectural components are:

  • Programs Database: An island-structured store organized by score tiers. Each island maintains its own population of candidates grouped into score buckets [paper, Extended Data Fig. 1].
  • Sampler Pool: Multiple LLM samplers generate candidates in parallel using few-shot prompts constructed from the top-$k$ candidates of a randomly selected island [paper].
  • Island Reset: Periodically, the islands whose best programs score lowest (the worse half in the original paper) are cleared and reseeded with the best program from a surviving island, which keeps the database from stagnating in local optima [paper].

The core loop is a generate–evaluate–store cycle. At each step, the system selects a random island, constructs a prompt from the island’s top-$k$ programs, generates multiple candidate programs from the LLM, evaluates each candidate, and inserts passing candidates into the originating island’s score-tiered database.

Algorithm Pseudocode

# Algorithm: FunSearch core loop
# PSEUDOCODE based on the Nature 2024 paper description (Romera-Paredes et al.)
# NOT from LLM4AD source code; simplified for exposition.
# Islands, sampler_pool, evaluator, and build_prompt are duck-typed placeholders.
import random

def funsearch_loop(islands, sampler_pool, evaluator, build_prompt, config):
    """
    Pseudocode for FunSearch's iterative sample-evaluate-store cycle.
    Each island maintains a score-tiered programs database.
    One step issues one prompt and spends samples_per_prompt evaluations.
    """
    evaluations = 0
    step = 0
    while evaluations < config.max_evaluations:
        step += 1

        # 1. Select a random island
        island = random.choice(islands)

        # 2. Sample top-k programs from the island as few-shot exemplars
        exemplars = island.sample_top_k(k=config.top_k_for_prompt)

        # 3. Construct prompt and generate candidates
        prompt = build_prompt(exemplars, config.task_spec)
        candidates = sampler_pool.generate(prompt, n=config.samples_per_prompt)

        # 4. Evaluate and insert into the originating island
        for candidate in candidates:
            score = evaluator.evaluate(candidate)
            evaluations += 1
            if score is not None:          # passed syntax + timeout checks
                island.insert(candidate, score)

        # 5. Periodic reset (never at the first step): clear the worse half of
        #    the islands and reseed each from a randomly chosen survivor's best
        if step % config.reset_period == 0:
            ranked = sorted(islands, key=lambda i: i.best_score())
            half = len(ranked) // 2
            losers, survivors = ranked[:half], ranked[half:]
            for loser in losers:
                donor = random.choice(survivors)
                loser.reset(seed=donor.best_program())

LLM4AD Exposure

# Reproduced from LLM4AD README — FunSearch configuration example
# STATUS: README-documented, not execution-validated
method = FunSearch(
    num_islands=10,
    programs_per_island=50,
    num_samplers=4,
    samples_per_prompt=4,
    temperature=1.0,
    top_k_for_prompt=2,
    reset_period=100,
    migration_interval=50,
)

[README] The original FunSearch paper uses PaLM 2 as its LLM; LLM4AD substitutes whatever backend is configured (known modification). The migration_interval parameter is documented in LLM4AD but is not a prominent feature of the original FunSearch paper, suggesting this may be a platform-specific addition (possible modification). The original paper’s programs database uses score-based clustering within islands; whether LLM4AD reproduces this exact internal structure is unknown.

8.3.4 ReEvo — Reflective Evolution (NeurIPS 2024)

Original Method (Ye et al., NeurIPS 2024)

ReEvo introduces a reflection mechanism where the LLM not only generates candidate mutations but also analyzes evaluation results to understand why a candidate succeeded or failed [paper]. This self-reflective capability produces more targeted mutations than blind perturbation.

The reflection loop proceeds in four stages per mutation cycle [paper, §3]:

  1. Generate: The LLM produces a candidate heuristic based on the current parent and task description.
  2. Evaluate: The candidate is scored by the evaluation engine, which returns both an aggregate score and per-instance breakdowns.
  3. Reflect: The LLM receives the candidate code, its score, and per-instance breakdown, then produces a structured analysis identifying strengths, weaknesses, and specific failure modes.
  4. Refine: The LLM uses the reflection to propose a targeted improvement, which becomes the next candidate.

ReEvo maintains both short-term memory (recent reflections kept in the LLM context window) and long-term memory (an archive of past reflections that can be retrieved for relevant context) [paper]. The depth of the reflect–refine inner loop is configurable, allowing deeper self-analysis at the cost of additional LLM calls.
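The four-stage loop and the two memories can be sketched with plain callables standing in for the LLM. This is a control-flow illustration only. The function signature, the toy callables, and the way memory is passed to the generator are assumptions for exposition; the actual prompt structure and retrieval logic in ReEvo and LLM4AD are not reproduced here.

```python
from collections import deque

def reflective_search(initial_code, evaluate, generate, reflect, refine,
                      iterations=3, reflection_depth=2, short_term_size=5):
    """Generate-evaluate-reflect-refine loop; higher scores are better."""
    short_term = deque(maxlen=short_term_size)  # recent reflections only
    long_term = []                              # archive of all reflections
    best_code, best_score = initial_code, evaluate(initial_code)
    for _ in range(iterations):
        candidate = generate(best_code, list(short_term))   # 1. Generate
        score = evaluate(candidate)                         # 2. Evaluate
        if score > best_score:
            best_code, best_score = candidate, score
        for _ in range(reflection_depth):
            note = reflect(candidate, score)                # 3. Reflect
            short_term.append(note)
            long_term.append(note)
            candidate = refine(candidate, note)             # 4. Refine
            score = evaluate(candidate)
            if score > best_score:
                best_code, best_score = candidate, score
    return best_code, best_score, long_term

# Toy stand-ins: "code" is an integer string and the target value is 10.
evaluate = lambda code: -abs(int(code) - 10)
generate = lambda parent, memory: str(int(parent) + 1)
reflect = lambda code, score: "too low" if int(code) < 10 else "too high"
refine = lambda code, note: str(int(code) + (1 if note == "too low" else -1))

best_code, best_score, notes = reflective_search("0", evaluate, generate, reflect, refine)
print(best_code, best_score, len(notes))
```

Each unit of reflection depth costs one extra reflection call and one extra refinement call per iteration, which is the trade-off noted above.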

LLM4AD Exposure

# Reproduced from LLM4AD README — ReEvo configuration example
# STATUS: README-documented, not execution-validated
method = ReEvo(
    population_size=15,
    num_generations=60,
    reflection_depth=2,
    short_term_memory=5,
    long_term_memory=20,
    reflection_temperature=0.7,
    mutation_temperature=0.8,
)

[README] The original ReEvo paper (Ye et al., NeurIPS 2024) describes the reflection mechanism in terms of prompt structure and memory management. Whether LLM4AD’s short_term_memory and long_term_memory integer parameters directly correspond to the paper’s memory architecture (which involves retrieval-based selection from an archive), or represent a simplified windowed approximation, is unknown.

8.3.5 MCTS-AHD — Monte Carlo Tree Search for Algorithm Design (ICML 2025)

Original Method

MCTS-AHD reframes algorithm design as a sequential decision problem [paper]. Rather than evolving complete algorithms, it constructs them step by step, where each decision point represents a choice of algorithmic component (data structure, operator, control-flow pattern). The search tree is explored using Upper Confidence Bounds applied to Trees (UCT):

$$\text{UCB}(s, a) = \bar{Q}(s, a) + c \cdot \sqrt{\frac{\ln N(s)}{N(s, a)}}$$

where:

  • $\bar{Q}(s, a) = \frac{1}{N(s,a)} \sum_{k=1}^{N(s,a)} r_k(s,a)$ is the mean reward observed after taking action $a$ in state $s$, with $r_k$ being the reward from the $k$-th simulation that passed through $(s,a)$;
  • $N(s) = \sum_{a'} N(s, a')$ is the total visit count for state $s$;
  • $N(s, a)$ is the number of times action $a$ was selected in state $s$;
  • $c > 0$ is the exploration constant. The classic theoretical value is $c = \sqrt{2}$ for rewards bounded in $[0, 1]$ (Kocsis & Szepesvári, 2006); the LLM4AD documentation uses $c = \sqrt{2} \approx 1.414$ [README].

The exploration term $\sqrt{\ln N(s) / N(s, a)}$ grows for under-explored actions, balancing exploitation of known-good component choices against exploration of novel ones.
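The UCB rule can be evaluated directly from per-action statistics. The sketch below is a minimal illustration; giving unvisited actions a score of infinity so that each is tried once is a common convention assumed here, not something stated in the LLM4AD documentation, and the action names are made up.

```python
import math

def ucb_score(q_mean, n_parent, n_child, c=math.sqrt(2)):
    """UCB(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a))."""
    if n_child == 0:
        return math.inf  # assumed convention: try every action at least once
    return q_mean + c * math.sqrt(math.log(n_parent) / n_child)

def select_action(stats, c=math.sqrt(2)):
    """stats maps action -> (mean reward, visit count); returns the UCB-maximizing action."""
    n_parent = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda a: ucb_score(stats[a][0], n_parent, stats[a][1], c))

# Hypothetical statistics at one node: a well-explored action and a rarely tried one.
stats = {"greedy_insert": (0.60, 50), "swap_repair": (0.55, 5)}
print(select_action(stats))          # exploration bonus favors the rarely tried action
print(select_action(stats, c=0.0))   # pure exploitation picks the higher mean
```

With $c = \sqrt{2}$ the rarely tried action wins despite its lower mean; with $c = 0$ the rule collapses to greedy selection.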

On UCT convergence. In the classical setting of finite game trees with bounded rewards in $[0,1]$, Kocsis & Szepesvári (2006) proved that UCT converges to the optimal action as $N(s) \to \infty$. However, the algorithm design setting differs from this classical setting in important ways: (1) the action space at each node is not finite in the standard sense—LLM-generated components form a practically unbounded set; (2) rewards are stochastic and non-stationary, since the same partial algorithm completed by different LLM rollouts yields different scores; (3) tree depth is problem-dependent and may not be bounded a priori. For these reasons, the classical convergence guarantee does not directly apply. UCT should be understood here as a practical heuristic for balancing exploration and exploitation in the component-selection tree, not as carrying formal optimality guarantees for the algorithm-design problem.

The four MCTS phases apply to algorithm construction as follows [paper]:

  1. Selection: Starting from the root (empty algorithm specification), traverse the tree by selecting the child with highest UCB value at each node until reaching a leaf or an unexpanded node.
  2. Expansion: At the leaf, use the LLM to generate candidate next components (up to expansion_width children).
  3. Rollout: Complete the partial algorithm using the LLM (generating remaining components greedily) and evaluate the full algorithm. The reward $r$ is the evaluation score normalized to $[0, 1]$.
  4. Backpropagation: Update $\bar{Q}$ and $N$ values along the path from the evaluated leaf back to the root.
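The four phases above can be sketched as a compact loop. This is an illustrative toy under stated assumptions, not the MCTS-AHD implementation: propose_components and rollout_and_score are hypothetical stand-ins for the LLM expansion and LLM rollout calls, rewards are random numbers in $[0, 1]$, and backpropagation uses the mean.

```python
import math
import random

class Node:
    def __init__(self, components, parent=None):
        self.components = components  # partial algorithm: list of component names
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

    def uct(self, c):
        if self.visits == 0:
            return math.inf  # unvisited children are selected first
        mean = self.total_reward / self.visits
        return mean + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def propose_components(partial, width, rng):
    """Hypothetical stand-in for the LLM expansion call."""
    return [f"c{len(partial)}_{rng.randrange(100)}" for _ in range(width)]

def rollout_and_score(partial, max_depth, rng):
    """Hypothetical stand-in for LLM completion plus evaluation."""
    remaining = max_depth - len(partial)
    full = partial + [f"r{len(partial) + i}" for i in range(remaining)]
    return rng.random(), full  # reward already in [0, 1]

def mcts(num_simulations=50, max_depth=4, expansion_width=3,
         c=math.sqrt(2), seed=0):
    rng = random.Random(seed)
    root = Node(components=[])
    best_reward, best_algorithm = -1.0, None
    for _ in range(num_simulations):
        # 1. Selection: follow the highest-UCT child until a childless node.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.uct(c))
        # 2. Expansion: add candidate next components unless at max depth.
        if len(node.components) < max_depth:
            for comp in propose_components(node.components, expansion_width, rng):
                node.children.append(Node(node.components + [comp], parent=node))
            node = rng.choice(node.children)
        # 3. Rollout: complete the partial algorithm and evaluate it.
        reward, full = rollout_and_score(node.components, max_depth, rng)
        if reward > best_reward:
            best_reward, best_algorithm = reward, full
        # 4. Backpropagation: update statistics from the leaf up to the root.
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return root, best_reward, best_algorithm

root, best_reward, best_algorithm = mcts()
```

Every simulation passes through the root, so root.visits equals the number of simulations; in a real system the two stand-in functions would dominate cost, since each involves LLM calls and a full evaluation.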

LLM4AD Exposure

# Reproduced from LLM4AD README — MCTS-AHD configuration example
# STATUS: README-documented, not execution-validated
method = MCTSAHD(
    max_depth=8,
    num_simulations=200,
    exploration_constant=1.414,
    expansion_width=5,
    rollout_policy="llm",
    backprop_strategy="max",
)

[README] The backprop_strategy="max" parameter suggests that backpropagation stores the maximum reward observed beneath each node rather than the mean. This would bias selection toward branches with high upside, including high-variance ones, which may be appropriate for algorithm design, where a single excellent solution matters more than average performance. Whether this matches the original paper’s recommendation is unknown, because the original implementation has not been identified.
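The effect of the two aggregation rules can be illustrated with a small sketch. The function node_value and the reward lists are hypothetical; how LLM4AD actually interprets backprop_strategy is not documented beyond the parameter name.

```python
def node_value(rewards, strategy="mean"):
    """Aggregate the rewards backed up through a node."""
    if strategy == "mean":
        return sum(rewards) / len(rewards)
    if strategy == "max":
        return max(rewards)
    raise ValueError(f"unknown strategy: {strategy!r}")

# Made-up rollout rewards for two sibling branches.
steady = [0.60, 0.62, 0.61]   # consistently decent
spiky = [0.10, 0.95, 0.15]    # mostly poor, one excellent rollout

mean_prefers_steady = node_value(steady, "mean") > node_value(spiky, "mean")
max_prefers_spiky = node_value(spiky, "max") > node_value(steady, "max")
```

Under the mean rule the steady branch looks better; under the max rule the spiky branch does, which is the behavior one wants when only the single best algorithm is kept.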

8.3.6 (1+1)-EPS — Evolutionary Program Search (PPSN 2024)

Original Method

(1+1)-EPS is the simplest method in LLM4AD’s repertoire: it maintains a single solution and applies greedy hill climbing [paper]. At each iteration, the LLM generates a mutation of the current best solution; if the mutation scores at least as well, it replaces the current solution.

Algorithm Pseudocode

# Algorithm: (1+1)-EPS
# PSEUDOCODE based on the PPSN 2024 paper description
# NOT from LLM4AD source code — simplified for exposition

from collections import deque

def eps_search(initial_code, evaluate, llm, config):
    """
    Single-solution greedy hill climbing with LLM mutations.
    Accepts equal-or-better mutations (non-strict improvement).
    Restarts from best-so-far after stagnation.

    Assumes higher scores are better, evaluate() returns None on failure,
    and llm.mutate(code=..., feedback=...) returns new source code.
    """
    current_code = initial_code
    current_score = evaluate(current_code)
    best_code, best_score = current_code, current_score
    stagnation_count = 0
    # Sliding window of the most recent (code, score) pairs shown to the LLM
    recent = deque(maxlen=config.feedback_window)

    for iteration in range(config.max_iterations):
        # Generate mutation: LLM sees current code + recent feedback
        mutant_code = llm.mutate(code=current_code, feedback=list(recent))
        mutant_score = evaluate(mutant_code)
        recent.append((mutant_code, mutant_score))

        # Greedy acceptance (non-strict: >= not >)
        if mutant_score is not None and mutant_score >= current_score:
            current_code, current_score = mutant_code, mutant_score
            stagnation_count = 0
            if current_score > best_score:
                best_code, best_score = current_code, current_score
        else:
            stagnation_count += 1

        # Restart from best-so-far if stagnated.
        # With non-strict acceptance, current_score never falls below
        # best_score, so this only reverts equal-score (neutral) drift.
        if stagnation_count >= config.stagnation_threshold:
            current_code, current_score = best_code, best_score
            stagnation_count = 0

    return best_code, best_score

Despite its simplicity, (1+1)-EPS provides a useful baseline and can be surprisingly effective on well-defined problems where small edits to the current program often yield incremental score improvements. It issues exactly one LLM mutation call per iteration, the fewest of any method in the platform, and converges quickly when the initial solution is already good.

LLM4AD Exposure

# Reproduced from LLM4AD README — (1+1)-EPS configuration example
# STATUS: README-documented, not execution-validated
method = EPS(
    max_iterations=200,
    mutation_temperature=0.8,
    include_feedback=True,
    feedback_window=5,
    restart_on_stagnation=True,
    stagnation_threshold=30,
)

8.3.7 LLaMEA (IEEE TEVC 2025)

Original Method (van Stein & Bäck, IEEE TEVC 2025)

While most methods in LLM4AD evolve problem-specific heuristics (e.g., a bin-packing strategy or a TSP construction heuristic), LLaMEA generates entire metaheuristic algorithms—general-purpose optimization procedures like novel variants of particle swarm optimization or differential evolution [paper]. The generated metaheuristic is then evaluated on a portfolio of benchmark problems, and scores are aggregated to reward generality.

The aggregation method is significant [paper]. Given scores $s_1, s_2, \ldots, s_K$ on $K$ portfolio benchmarks, the aggregated fitness uses the geometric mean:

$$F_{\text{agg}} = \left(\prod_{k=1}^{K} \hat{s}_k\right)^{1/K}$$

where each $\hat{s}_k = s_k / s_k^{\text{ref}}$ is normalized relative to a reference baseline score $s_k^{\text{ref}}$ for benchmark $k$, ensuring comparable scales across problems. The geometric mean penalizes methods that perform poorly on any single benchmark: if $\hat{s}_k \to 0$ for any $k$, then $F_{\text{agg}} \to 0$ regardless of performance on other benchmarks. This incentivizes robust generalization rather than specialization on easy problems. Note that this requires $\hat{s}_k > 0$ for all $k$; the handling of zero or negative normalized scores is not specified in the paper.
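A short numerical sketch shows how the geometric mean penalizes a single weak benchmark where an arithmetic mean would not. The function name, the reference scores, and the choice to raise an error on non-positive normalized scores are assumptions for illustration; the paper does not specify that case.

```python
import math

def geometric_mean_fitness(scores, references):
    """Geometric mean of scores normalized by per-benchmark references."""
    normalized = [s / ref for s, ref in zip(scores, references)]
    if any(x <= 0 for x in normalized):
        # Assumption: reject rather than guess how the paper would handle this.
        raise ValueError("geometric mean needs strictly positive normalized scores")
    # Averaging logs avoids overflow/underflow from a long product.
    return math.exp(sum(math.log(x) for x in normalized) / len(normalized))

references = [1.0, 1.0, 1.0, 1.0]          # hypothetical baseline scores
balanced = [0.8, 0.8, 0.8, 0.8]            # uniformly decent
lopsided = [1.2, 1.2, 1.2, 0.01]           # strong on three, near-failure on one

geo_balanced = geometric_mean_fitness(balanced, references)
geo_lopsided = geometric_mean_fitness(lopsided, references)
arith_balanced = sum(balanced) / len(balanced)
arith_lopsided = sum(lopsided) / len(lopsided)
```

Here the arithmetic mean ranks the lopsided candidate higher, while the geometric mean ranks it well below the balanced one.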

LLM4AD Exposure

# Reproduced from LLM4AD README — LLaMEA configuration example
# STATUS: README-documented, not execution-validated
method = LLaMEA(
    population_size=10,
    num_generations=30,
    algorithm_template="metaheuristic",
    benchmark_portfolio=[
        "sphere_d10", "rastrigin_d10",
        "rosenbrock_d10", "ackley_d10",
    ],
    aggregation="geometric_mean",
    mutation_strategy="component_swap",
)

[README] The original LLaMEA paper uses a (1+1) strategy (single parent, single offspring) with self-adaptive mutation [paper]; the LLM4AD version exposes a population-based variant with population_size=10, which represents an extension beyond the original paper (known modification). The mutation_strategy="component_swap" parameter is not described in the original paper and appears to be a platform-specific addition (known modification).

8.3.8 Method Comparison Summary

| Method | Venue | Representation | Population | Key Mechanism | Multi-Obj. | Reflection |
|---|---|---|---|---|---|---|
| EoH | ICML 2024 | Thought + code | 20–50 | Thought-code co-evolution | No | No |
| MEoH | AAAI 2025 | Thought + code | 30–100 | Pareto + objective-aware mutation | Yes | No |
| FunSearch | Nature 2024 | Code only | Islands | Best-shot sampling + island model | No | No |
| ReEvo | NeurIPS 2024 | Code + reflections | 10–30 | Self-reflective mutation | No | Yes |
| MCTS-AHD | ICML 2025 | Component tree | Tree nodes | UCB-guided composition | No | No |
| (1+1)-EPS | PPSN 2024 | Code only | 1 | Greedy hill climbing | No | Partial |
| LLaMEA | TEVC 2025 | Full metaheuristic | 10–20 | Portfolio-evaluated generation | No | No |

Population sizes shown are from LLM4AD README examples [README] and may not match original paper defaults. “Partial” reflection for (1+1)-EPS indicates that evaluation feedback is included in the mutation prompt but without a dedicated reflection-analysis step [paper].

8.4 Supported Tasks & Benchmarks

LLM4AD ships with built-in support for twelve or more optimization tasks spanning four domains [README]: combinatorial optimization, machine learning, scientific computing, and mathematical discovery. Each task is defined by an evaluation function, a problem instance specification, a metric, and an optimization direction. The task registry allows researchers to evaluate any search method on any task without writing integration code.
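The four-part task definition and the registry idea can be sketched as follows. Everything here (TaskSpec, TASK_REGISTRY, register, and the placeholder evaluator) is a hypothetical illustration of the pattern, not the LLM4AD task API, which has not been inspected.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass(frozen=True)
class TaskSpec:
    name: str
    evaluate: Callable[[str], float]  # maps candidate source code to a score
    instance: str                     # problem instance identifier
    metric: str
    direction: str                    # "minimize" or "maximize"

    def is_better(self, a: float, b: float) -> bool:
        """True if score a beats score b under this task's direction."""
        return a < b if self.direction == "minimize" else a > b

TASK_REGISTRY: Dict[str, TaskSpec] = {}

def register(task: TaskSpec) -> TaskSpec:
    if task.direction not in ("minimize", "maximize"):
        raise ValueError(f"bad direction: {task.direction!r}")
    TASK_REGISTRY[task.name] = task
    return task

def tour_length_placeholder(code: str) -> float:
    """Placeholder: a real evaluator would run `code` on the instance."""
    return float(len(code))

register(TaskSpec(
    name="toy-tsp",
    evaluate=tour_length_placeholder,
    instance="eil51",
    metric="tour_length",
    direction="minimize",
))

task = TASK_REGISTRY["toy-tsp"]
```

Because every search method only needs evaluate and is_better, any method can be paired with any registered task without task-specific glue code.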

8.4.1 Combinatorial Optimization Tasks

| Task | Description | Metric | Benchmark Instances |
|---|---|---|---|
| Bin Packing | Pack items into fixed-capacity bins | Waste ratio (min) | Falkenauer, Scholl, random (50–5000 items) |
| TSP | Shortest Hamiltonian tour | Tour length (min) | TSPLIB (14–2392 cities), random |
| Facility Location | Minimize transportation cost | Total cost (min) | OR-Library, random (10–500 facilities) |
| Knapsack | Maximize value within weight capacity | Total value (max) | Pisinger, random (50–10000 items) |
| QAP | Assign facilities to locations | Total flow $\times$ distance (min) | QAPLIB (12–256 facilities) |
| Scheduling | Job-shop and flow-shop scheduling | Makespan (min) | Taillard, random (10×5 to 100×20) |

[README] Instance set names and descriptions are from the repository README task listings. Exact file paths for bundled instances within the repository, and whether all listed instance sets are actually included in the PyPI distribution, have not been verified.

8.4.2 ML, Scientific, and Mathematical Tasks

Beyond combinatorial optimization, LLM4AD includes tasks in Bayesian optimization acquisition function design, reinforcement learning reward shaping, CFD turbulence modeling, bacterial growth modeling, circle packing, cap set discovery, and extremal combinatorics [README]. This breadth is important: it tests whether search methods generalize across fundamentally different problem structures, not just variations of the same combinatorial template.

| Domain | Task | Metric | Notes |
|---|---|---|---|
| ML | BO Acquisition | Regret (min) | Branin, Hartmann, Levy functions |
| ML | RL Environments | Cumulative reward (max) | CartPole, MountainCar, LunarLander |
| Scientific | CFD Turbulence | Prediction error vs DNS (min) | Channel flow, flat plate, backward-facing step |
| Scientific | Bacteria Growth | $R^2$ fit (max) | E. coli, S. cerevisiae datasets |
| Math | Circle Packing | Packing ratio (max) | $n = 5$ to $n = 30$; see §8.7 |
| Math | Cap Set Discovery | Cap set size (max) | $n = 4$ to $n = 8$ |
| Math | Extremal Combinatorics | Problem-specific | Various open problems |

The mathematical discovery tasks are particularly noteworthy. Circle packing and cap set problems have known best solutions for small instances, providing clear performance targets. For larger instances (e.g., circle packing at $n \geq 20$), state-of-the-art results are actively being improved, making these tasks research-relevant rather than merely pedagogical.

8.5 LLM Backend Abstraction

LLM4AD documents an LLMBackend interface that abstracts provider-specific API details. The repository README documents six backend categories with their import paths and constructor patterns [README]:

| Backend | Documented Import [README] | Constructor Example [README] | Type |
|---|---|---|---|
| OpenAI | llm4ad.llm.OpenAIBackend | OpenAIBackend(model="gpt-4o", api_key="...") | Cloud API |
| Google | llm4ad.llm.GoogleBackend | GoogleBackend(model="gemini-2.0-pro", api_key="...") | Cloud API |
| DeepSeek | llm4ad.llm.DeepSeekBackend | DeepSeekBackend(model="deepseek-v3", api_key="...") | Cloud API |
| Anthropic | llm4ad.llm.AnthropicBackend | AnthropicBackend(model="claude-sonnet-4-20250514", api_key="...") | Cloud API |
| vLLM | llm4ad.llm.VLLMBackend | VLLMBackend(model_path="meta-llama/Llama-3.1-70B") | Local |
| Ollama | llm4ad.llm.OllamaBackend | OllamaBackend(model="llama3.1:70b") | Local |

[README] Class names and constructor parameters are reproduced from the README documentation. Whether these backends wrap the provider SDKs directly, use an intermediate abstraction, or employ synchronous vs. asynchronous I/O internally is not documented at the public API level. Whether these exact class names and module paths exist in the installed package has not been verified against source code.

Custom backends are documented with the following template [README, “Custom Backend Integration” section]:

# Custom backend template
# Reproduced from LLM4AD repository README, "Custom Backend Integration" section
# STATUS: README-documented template, not execution-validated
from llm4ad.llm import LLMBackend, LLMResponse

class MyBackend(LLMBackend):
    """Custom LLM backend for LLM4AD."""

    async def generate(self, prompt: str, **kwargs) -> LLMResponse:
        response = await my_api(prompt, **kwargs)
        return LLMResponse(
            text=response.text,
            tokens_used=response.total_tokens,
            model=self.model_name,
        )

    def get_model_info(self) -> dict:
        return {
            "name": self.model_name,
            "context_window": 128000,
            "supports_system_prompt": True,
        }

[README] Note the async signature on generate()—this suggests the LLM layer uses asynchronous I/O, though whether the evaluation engine also uses async or bridges to synchronous multiprocessing is not documented. The self.model_name attribute is referenced but its initialization (presumably in the base class constructor) is not shown in the template.
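If the evaluation side is synchronous, an async generate() has to be bridged somehow. The sketch below shows one generic way to do this with the standard library; EchoBackend and FakeResponse are hypothetical stand-ins that deliberately do not import or subclass anything from llm4ad, and nothing here describes how LLM4AD itself performs the bridge.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class FakeResponse:
    text: str
    tokens_used: int

class EchoBackend:
    """Hypothetical backend with the same async shape as the template."""

    async def generate(self, prompt: str, **kwargs) -> FakeResponse:
        await asyncio.sleep(0)  # placeholder for a network round trip
        return FakeResponse(text=prompt.upper(),
                            tokens_used=len(prompt.split()))

def generate_sync(backend, prompt: str, **kwargs) -> FakeResponse:
    """Run one async generate() call to completion from synchronous code."""
    return asyncio.run(backend.generate(prompt, **kwargs))

async def generate_batch(backend, prompts):
    """Issue several requests concurrently and wait for all of them."""
    return await asyncio.gather(*(backend.generate(p) for p in prompts))

single = generate_sync(EchoBackend(), "hello world")
batch = asyncio.run(generate_batch(EchoBackend(), ["a b", "c d e"]))
```

Note that asyncio.run() raises a RuntimeError when called from inside an already running event loop, so a wrapper like generate_sync only suits purely synchronous callers such as worker processes.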

The LLM abstraction layer is important for fair comparison: by running the same search method with different LLMs, researchers can isolate the contribution of model quality from algorithmic strategy.

8.6 Repository-Reported Results

This section presents benchmark results as reported in the LLM4AD repository documentation. These results have not been independently reproduced for this survey. The methodological limitations of these results are consolidated in Section 8.6.4 to avoid repetitive caveats; readers should consult that section before interpreting any individual table.

Evidence standard for all tables in this section. The tables below reproduce numbers from the LLM4AD repository’s benchmark documentation. The exact artifact location within the repository (specific README section, results file, notebook, or figure) could not be pinpointed to a stable anchor or versioned document. The repository does not report trial counts, random seeds, standard deviations, confidence intervals, hardware specifications, or experiment dates for any benchmark table. Consequently, all results in this section should be treated as single-observation summaries from the repository documentation, not as statistically validated evidence. Differences between methods that fall within plausible single-run variance (e.g., 0.02 percentage points in waste ratio) should not be interpreted as meaningful.

8.6.1 Bin Packing (Falkenauer Triplet Instances)

The following table summarizes repository-reported results for seven methods using GPT-4o on Falkenauer triplet bin-packing instances [README benchmark section].

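| Method | LLM | Waste (%) | Gap to BKS (%) | Evaluations | Time (min) |
|---|---|---|---|---|---|
| ReEvo | GPT-4o | ~1.2 | ~0.03 | ~900 | ~55 |
| MEoH | GPT-4o | ~1.2 | ~0.05 | ~1,500 | ~72 |
| EoH | GPT-4o | ~1.2 | ~0.08 | ~1,000 | ~45 |
| MCTS-AHD | GPT-4o | ~1.3 | ~0.10 | ~800 | ~60 |
| (1+1)-EPS | GPT-4o | ~1.3 | ~0.15 | ~200 | ~20 |
| FunSearch | GPT-4o | ~1.4 | ~0.20 | ~5,000 | ~180 |
| LLaMEA | GPT-4o | ~1.4 | ~0.25 | ~300 | ~35 |
| Best known solution (reported) | | 1.15 | 0.00 | | |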

[README benchmark section] Approximate values (marked with ~) indicate that the original numbers are reproduced from repository documentation whose exact provenance within the repository could not be verified against a stable artifact. This table does not represent a controlled experiment. Evaluation budgets differ by 25× (200 for EPS vs. 5,000 for FunSearch), reflecting each method’s documented operating point. Ranking methods by waste percentage alone is misleading without budget normalization; see Section 8.6.4.

8.6.2 TSP (TSPLIB Instances)

The following table reports optimality gaps on three TSPLIB instances [README benchmark section].

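| Method | eil51 Gap% | att48 Gap% | kroA100 Gap% | Avg Gap% |
|---|---|---|---|---|
| ReEvo | ~1.8 | ~1.5 | ~2.9 | ~2.1 |
| MCTS-AHD | ~2.0 | ~1.7 | ~3.2 | ~2.3 |
| EoH | ~2.1 | ~1.8 | ~3.5 | ~2.5 |
| FunSearch | ~2.5 | ~2.2 | ~4.1 | ~2.9 |
| (1+1)-EPS | ~2.8 | ~2.5 | ~4.5 | ~3.3 |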

[README benchmark section] Approximate values. Per-method evaluation budgets are not reported for this table. Gap% is relative to the optimal TSPLIB solution for each instance. MEoH and LLaMEA results are not shown; MEoH requires multi-objective task formulation and LLaMEA targets general metaheuristic generation rather than TSP-specific heuristics. Without evaluation budgets, method rankings are purely descriptive and cannot be interpreted as efficiency comparisons. Trial count, seeds, and variance are not reported.

8.6.3 LLM Comparison (EoH on Bin Packing)

The following table compares LLM backends using EoH on the same bin-packing task, isolating model quality from search method [README benchmark section]:

| LLM | Waste (%) | Cost per Run | Eval Rate | Success Rate |
|---|---|---|---|---|
| Claude Sonnet 4 | ~1.2 | ~$18 | ~18 eval/min | ~88% |
| GPT-4o | ~1.2 | ~$15 | ~22 eval/min | ~85% |
| Gemini 2.0 Pro | ~1.3 | ~$12 | ~25 eval/min | ~82% |
| DeepSeek V3 | ~1.3 | ~$4 | ~15 eval/min | ~78% |
| Llama 3.1 70B (local) | ~1.5 | $0 (GPU cost) | ~8 eval/min | ~65% |

[README benchmark section] Approximate values. “Success rate” refers to the fraction of LLM-generated candidates that are syntactically valid, execute without error, and produce a finite score [README]. “Cost per run” is documented as total API expenditure for a complete optimization run. These figures likely represent single-run observations rather than averaged statistics; the documentation does not specify trial count or variance. API pricing changes over time, so cost figures reflect pricing at an unspecified experiment date.
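The three-part success criterion (parses, runs, returns a finite score) can be sketched as a small checker. This is an illustration of the definition only: the entry-point name score is an assumption, the candidates are made up, and the in-process exec used here offers no isolation at all, so it should not be applied to untrusted code.

```python
import ast
import math

def candidate_succeeds(source: str, entry_point: str = "score") -> bool:
    """True if source parses, runs without error, and yields a finite number."""
    try:
        ast.parse(source)                       # 1. syntactically valid
    except SyntaxError:
        return False
    namespace = {}
    try:
        exec(compile(source, "<candidate>", "exec"), namespace)
        value = namespace[entry_point]()        # 2. executes without error
    except Exception:
        return False
    # 3. produces a finite numeric score
    return isinstance(value, (int, float)) and math.isfinite(value)

def success_rate(sources) -> float:
    return sum(candidate_succeeds(s) for s in sources) / len(sources)

candidates = [
    "def score():\n    return 1.5\n",           # valid
    "def score(:\n    return 1\n",              # syntax error
    "def score():\n    return 1 / 0\n",         # runtime error
    "def score():\n    return float('inf')\n",  # non-finite score
]
rate = success_rate(candidates)
```

Only the first candidate passes all three checks, so the rate for this made-up batch is one in four.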

Two qualitative trends merit attention, with the caveat that they rest on unreplicated observations. First, the spread in solution quality across LLMs (roughly 0.3 percentage points in waste ratio) is comparable in magnitude to the spread across search methods (roughly 0.2 percentage points from Section 8.6.1), suggesting that LLM choice may matter as much as method choice for this particular task. Second, there appears to be a cost-quality tradeoff: the cheapest cloud API (DeepSeek V3) achieves results close to GPT-4o at substantially lower cost. Local models are free in API cost but notably weaker in both solution quality and success rate.

8.6.4 Methodological Caveats

The following caveats apply to all tables in this section and in Section 8.7. We consolidate them here once; readers should treat every benchmark number in this chapter as subject to these limitations.

  • Non-uniform budgets: The seven methods use dramatically different numbers of evaluations (~200 for EPS vs. ~5,000 for FunSearch). This reflects each method’s natural operating regime but precludes score-based ranking. A methodologically sound comparison would require either (a) iso-budget experiments where all methods receive the same number of evaluations, (b) iso-time experiments with a fixed wall-clock budget, or (c) convergence curves showing score vs. evaluations for each method. None of these are present in the repository documentation.
  • Missing statistical metadata: Number of independent trials, random seeds, standard deviations, and confidence intervals are not reported for any table. Without these, it is impossible to assess whether observed differences are statistically significant or within normal run-to-run variance.
  • Unspecified hardware and experiment dates: CPU/GPU type, memory, parallelism settings, and experiment dates are not reported, making wall-clock times and API costs non-reproducible.
  • Fidelity gaps: LLM4AD reimplements each method against a common interface. Differences between the LLM4AD reimplementation and the original method codebase (see Section 8.3.0) could affect results in ways that are difficult to quantify without running both implementations side by side.
  • Single LLM for cross-method comparison: The cross-method comparison in Sections 8.6.1–8.6.2 uses only GPT-4o. Method rankings may change with different LLMs, as some methods may be more sensitive to LLM quality than others.
  • Provenance uncertainty: The exact location of these results within the repository (README section anchor, results directory, experiment log, or supplementary material) could not be pinpointed to a stable, versioned artifact. Numbers are therefore presented as approximate (~) where we cannot confirm they are reproduced exactly.
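The iso-budget idea in the first caveat can be sketched in a few lines: truncate each method's evaluation history at a common budget before comparing. The histories below are invented solely to show how a ranking can flip; they are not LLM4AD results, and best_at_budget is a hypothetical helper.

```python
def best_at_budget(history, budget, direction="minimize"):
    """Best score among the first `budget` evaluations of a run."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    window = history[:budget]
    if not window:
        raise ValueError("empty history")
    return min(window) if direction == "minimize" else max(window)

# Invented per-evaluation scores (lower is better), in evaluation order.
short_run = [3.0, 2.0, 1.9, 1.9, 1.85]                 # stops after 5 evals
long_run = [4.0, 3.5, 3.0, 2.5, 2.0, 1.8, 1.5, 1.2]    # keeps improving

# Comparing final scores favors the long run...
final_short = best_at_budget(short_run, len(short_run))
final_long = best_at_budget(long_run, len(long_run))

# ...but at a shared budget of 5 evaluations the short run is ahead.
iso_short = best_at_budget(short_run, 5)
iso_long = best_at_budget(long_run, 5)
```

Applying the same function over a range of budgets yields the convergence curves mentioned in option (c).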

Despite these limitations, LLM4AD’s shared-infrastructure approach is a genuine step forward. Even if individual numbers require additional validation, the ability to run seven methods under identical evaluation code eliminates a major class of confounds present in cross-paper comparisons where each method uses its own scoring implementation.

8.7 Circle Packing: Repository-Reported Improvement Claims

The LLM4AD repository documentation reports improved results on the circle packing problem, where $n$ non-overlapping unit circles (radius $r = 1$) must be packed into the smallest possible enclosing circle.

8.7.0 Problem Formulation and Metric

The standard circle-packing-in-a-circle problem seeks to minimize the radius $R^*$ of the smallest enclosing circle that contains $n$ non-overlapping unit disks. Formally, given $n$ unit disks with centers $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^2$, the optimization problem is:

$$\min_{\mathbf{x}_1, \ldots, \mathbf{x}_n} \; R \quad \text{subject to:} \quad \|\mathbf{x}_i - \mathbf{x}_j\| \geq 2 \;\; \forall \, i \neq j, \quad \|\mathbf{x}_i\| \leq R - 1 \;\; \forall \, i$$

where the first constraint ensures non-overlap (centers at least $2r = 2$ apart for unit circles) and the second ensures containment (each unit disk lies entirely within the enclosing circle of radius $R$). The state-of-the-art results for this problem are tracked by Packomania (packomania.com), maintained by Eckard Specht, which provides verified circle coordinates for known best packings.
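The two constraints translate directly into a feasibility check on a list of centers. The sketch below is a generic checker, not LLM4AD's evaluator; the function name and the tolerance value are assumptions, and the test packing is the classical arrangement of seven unit circles (one central, six around it) inside a circle of radius 3.

```python
import math
from itertools import combinations

def check_packing(centers, R, r=1.0, tol=1e-9):
    """Return a list of constraint violations; empty means feasible.

    centers -- list of (x, y) disk centers, enclosing circle centered at origin
    R       -- enclosing radius
    r       -- common disk radius
    tol     -- slack for floating-point rounding in distance computations
    """
    violations = []
    for i, (x, y) in enumerate(centers):
        if math.hypot(x, y) > R - r + tol:          # containment
            violations.append(("containment", i))
    for (i, a), (j, b) in combinations(enumerate(centers), 2):
        if math.dist(a, b) < 2 * r - tol:           # non-overlap
            violations.append(("overlap", i, j))
    return violations

# One disk at the origin plus six at distance 2, spaced 60 degrees apart.
seven = [(0.0, 0.0)] + [
    (2 * math.cos(k * math.pi / 3), 2 * math.sin(k * math.pi / 3))
    for k in range(6)
]

feasible_at_3 = check_packing(seven, R=3.0)
violations_at_2_9 = check_packing(seven, R=2.9)
```

In this packing many disks touch exactly, so without the tolerance a rounding error of one unit in the last place could flip the verdict; the choice of tolerance matters when judging improvements in the fifth decimal place.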

The LLM4AD repository uses a “packing ratio” metric rather than the enclosing radius directly [README]:

$$\text{ratio}(n) = \frac{\sum_{i=1}^{n} r_i}{R_{\text{enclosing}}} = \frac{n \cdot 1}{R_{\text{enclosing}}} = \frac{n}{R_{\text{enclosing}}}$$

where $r_i = 1$ for all unit circles. Higher ratio values indicate tighter packing (smaller enclosing circle for the same number of unit circles). This is a monotone transformation of the standard objective: maximizing ratio$(n)$ is equivalent to minimizing $R_{\text{enclosing}}$, since ratio$(n) = n / R_{\text{enclosing}}$ is strictly decreasing in $R_{\text{enclosing}}$ for fixed $n$.

Important note on this metric. This ratio is not a density measure (which would be the area ratio $n \pi r^2 / \pi R^2 = n / R^2$); it is a radius-sum-to-enclosing-radius ratio that scales linearly with $n$. It is also not the standard metric used in the circle-packing literature, where results are typically reported as the enclosing radius $R$ or the minimum enclosing circle diameter $D = 2R$. This non-standard metric choice means that LLM4AD’s ratio values cannot be directly compared with Packomania entries without conversion ($R = n / \text{ratio}$). Whether LLM4AD’s evaluation function computes this ratio from exact geometric coordinates or from a different representation is not documented [README mentions evaluator="exact" but does not specify the evaluation algorithm].

8.7.1 Repository-Reported Results

| $n$ | Previous Best (reported) | LLM4AD Result (reported) | Status | Evaluations |
|---|---|---|---|---|
| 10 | 2.38660 | 2.38660 | Matched | ~500 |
| 15 | 2.47540 | 2.47540 | Matched | ~1,200 |
| 20 | 2.52040 | 2.52042 | Claimed improvement | ~3,000 |
| 22 | 2.56287 | 2.56290 | Claimed improvement | ~5,000 |
| 24 | 2.60240 | 2.60248 | Claimed improvement | ~8,000 |
| 26 | 2.63590 | 2.63594 | Claimed improvement | ~15,000 |

Verification status: repository-reported improvement claims only. These results are reproduced from the LLM4AD repository documentation [README]. They have not been independently verified. Independent verification would require:
  1. Confirming the “previous best” values against Packomania (packomania.com) or another authoritative reference at the time of the experiment. The experiment date is not specified, and Packomania entries may have been updated since.
  2. Obtaining exact circle coordinates $\{(x_i, y_i)\}_{i=1}^{n}$ for each claimed improvement and verifying geometric feasibility: non-overlap ($\|\mathbf{x}_i - \mathbf{x}_j\| \geq 2$ for all $i \neq j$) and containment ($\|\mathbf{x}_i\| \leq R - 1$ for all $i$).
  3. Converting the reported ratio values to enclosing radii and cross-referencing against contemporaneous results from other systems.

The repository documentation does not provide circle coordinates, exact experiment dates, hardware specifications, trial counts, or cross-references to Packomania entries. The method used was documented as FunSearch with Gemini 2.0 Pro [README].

The claimed improvement at $n = 26$ (a ratio of 2.63594 versus a reported previous best of 2.63590) represents a marginal gain of approximately 0.0015%. Converting to the standard enclosing-radius formulation: $R = n/\text{ratio}$, so $R = 26/2.63594 \approx 9.8637$ versus $R = 26/2.63590 \approx 9.8638$, a difference of approximately $0.00015$ in enclosing radius. Improvements at this scale, while potentially valid, are within the range where numerical precision of the geometric verification (floating-point rounding in distance computations) becomes relevant.
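The conversion can be applied to every row of the table, together with a simple plausibility check that is ours and not from the README. Under the unit-circle formulation stated in this section, concentric hexagonal rings of unit disks (1, 7, 19, 37, ... disks in radius 1, 3, 5, 7, ...) give an easily achieved upper bound on the enclosing radius. The helper names below are hypothetical.

```python
def ratio_to_radius(n: int, ratio: float) -> float:
    """Invert ratio = n / R for unit circles."""
    return n / ratio

def hex_ring_radius_bound(n: int) -> int:
    """Radius that certainly suffices for n unit disks.

    k hexagonal rings around a central disk hold 1 + 3k(k+1) disks whose
    centers lie within distance 2k of the origin, so R = 2k + 1 suffices.
    """
    k = 0
    while 1 + 3 * k * (k + 1) < n:
        k += 1
    return 1 + 2 * k

# n: (previous best ratio, LLM4AD ratio), copied from the table above.
reported = {
    10: (2.38660, 2.38660),
    15: (2.47540, 2.47540),
    20: (2.52040, 2.52042),
    22: (2.56287, 2.56290),
    24: (2.60240, 2.60248),
    26: (2.63590, 2.63594),
}

implied = {n: (ratio_to_radius(n, prev), ratio_to_radius(n, new))
           for n, (prev, new) in reported.items()}
relative_gain_pct = {n: 100 * (new - prev) / prev
                     for n, (prev, new) in reported.items()}

# Rows whose implied radius is larger than the trivial hexagonal bound.
exceeds_bound = [n for n in reported
                 if implied[n][1] > hex_ring_radius_bound(n)]
```

For every listed $n \geq 15$ the implied radius exceeds the hexagonal-ring bound (for example, roughly 9.86 versus 7 at $n = 26$), which means the reported ratios are not consistent with best-known packings under the formulation as stated here. One possible reading, which we cannot confirm from the documentation, is that the reported values follow a different packing formulation or metric definition; this reinforces the need for the coordinate-level verification listed above.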

8.7.2 Documented FunSearch Configuration for Circle Packing

The documented FunSearch configuration for circle packing uses substantially larger parameters than for combinatorial optimization tasks [README]:

# Reproduced from LLM4AD README — circle packing configuration
# STATUS: README-documented example, not execution-validated
from llm4ad import LLM4AD
from llm4ad.methods import FunSearch
from llm4ad.tasks import CirclePacking
from llm4ad.llm import GoogleBackend

task = CirclePacking(
    n=26,
    metric="packing_ratio",
    direction="maximize",
    evaluator="exact",
    timeout_per_eval=120,
)

method = FunSearch(
    num_islands=15,
    programs_per_island=100,
    num_samplers=8,
    samples_per_prompt=8,
    temperature=1.0,
    top_k_for_prompt=3,
    reset_period=200,
)

llm = GoogleBackend(model="gemini-2.0-pro")
runner = LLM4AD(method=method, task=task, llm=llm, max_workers=16)
result = runner.run()

[README] This configuration uses 15 islands × 100 programs = 1,500 stored programs, 8 parallel samplers generating 8 samples each, and allows 120-second evaluations. The evaluator="exact" parameter presumably selects exact geometric computation over approximate methods, but its precise semantics (e.g., the algorithm used, precision guarantees, handling of near-touching circles) are not documented beyond this parameter name. Whether this example was actually executed as shown, or is an illustrative configuration, is not stated.

This heavy configuration reflects the difficulty of mathematical discovery tasks, where the search space is combinatorially larger and marginal improvements require extensive exploration. The ~15,000 evaluations documented for $n = 26$ are roughly 1.9× those documented for $n = 24$ (~8,000), 3× those for $n = 22$ (~5,000), and 30× those for $n = 10$ (~500), illustrating the growth in search difficulty with problem size.

8.8 Implementation Details & Reproducibility

This section describes implementation-level details of LLM4AD. To help readers assess the reliability of each claim, we distinguish three evidence levels:

  • README-documented: Directly stated in the repository README or API reference section. These are claims the project makes about itself.
  • Code-observable: Verified by inspecting the repository source code or installing the package. No claims in this chapter reach this evidence level—the analysis is based entirely on documentation.
  • Inferred: Deduced from documentation patterns, example code, or architectural consistency, but not directly stated.

8.8.1 Installation and Package Structure

LLM4AD is distributed via PyPI (pip install llm4ad) with optional extras [README]. The README documents the following installation commands:

# Installation commands reproduced from repository README
# STATUS: README-documented
pip install llm4ad          # Base installation
pip install llm4ad[gui]     # With web-based GUI
pip install llm4ad[all]     # With all LLM backends
pip install -e ".[dev]"     # Development installation from cloned repo

[README] The PyPI package name, available extras, supported Python versions (3.9–3.12), and MIT license are documented in the repository. Whether the PyPI package is actively maintained, what version was current at the time of this survey, and whether all documented extras install successfully have not been independently verified.

8.8.2 Checkpoint and Resume

The README documents periodic checkpointing and a resume() class method [README]:

# Reproduced from LLM4AD README — checkpoint/resume usage
# STATUS: README-documented, NOT verified against source code or by execution
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    checkpoint_dir="./checkpoints/exp-1",
    checkpoint_interval=10,
)
result = runner.run()

# Resume from checkpoint after crash/interruption
runner = LLM4AD.resume("./checkpoints/exp-1")
result = runner.run()  # Documented as continuing from last checkpoint
Documented feature vs. observed behavior. The following aspects of checkpoint/resume are README-documented only and have not been verified by source inspection or execution testing:
  • The serialization format used for checkpoints (pickle, JSON, custom).
  • The exact state preserved (population contents, evaluation history, random seeds, LLM call state, cache contents).
  • Whether resume produces identical results to an uninterrupted run (determinism under resume).
  • Whether resume works across all seven search methods. The checkpoint interval is documented in units of “generations,” but methods like FunSearch do not have a natural generation concept—the interpretation for non-generational methods is not specified.
  • Whether the LLM4AD.resume() classmethod signature matches the documented form.
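To illustrate what "determinism under resume" would require, the following generic sketch checkpoints a toy search loop to JSON, including the random number generator state, and writes the file atomically. It is not LLM4AD's checkpoint mechanism; the file layout, function names, and the toy search itself are assumptions.

```python
import json
import os
import random
import tempfile

def save_checkpoint(path, state):
    """Write JSON atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

def search(path, total_steps, interval, seed=0):
    """Toy maximization loop that resumes from `path` if it exists."""
    rng = random.Random(seed)
    state = {"step": 0, "best": None}
    if os.path.exists(path):
        state = load_checkpoint(path)
        version, internal, gauss = state.pop("rng")
        rng.setstate((version, tuple(internal), gauss))  # JSON turned tuples into lists
    while state["step"] < total_steps:
        candidate = rng.random()  # stand-in for generate + evaluate
        if state["best"] is None or candidate > state["best"]:
            state["best"] = candidate
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, {**state, "rng": rng.getstate()})
    return state["best"]

with tempfile.TemporaryDirectory() as d:
    uninterrupted = search(os.path.join(d, "a.json"), total_steps=10, interval=5)
    resumed_path = os.path.join(d, "b.json")
    search(resumed_path, total_steps=5, interval=5)      # "crash" after 5 steps
    resumed = search(resumed_path, total_steps=10, interval=5)
```

The resumed run matches the uninterrupted one only because the generator state is saved alongside the search state; with a real LLM in the loop, sampling nondeterminism on the provider side would break this equivalence regardless of what the checkpoint stores.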

8.8.3 Logging and Visualization

The README documents integration with both Weights & Biases and TensorBoard, with the ability to use both simultaneously via log_to=["wandb", "tensorboard"] [README]. The documentation states that the logging subsystem records convergence curves, population diversity metrics, per-evaluation scores, LLM token usage, and wall-clock timing [README].

The GUI mode is described as providing a web-based dashboard with real-time monitoring, method configuration, result comparison, and code browsing with syntax highlighting [README]. The GUI is launched via llm4ad gui --port 8080 or llm4ad.gui.launch(port=8080) [README].

Documented features vs. observed behavior: All logging and GUI capabilities described above are from the README. Whether W&B integration logs the specific metrics listed, whether the GUI includes all described features (real-time monitoring, visual configuration, code browsing), and whether the CLI entry point llm4ad gui exists in the installed package are not verified. The README provides no screenshots of the GUI.

8.8.4 Process Isolation, Not Sandboxing

An important clarification on security: LLM4AD’s evaluation engine is documented as using Python multiprocessing for process isolation [README, inferred from architecture description]. This provides protection against:

  • Memory leaks in generated code (each evaluation runs in a fresh process)
  • Infinite loops (terminated by the timeout guard)
  • Crashes in generated code (the worker process terminates without affecting the main orchestrator)

However, this does not provide security sandboxing. Generated code executes with the same filesystem access, network permissions, and OS-level privileges as the parent process. For research settings where all generated code solves well-defined optimization tasks with known-safe operations (arithmetic, array manipulation, sorting), process isolation is usually adequate. For deployment scenarios involving untrusted code, adversarial inputs, or sensitive environments, container-level (Docker) or VM-level isolation would be necessary. The repository documentation does not discuss this distinction or recommend additional isolation measures.
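The three protections listed above can be demonstrated with a minimal sketch. Note the substitution: to stay self-contained, this example launches a fresh interpreter with subprocess rather than using multiprocessing as the LLM4AD documentation describes, and run_isolated is a hypothetical helper. Like the documented approach, it isolates crashes and enforces a timeout but does not restrict what the child process may access.

```python
import subprocess
import sys

def run_isolated(source: str, timeout_s: float = 5.0):
    """Run candidate source in a fresh interpreter; return (status, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,   # child is killed when the timeout expires
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    if proc.returncode != 0:
        return "crashed", proc.stderr.strip()
    return "ok", proc.stdout.strip()

ok_result = run_isolated("print(6 * 7)")
loop_result = run_isolated("while True: pass", timeout_s=0.5)
crash_result = run_isolated("raise RuntimeError('boom')")
```

The orchestrating process survives all three cases, but the child could still have read files or opened network connections before exiting, which is exactly the gap between isolation and sandboxing.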

8.8.5 End-to-End Usage Example

The following example is reproduced from the repository README’s “Full Example: Comparing Methods on TSP” section [README]. It demonstrates the documented workflow for running multiple methods on the same task:

# Reproduced from LLM4AD README — "Full Example: Comparing Methods on TSP"
# STATUS: README-documented, NOT execution-validated at any specific version
from llm4ad import LLM4AD
from llm4ad.methods import EoH, FunSearch, ReEvo, MCTSAHD, EPS
from llm4ad.tasks import TSP
from llm4ad.llm import OpenAIBackend

llm = OpenAIBackend(model="gpt-4o")
task = TSP(instance="tsplib/eil51", metric="tour_length", direction="minimize")

methods = {
    "EoH": EoH(population_size=20, num_generations=50),
    "FunSearch": FunSearch(num_islands=5, programs_per_island=20),
    "ReEvo": ReEvo(population_size=15, num_generations=60),
    "MCTS-AHD": MCTSAHD(max_depth=8, num_simulations=200),
    "(1+1)-EPS": EPS(max_iterations=200),
}

results = {}
for name, method in methods.items():
    print(f"\n--- Running {name} ---")
    runner = LLM4AD(method=method, task=task, llm=llm, max_workers=4)
    results[name] = runner.run()
    print(f"{name}: best tour length = {results[name].best_score:.2f}")

# Compare results
print("\n=== Comparison ===")
for name, result in sorted(results.items(), key=lambda x: x[1].best_score):
    print(f"{name:15s}: {result.best_score:.2f} "
          f"(evals: {result.stats.total_evaluations})")
Documented feature vs. observed behavior. This example is reproduced from the repository README, not from independent execution. The following assumptions embedded in this example are README-documented only:
  • The import paths (from llm4ad import LLM4AD, from llm4ad.methods import EoH, etc.) exist in the installed package.
  • The tsplib/eil51 instance identifier resolves to a bundled or downloadable problem instance.
  • The result.stats.total_evaluations attribute path exists on the return object.
  • The result.best_score attribute contains a numeric value suitable for sorting and formatting.
The README also documents result.plot_convergence(), result.plot_population_diversity(), result.export_csv(), and result.export_latex_table() methods on the result object [README]; their exact output format and behavior are not specified and have not been tested.

8.8.6 Documented Result Object

The README’s API reference section documents the OptimizationResult object returned by runner.run() [README]:

# From LLM4AD README — OptimizationResult API reference
# STATUS: README-documented interface, NOT verified against source code
# Types EvaluationRecord, Individual, and RunStats are referenced
# but their fields are not fully documented in the README.
class OptimizationResult:
    best_code: str                         # Best algorithm code
    best_score: float                      # Best score achieved
    best_thought: str | None               # Best algorithmic idea (EoH/MEoH only)
    history: list[EvaluationRecord]        # Full evaluation history
    population: list[Individual]           # Final population
    stats: RunStats                        # Runtime statistics

    def plot_convergence(self, save_path: str = None): ...
    def plot_population_diversity(self, save_path: str = None): ...
    def export_csv(self, path: str): ...
    def export_latex_table(self) -> str: ...

[README] Whether best_thought is None for non-EoH methods, raises an AttributeError, or is simply absent from the result object is not specified. Whether this class is implemented as a dataclass, Pydantic model, or plain class is not documented. The helper types EvaluationRecord, Individual, and RunStats are referenced but their attributes are not enumerated in the README.
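
Because the behavior of best_thought and the fields of RunStats are unspecified, calling code may want to read the result object defensively. The following is a minimal sketch under that assumption. The function summarize_result is hypothetical, and the stand-in object built with SimpleNamespace is not a real OptimizationResult; its values are invented for illustration.

```python
from types import SimpleNamespace


def summarize_result(result) -> dict:
    """Collect README-documented fields, tolerating absent or None attributes."""
    stats = getattr(result, "stats", None)
    return {
        "best_score": float(result.best_score),
        # getattr with a default covers both "attribute absent" and
        # "property raises AttributeError"; an explicit None passes through.
        "best_thought": getattr(result, "best_thought", None),
        "total_evaluations": getattr(stats, "total_evaluations", None),
    }


# Stand-in for a result from a method that records no thought (illustrative values).
fake_result = SimpleNamespace(
    best_score=431.2,
    stats=SimpleNamespace(total_evaluations=1000),
)
summary = summarize_result(fake_result)
print(summary)
```

This pattern keeps comparison scripts working across methods regardless of which of the three unspecified behaviors the installed package actually implements.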

8.9 Comparative Analysis

LLM4AD occupies a distinctive position in the LLM-powered evolutionary systems landscape: it is a unification platform rather than a novel search method. This distinction shapes how it should be evaluated relative to other systems.

8.9.1 Platform Comparison

| Feature | LLM4AD | AlphaEvolve | OpenEvolve | GEPA |
|---|---|---|---|---|
| Open source | MIT | No | Yes | Yes |
| Search methods | 7 integrated | 1 (proprietary) | 1 (FunSearch-inspired) | 1 (Pareto EA) |
| Built-in tasks | 12+ | Custom only | Custom only | Custom only |
| LLM backends | 6 categories | Google only | Multiple | Multiple |
| GUI | Yes (documented) | No | No | No |
| Multi-objective | Yes (MEoH) | Weighted | Weighted | Pareto |
| Checkpointing | Yes (documented) | Yes | Yes | Yes |
| Experiment tracking | W&B + TensorBoard (documented) | Internal | Custom | Custom |
| Evaluation isolation | multiprocessing (process-level) | Container/sandbox | subprocess | subprocess |
| Fidelity validation | Not reported | N/A (single method) | N/A | N/A |

Comparison based on publicly available documentation as of early 2026. LLM4AD features marked “(documented)” are claimed in the README but not source-verified for this chapter. AlphaEvolve’s internal capabilities are not fully disclosed; its evaluation isolation column reflects the white paper’s description of containerized execution.

8.9.2 Novelty and Fidelity Assessment

LLM4AD’s novelty lies in integration engineering, not in algorithmic innovation. None of the seven search methods were invented by the LLM4AD team; each was published independently at top venues. The contribution is the infrastructure that makes controlled comparison possible. This is a valuable contribution—analogous to how benchmark suites like TSPLIB or platforms like OpenAI Gym advanced their respective fields by enabling reproducible comparison rather than introducing new algorithms.

The platform faces a fundamental tension between breadth and fidelity. Reimplementing seven published methods introduces the risk of fidelity gaps: if the reimplemented version of FunSearch or ReEvo behaves differently from the authors’ original code because of subtle differences in prompt construction, selection logic, or hyperparameter defaults, then a “fair comparison on LLM4AD” may not reflect the relative strengths of the original methods. Section 8.3.0 identifies at least two methods with known modifications (FunSearch’s LLM backend substitution and migration addition; LLaMEA’s population-based extension), and the fidelity status of several others is unknown. The repository documentation does not include validation against the original codebases, which is the most significant evidence gap.

8.9.3 Breadth vs. Depth Tradeoff

LLM4AD’s breadth—7 methods, 12+ tasks, 6 backend categories—is simultaneously its greatest strength and a potential vulnerability. Systems like OpenEvolve or GEPA focus on a single search strategy, allowing deep optimization of prompt engineering, caching, and adaptation specific to that strategy. LLM4AD must maintain a general-purpose evaluation engine and interface that accommodates fundamentally different search paradigms (population-based evolution, tree search, single-solution hill climbing, island models), which may prevent method-specific optimizations.

For researchers, the appropriate tool depends on the question being asked:

  • “Which search method works best for my problem?” → LLM4AD’s multi-method comparison capability is uniquely valuable, provided the researcher controls for budget, runs multiple trials, and accounts for potential fidelity gaps.
  • “How do I squeeze maximum performance from FunSearch on circle packing?” → The original FunSearch codebase or a specialized fork may be more appropriate, as platform-specific overhead and interface constraints may limit method-specific optimizations.
  • “I want to quickly prototype a new search method against established baselines.” → LLM4AD’s built-in task library and baseline implementations provide a fast starting point.

8.10 Limitations & Open Questions

8.10.1 Technical Limitations

  • Python-only generation: All generated algorithms are Python code. For computationally intensive tasks (CFD, large-scale TSP), Python’s overhead may be a limiting factor. Support for C++, Rust, or Julia generation would enable access to performance-critical applications.
  • Cost at scale: Running all 7 methods across 12+ tasks with multiple LLMs for robust comparison requires thousands of LLM API calls. A complete comparison matrix with multiple trials could cost thousands of dollars, limiting accessibility for resource-constrained research groups.
  • No cross-method ensembling: Each method runs independently. There is no documented mechanism to share promising candidates or learned patterns across methods during a single run. Cross-method portfolio approaches could potentially outperform any individual method.
  • Evaluation bottleneck: For expensive-to-evaluate tasks (CFD turbulence models, RL environment training), the evaluation cost dominates LLM call cost, and the evaluation engine’s multiprocessing may not provide sufficient parallelism.
  • Process isolation only: The lack of container or VM sandboxing means generated code has full access to the host filesystem and network. While acceptable for research on well-defined optimization tasks, this limits deployment in sensitive or multi-tenant environments.

8.10.2 Evidence Gaps

The current repository documentation leaves several evidence gaps that limit the platform’s value as a definitive comparison tool:

  • No fidelity validation: The most critical gap. Without validation that LLM4AD’s reimplementations match the original method codebases, cross-method comparisons reflect LLM4AD’s versions rather than the published methods. As shown in Section 8.3.0, at least two methods have known modifications (FunSearch, LLaMEA) and most others have unknown fidelity status.
  • No statistical reporting: Seeds, trial counts, standard deviations, and confidence intervals are absent from all benchmark tables.
  • No convergence curves: Score-vs-evaluations curves would enable budget-normalized comparison and reveal whether methods are still improving when the run ends.
  • No ablation studies: The contribution of shared infrastructure vs. method-specific tuning is not isolated.
  • No source-code-level API verification: All API descriptions in this chapter are based on README documentation. An independent audit of the actual codebase would strengthen confidence in the documented interfaces.

8.10.3 Research Questions

LLM4AD’s multi-method infrastructure opens several research questions that would be difficult to investigate without such a platform:

  • Method-task affinity: Do certain search methods systematically outperform others on specific problem types? The repository-reported results suggest ReEvo performs well on combinatorial optimization, but whether this generalizes requires controlled experiments with proper statistical methodology.
  • LLM-method interaction: Does the relative ranking of search methods change when the underlying LLM changes? A method that depends on high-quality LLM output, such as ReEvo with its reflection step, might degrade more sharply with weaker models than simpler methods like (1+1)-EPS.
  • Budget-normalized comparison: How do methods compare when given equal compute budgets (wall-clock time or total LLM tokens) rather than method-specific operating points?
  • Fidelity validation: How closely do LLM4AD’s reimplementations match the original method codebases? Running both the original and LLM4AD versions on the same task with the same LLM would quantify fidelity gaps.
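
The budget-normalized comparison question can be made concrete with a small sketch. It assumes each run is available as a list of per-evaluation scores in evaluation order; the README does not enumerate the fields of EvaluationRecord, so that representation, the function name best_so_far_at, and the sample data are all hypothetical. The budget unit here is the evaluation count, a simplification of the wall-clock or token budgets mentioned above.

```python
def best_so_far_at(scores: list[float], budget: int, minimize: bool = True) -> float:
    """Best score among the first `budget` evaluations of a single run."""
    if budget < 1 or not scores:
        raise ValueError("budget must be >= 1 and scores must be non-empty")
    window = scores[:budget]
    return min(window) if minimize else max(window)


# Hypothetical per-evaluation tour lengths for two methods (lower is better).
runs = {
    "method_a": [9.0, 7.5, 7.0, 6.8, 6.8],
    "method_b": [8.0, 8.0, 7.9, 6.0, 5.5, 5.4, 5.0],
}

# Compare at the largest budget that every run actually reached.
common_budget = min(len(scores) for scores in runs.values())
table = {name: best_so_far_at(scores, common_budget) for name, scores in runs.items()}
print(common_budget, table)
```

Truncating every run to the common budget avoids crediting a method for evaluations its competitors were never given. Evaluating the same function at several budgets yields the points of a convergence curve.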

8.10.4 Documented Planned Extensions

The repository documentation describes several planned features [README]: distributed execution via Ray or Dask, automatic method selection based on problem characteristics, multi-language code generation, a community-contributed benchmark hub, and meta-learning for warm-starting new runs. Whether any of these have been implemented or are under active development is not known.

8.11 Research Significance

LLM4AD’s contribution to the field is best understood through the lens of research infrastructure rather than algorithmic innovation. The fragmentation problem it addresses is real and consequential: without a unified comparison platform, claims about method superiority are confounded by implementation differences, making it difficult for the community to build cumulative knowledge about what works and why.

The platform makes three specific contributions to the research ecosystem:

  1. Standardized evaluation: A shared evaluation engine that eliminates scoring, timing, and process-isolation discrepancies across methods. This is the most technically significant contribution, as it addresses a structural problem in the field. However, its value is limited by the absence of fidelity validation (Section 8.3.0) and statistical methodology (Section 8.6.4).
  2. Accessible experimentation: The documented GUI, built-in tasks, and comprehensive documentation lower the barrier to entry for researchers who want to experiment with LLM-based algorithm design without building infrastructure from scratch.
  3. Reproducible baselines: By providing a single codebase where all methods can be run, LLM4AD serves as a reference implementation and baseline generator for future research. New methods can be compared against all seven existing methods under controlled conditions, provided the researcher designs experiments with appropriate statistical rigor.

The circle packing results (Section 8.7), if independently verified with exact geometric coordinates and cross-referenced against Packomania, would demonstrate that a research platform can produce competitive results on open mathematical problems. However, this claim currently rests on repository documentation without third-party confirmation or published coordinates.

8.12 Cross-Method Architectural Comparison

The following diagram illustrates how the seven integrated methods differ in their search topology, from single-solution (EPS) through population-based (EoH, ReEvo, LLaMEA) and island-based (FunSearch) to tree-structured (MCTS-AHD):

[Figure: Search topology spectrum of the seven methods, ordered from simple to complex. (1+1)-EPS: single solution, greedy accept, hill climbing. EoH: population, thought + code co-evolution. ReEvo: population + reflection loop. LLaMEA: portfolio, population, meta-heuristics. FunSearch: island model (islands I1, I2, I3), best-shot sampling, migration + reset. MCTS-AHD: tree search, UCB selection. MEoH: extends EoH with Pareto selection + crowding distance. All methods share the evaluation engine, LLM backend, task definitions, and logging/checkpoints, with process-isolated evaluation, configurable timeout, content-hash caching, and multiprocessing.]
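
The simplest end of this spectrum, a single solution with greedy acceptance, can be sketched in a few lines. This is a generic (1+1) hill climber on a toy numeric objective, not LLM4AD's EPS implementation: a random perturbation stands in for the LLM proposal step, and accepting ties is an assumption. All names are hypothetical.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


def one_plus_one(
    initial: T,
    propose: Callable[[T], T],
    score: Callable[[T], float],
    iterations: int,
) -> tuple[T, float]:
    """Keep one incumbent; replace it only when a proposal is no worse (minimization)."""
    best, best_score = initial, score(initial)
    for _ in range(iterations):
        candidate = propose(best)           # stand-in for "ask the LLM for a variant"
        candidate_score = score(candidate)  # stand-in for the shared evaluation engine
        if candidate_score <= best_score:   # greedy accept
            best, best_score = candidate, candidate_score
    return best, best_score


rng = random.Random(0)  # fixed seed so the toy run is repeatable
best, best_score = one_plus_one(
    initial=0.0,
    propose=lambda x: x + rng.uniform(-1.0, 1.0),
    score=lambda x: (x - 3.0) ** 2,
    iterations=200,
)
print(best, best_score)
```

The population, island, and tree-search methods replace the single incumbent with a richer data structure and a selection rule, while the propose and score roles stay the same. That shared shape is what lets one evaluation engine serve all seven methods.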

8.13 Summary

Key Takeaway

LLM4AD is the most comprehensive open-source platform for LLM-based automatic algorithm design, integrating seven published search methods under a unified evaluation infrastructure with 12+ built-in optimization tasks and support for six LLM backend categories. Its principal contribution is enabling controlled cross-method comparison—a structural advance for a field where prior results were confounded by implementation differences across independent codebases.

Evidence Quality Assessment

| Claim Category | Source | Verification Level |
|---|---|---|
| Architecture & design | Repository README | README-documented; internal structure not code-verified |
| API surface (class names, imports) | README code examples | README-documented; not execution-tested at a pinned version |
| Benchmark results (bin packing, TSP) | Repository benchmark docs | README-documented; no seeds, trials, CI, or stable artifact link |
| Circle packing improvements | Repository docs | Repository-claimed; no coordinates, dates, or third-party verification |
| Method descriptions (EoH, FunSearch, etc.) | Original papers | Paper-grounded; LLM4AD fidelity verified only at conceptual level (Section 8.3.0) |
| LLM backend support | README | README-documented; backend classes and behavior not source-verified |
| Checkpoint/resume, GUI, logging | README | README-documented feature; no observed implementation behavior verified |

Main Contribution to the Field

By providing shared evaluation infrastructure, standardized task definitions, and a common LLM abstraction layer, LLM4AD transforms the question “which LLM-based algorithm design method is best?” from an unanswerable cross-paper comparison into an empirically testable hypothesis—contingent on researchers designing experiments with proper controls. This is an infrastructure contribution rather than an algorithmic one, but it is precisely the type of contribution the field needs at this stage of its development.

What a Researcher Should Know

LLM4AD’s repository-reported benchmarks show ReEvo and EoH performing well on combinatorial optimization tasks, but these results lack statistical reporting and use non-uniform evaluation budgets. The circle packing improvement claims are repository-reported only, without published coordinates or independent verification. The most significant gap is the absence of fidelity validation between LLM4AD’s method reimplementations and the original codebases (Section 8.3.0 identifies known modifications for FunSearch and LLaMEA). When using LLM4AD for method comparison, researchers should: (1) design experiments with controlled budgets or budget-normalized metrics, (2) run multiple trials with different seeds and report confidence intervals, (3) consider running the original method codebase alongside LLM4AD’s version to assess fidelity, and (4) report the exact package version or commit hash for reproducibility.
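
Recommendation (4) can be automated at the start of every experiment script. The sketch below records the installed package version alongside the interpreter version using only the standard library. The distribution name "llm4ad" is an assumption based on the repository name and should be checked against the installed package; the function name environment_record is hypothetical.

```python
import json
import platform
from importlib import metadata
from typing import Optional


def environment_record(package: str = "llm4ad") -> dict:
    """Capture package and interpreter versions for a run log."""
    version: Optional[str]
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # not installed under this distribution name
    return {
        "package": package,
        "version": version,
        "python": platform.python_version(),
    }


record = environment_record()
print(json.dumps(record, indent=2))
```

Because the repository does not publish version tags or a changelog, a source checkout would additionally need the commit hash (for example from git rev-parse HEAD) stored next to this record.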