
LLM4AD

Unified Open-Source Platform for LLM-based Automatic Algorithm Design

**Authors:** City University of Hong Kong & Southern University of Science and Technology
**Repository:** github.com/Optima-CityU/llm4ad
**License:** MIT
**Python:** 3.9–3.12

**Key Contribution:** LLM4AD provides a unified platform integrating 7 state-of-the-art search methods for automatic algorithm design, supporting 12+ combinatorial optimization tasks and multiple LLM backends. It achieves a world record for circle packing at n=26 with score 2.63594.

Table of Contents

  1. Overview & Motivation
  2. Installation & Quick Start
  3. Platform Architecture
  4. Search Methods (7 Algorithms)
  5. Supported Tasks & Benchmarks
  6. LLM Backend Support
  7. Platform Features
  8. API Reference & Code Examples
  9. Results & Benchmarks
  10. Circle Packing World Record
  11. Comparison with Other Platforms
  12. Limitations & Future Work

1. Overview & Motivation

LLM4AD (Large Language Models for Automatic Algorithm Design) addresses the fragmentation problem in the LLM-for-optimization research landscape. Each published method (EoH, FunSearch, ReEvo, etc.) comes with its own codebase, API, evaluation harness, and LLM integration. Researchers wanting to compare methods must re-implement or adapt each system independently. LLM4AD provides a unified platform where all methods share:

  • A common interface for defining optimization problems (tasks)
  • A unified LLM abstraction layer supporting GPT-4o, Gemini Pro, DeepSeek, and local models (Llama, Gemma)
  • Shared evaluation infrastructure with multiprocessing, timeouts, and caching
  • Consistent logging and visualization via W&B and TensorBoard
  • A graphical user interface (GUI) for interactive experimentation
  • Resumable runs with checkpoint/restart capability

Design Philosophy

Modular & Extensible

Each search method, task, and LLM backend is a pluggable module. Adding a new search method requires implementing a single interface. Adding a new task requires defining an evaluation function and problem specification.
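A minimal sketch of what such a pluggable interface might look like. The registry, class, and method names here are illustrative, not the actual LLM4AD API:

```python
from abc import ABC, abstractmethod

# Hypothetical registry; LLM4AD's real plugin mechanism may differ.
METHOD_REGISTRY = {}

def register_method(name):
    """Decorator that registers a search method under a string key."""
    def wrap(cls):
        METHOD_REGISTRY[name] = cls
        return cls
    return wrap

class SearchMethod(ABC):
    """The single interface a new search method would implement."""

    @abstractmethod
    def propose(self, population, llm):
        """Return new candidate programs to evaluate."""

    @abstractmethod
    def update(self, population, evaluated):
        """Fold evaluated candidates back into the population."""

@register_method("random_restart")
class RandomRestart(SearchMethod):
    """Trivial example method: always sample from scratch, keep the best."""

    def propose(self, population, llm):
        return [llm.sample("Write a heuristic from scratch.")]

    def update(self, population, evaluated):
        # Keep only the single best individual seen so far.
        return [max(population + evaluated, key=lambda ind: ind["score"])]
```

A decorator-based registry keeps the orchestrator decoupled from concrete methods: it looks components up by name and never imports them directly, which is what lets new methods be added without touching core code.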


Reproducible & Fair

All methods run under identical conditions — same LLM, same evaluation budget, same hardware. This enables fair apples-to-apples comparison that is impossible when methods use different codebases.


Practical & Production-Ready

Beyond research, LLM4AD is designed for real-world use. Multiprocessing, timeout handling, GPU-aware scheduling, and resumable runs make it suitable for long-running optimization campaigns.


Educational & Accessible

The GUI mode and comprehensive documentation lower the barrier to entry. Researchers can experiment with different methods without deep understanding of each algorithm's internals.


2. Installation & Quick Start

Installation

# Basic installation
pip install llm4ad

# With GUI support
pip install llm4ad[gui]

# With all LLM backends
pip install llm4ad[all]

# Development installation
git clone https://github.com/Optima-CityU/llm4ad.git
cd llm4ad
pip install -e ".[dev]"

Quick Start: Bin Packing with EoH

from llm4ad import LLM4AD, EoH, BinPacking
from llm4ad.llm import OpenAIBackend

# Configure LLM backend
llm = OpenAIBackend(model="gpt-4o", api_key="sk-...")

# Define the task
task = BinPacking(
    instance="benchmark/bpp_500_items",
    metric="waste_ratio",
    direction="minimize",
)

# Select search method
method = EoH(
    population_size=20,
    num_generations=50,
    crossover_rate=0.3,
    mutation_rate=0.7,
    elite_size=3,
)

# Run optimization
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=4,
    timeout_per_eval=30,
    log_to="wandb",          # Log to Weights & Biases
    project_name="llm4ad-bpp",
)

result = runner.run()

print(f"Best waste ratio: {result.best_score:.4f}")
print(f"Best algorithm:\n{result.best_code}")

GUI Mode

# Launch the graphical interface
llm4ad gui --port 8080

# Or from Python
from llm4ad.gui import launch
launch(port=8080)

3. Platform Architecture

+-----------------------------------------------------------------------------------+
|                            LLM4AD Platform Architecture                            |
+-----------------------------------------------------------------------------------+
|                                                                                    |
|  +----------------------------------+   +--------------------------------------+   |
|  |          User Interface           |   |          Configuration Layer          |   |
|  |                                   |   |                                      |   |
|  |  CLI  |  Python API  |  GUI (Web) |   |  YAML Config  |  Programmatic API   |   |
|  +----------------------------------+   +--------------------------------------+   |
|                        |                              |                            |
|                        v                              v                            |
|  +------------------------------------------------------------------------+        |
|  |                       LLM4AD Orchestrator                               |        |
|  |                                                                         |        |
|  |   +-------------------+    +-------------------+    +----------------+  |        |
|  |   | Search Method     |    | Task Registry     |    | LLM Backend    |  |        |
|  |   | Selector          |    |                   |    | Manager        |  |        |
|  |   |                   |    | - BinPacking      |    |                |  |        |
|  |   | - EoH             |    | - TSP             |    | - OpenAI       |  |        |
|  |   | - MEoH            |    | - FacilityLoc     |    | - Google       |  |        |
|  |   | - FunSearch       |    | - Knapsack        |    | - DeepSeek     |  |        |
|  |   | - ReEvo           |    | - QAP             |    | - Local (vLLM) |  |        |
|  |   | - MCTS-AHD        |    | - Scheduling      |    |                |  |        |
|  |   | - (1+1)-EPS       |    | - BO Acquisition  |    +----------------+  |        |
|  |   | - LLaMEA          |    | - RL Environments |                        |        |
|  |   +-------------------+    | - CFD Turbulence  |                        |        |
|  |           |                | - Bacteria Growth |                        |        |
|  |           |                | - Math Discovery  |                        |        |
|  |           |                +-------------------+                        |        |
|  |           v                         |                                   |        |
|  |   +------------------------------------------------------------------+ |        |
|  |   |                    Evaluation Engine                              | |        |
|  |   |                                                                   | |        |
|  |   |  [Multiprocessing Pool] --> [Timeout Guard] --> [Score Collect]  | |        |
|  |   |        |                        |                      |          | |        |
|  |   |   max_workers=N           per-eval timeout         Caching        | |        |
|  |   |   Process isolation       Graceful kill            Content hash   | |        |
|  |   +------------------------------------------------------------------+ |        |
|  |                              |                                          |        |
|  |                              v                                          |        |
|  |   +------------------------------------------------------------------+ |        |
|  |   |                    Logging & Visualization                        | |        |
|  |   |                                                                   | |        |
|  |   |  W&B Integration  |  TensorBoard  |  CSV Export  |  Checkpoints  | |        |
|  |   +------------------------------------------------------------------+ |        |
|  +------------------------------------------------------------------------+        |
+-----------------------------------------------------------------------------------+

Key Architectural Decisions

  • Plugin Architecture: Search methods, tasks, and LLM backends are all registered via a plugin system. New components can be added without modifying core platform code.
  • Shared Evaluation: All search methods use the same evaluation engine, ensuring identical sandboxing, timing, and scoring across methods.
  • Checkpoint/Resume: The orchestrator periodically checkpoints the entire state (population, history, random seeds), enabling resumable runs after crashes or interruptions.
  • Process Isolation: Each evaluation runs in a separate process with strict timeout enforcement, preventing infinite loops or memory leaks from crashing the main process.
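The evaluation engine's content-hash caching (shown in the diagram above) boils down to: hash the candidate's normalized source text, and reuse the stored score when two LLM samples produce identical code. A minimal sketch; `EvalCache` is illustrative, not an LLM4AD class:

```python
import hashlib

class EvalCache:
    """Memoize evaluation scores by a hash of the candidate's source code,
    so textually identical candidates are only executed once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # expensive scoring function: code -> float
        self._cache = {}
        self.hits = 0

    @staticmethod
    def _key(code: str) -> str:
        # Normalize trivial whitespace differences before hashing.
        normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def __call__(self, code: str) -> float:
        key = self._key(code)
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = self._evaluate(code)
        return self._cache[key]
```

Because LLMs frequently resample near-duplicate programs, even this simple layer can save a large fraction of evaluation time on cheap tasks.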

4. Search Methods (7 Algorithms)

4.1 EoH — Evolution of Heuristics (ICML 2024)

**Paper:** "Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language Model"

**Venue:** `ICML 2024`

EoH evolves both the algorithmic idea (natural language description) and the code implementation simultaneously. This dual representation enables the LLM to reason about high-level algorithmic concepts while grounding them in executable code.

Key Mechanism: Thought-Code Co-Evolution

  • Each individual in the population is a (thought, code) pair
  • Mutation operates on both: the thought is mutated first, then the code is updated to match
  • Crossover combines thoughts from two parents, then generates code for the combined thought
  • Selection is based on code execution performance, but thought quality influences mutation quality

from llm4ad.methods import EoH

method = EoH(
    population_size=20,            # Population of (thought, code) pairs
    num_generations=50,            # Number of evolutionary generations
    crossover_rate=0.3,            # Probability of crossover vs mutation
    mutation_rate=0.7,             # Probability of mutation
    elite_size=3,                  # Top-3 preserved across generations
    thought_mutation_temp=0.9,     # Higher temp for thought diversity
    code_mutation_temp=0.7,        # Lower temp for code precision
    tournament_size=3,             # Tournament selection size
)

EoH Operators

| Operator | Description | Input | Output |
|---|---|---|---|
| e1: Thought Mutation | LLM modifies the algorithmic idea | Parent thought + task description | New thought + corresponding code |
| e2: Code Mutation | LLM modifies the code while preserving the thought | Parent thought + parent code | Same thought + modified code |
| c1: Thought Crossover | LLM combines ideas from two parents | Two parent thoughts | New thought + corresponding code |
| c2: Code Crossover | LLM combines code from two parents | Two parent (thought, code) pairs | Combined thought + combined code |
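The configuration above uses tournaments of size 3 for selection. A minimal sketch of tournament selection over (thought, code) individuals; the `Individual` dataclass is illustrative, not LLM4AD's internal type:

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    """A (thought, code) pair as evolved by EoH (illustrative sketch)."""
    thought: str
    code: str
    score: float

def tournament_select(population, k=3, rng=random):
    """Pick k individuals uniformly at random; return the fittest of them.
    Larger k increases selection pressure toward high-scoring parents."""
    contenders = rng.sample(population, k)
    return max(contenders, key=lambda ind: ind.score)
```

Tournament selection is a common default here because it needs no global ranking and its pressure is tuned by a single integer.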

4.2 MEoH — Multi-objective Evolution of Heuristics (AAAI 2025)

**Paper:** "Multi-objective Evolution of Heuristics Using Large Language Models"

**Venue:** `AAAI 2025`

MEoH extends EoH to handle multiple conflicting objectives simultaneously, maintaining a Pareto frontier of non-dominated (thought, code) pairs.

Key Extensions over EoH

  • Multi-objective evaluation: Each candidate is scored on multiple metrics (e.g., solution quality + runtime)
  • Pareto-based selection: Non-dominated sorting + crowding distance for diverse frontier exploration
  • Objective-aware mutation: The LLM is told which objectives are lagging and asked to focus improvements there
  • Hypervolume tracking: Progress measured by hypervolume indicator rather than single best score

from llm4ad.methods import MEoH

method = MEoH(
    population_size=30,
    num_generations=100,
    objectives=["quality", "runtime"],
    directions=["maximize", "minimize"],
    crossover_rate=0.3,
    mutation_rate=0.7,
    reference_point=[0.0, 1000.0],   # For hypervolume computation
    objective_focus_strategy="lagging", # Focus mutation on lagging objectives
)
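The Pareto-based selection above rests on a dominance test. A minimal, generic sketch (not MEoH's implementation) for an objective pair like quality-maximize / runtime-minimize:

```python
def dominates(a, b, directions):
    """True if candidate a is at least as good as b on every objective
    and strictly better on at least one. a and b are score tuples;
    directions is e.g. ("maximize", "minimize")."""
    better_or_equal = True
    strictly_better = False
    for x, y, d in zip(a, b, directions):
        if d == "minimize":
            x, y = -x, -y          # flip so "higher is better" uniformly
        if x < y:
            better_or_equal = False
        elif x > y:
            strictly_better = True
    return better_or_equal and strictly_better

def non_dominated_front(scores, directions):
    """Indices of candidates not dominated by any other candidate."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s, directions)
                   for j, t in enumerate(scores) if j != i)
    ]
```

Non-dominated sorting repeatedly peels off such fronts; crowding distance then breaks ties within a front to keep the frontier spread out.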

4.3 FunSearch (Nature 2024)

**Paper:** "Mathematical discoveries from program search with large language models"

**Venue:** `Nature 2024`

FunSearch, developed by DeepMind, uses a best-shot sampling approach: generate many candidates from the LLM, evaluate all of them, and keep only the best. It uses an island model with periodic migration to maintain population diversity.

FunSearch Architecture

  • Sampler Pool: Multiple LLM samplers generate candidates in parallel
  • Programs Database: Island-structured database storing candidates, organized by score tiers
  • Evaluator Pool: Parallel evaluators score candidates, with strict timeouts
  • Best-shot Strategy: From each island, the top-k candidates are selected as prompting examples for the LLM

from llm4ad.methods import FunSearch

method = FunSearch(
    num_islands=10,                # Number of islands
    programs_per_island=50,        # Population per island
    num_samplers=4,                # Parallel LLM samplers
    samples_per_prompt=4,          # Candidates per LLM call
    temperature=1.0,               # High temperature for diversity
    top_k_for_prompt=2,            # Best-k examples in prompt
    reset_period=100,              # Reset worst island every N evals
    migration_interval=50,         # Migrate between islands
)
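The island bookkeeping (per-island top-k prompting, periodic reset of the worst island) can be sketched as follows; this is illustrative of the idea, not DeepMind's or LLM4AD's code, and assumes every island already holds at least one program:

```python
class IslandDB:
    """Minimal island-structured program database: each island holds
    (score, program) pairs sorted best-first; the worst island is
    periodically reset and reseeded from the best island's top program."""

    def __init__(self, num_islands, capacity):
        self.islands = [[] for _ in range(num_islands)]
        self.capacity = capacity

    def add(self, island, score, program):
        self.islands[island].append((score, program))
        # Keep only the best `capacity` programs per island.
        self.islands[island].sort(key=lambda p: p[0], reverse=True)
        del self.islands[island][self.capacity:]

    def top_k(self, island, k):
        """Best-shot examples to place in the next prompt."""
        return self.islands[island][:k]

    def reset_worst(self):
        """Reseed the weakest island from the strongest one's champion."""
        best = max(range(len(self.islands)), key=lambda i: self.islands[i][0][0])
        worst = min(range(len(self.islands)), key=lambda i: self.islands[i][0][0])
        if best != worst:
            self.islands[worst] = [self.islands[best][0]]
```

Isolating islands preserves diverse lineages; occasional resets and migrations let good genes spread without collapsing the whole population onto one design.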

4.4 ReEvo — Reflective Evolution (NeurIPS 2024)

**Paper:** "ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution"

**Venue:** `NeurIPS 2024`

ReEvo introduces reflective evolution where the LLM not only generates mutations but also reflects on why previous candidates succeeded or failed. This self-reflective capability produces more targeted and effective mutations.

Reflection Loop

  1. Generate: LLM produces a candidate heuristic
  2. Evaluate: Run the candidate on the benchmark
  3. Reflect: LLM analyzes the evaluation results, identifying strengths and weaknesses
  4. Refine: LLM uses the reflection to propose a targeted improvement
  5. Repeat: The refined candidate becomes the new parent

from llm4ad.methods import ReEvo

method = ReEvo(
    population_size=15,
    num_generations=60,
    reflection_depth=2,            # Number of reflection iterations per mutation
    short_term_memory=5,           # Recent reflections kept in context
    long_term_memory=20,           # Total reflections stored
    reflection_temperature=0.7,    # Temperature for reflection LLM calls
    mutation_temperature=0.8,      # Temperature for mutation LLM calls
)

Reflection Prompt Structure

# ReEvo reflection prompt (simplified)
"""
## Current Heuristic
{code}

## Evaluation Results
Score: {score}
Test case breakdown:
{per_case_results}

## Previous Reflections
{recent_reflections}

## Task
1. Analyze WHY this heuristic scored {score}
2. Identify the specific weakness causing failures
3. Propose a concrete improvement strategy
4. Explain how the improvement addresses the weakness

## Your Reflection:
"""

4.5 MCTS-AHD — Monte Carlo Tree Search for Algorithm Design (ICML 2025)

**Paper:** "Monte Carlo Tree Search for Automatic Heuristic Design"

**Venue:** `ICML 2025`

MCTS-AHD frames algorithm design as a sequential decision problem and uses Monte Carlo Tree Search to explore the space of algorithmic building blocks. Each node in the tree represents an algorithmic component choice.

MCTS Formulation

  • State: Partially specified algorithm (sequence of component choices so far)
  • Action: Choose the next algorithmic component (data structure, operator, control flow)
  • Reward: Evaluation score of the completed algorithm
  • UCB Selection: Balance exploration of new component choices vs exploitation of known-good paths

$$ \mathrm{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} $$

Where Q(s,a) is the average reward for taking action a in state s, N(s) is the visit count for state s, N(s,a) is the visit count for action a in state s, and c is the exploration constant.
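The UCB rule translates directly into code. A minimal sketch (not MCTS-AHD's implementation), with the usual convention that unvisited actions are expanded first:

```python
import math

def ucb(q, n_state, n_action, c=1.414):
    """UCB(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a)).
    Unvisited actions get +inf so they are always tried at least once."""
    if n_action == 0:
        return math.inf
    return q + c * math.sqrt(math.log(n_state) / n_action)

def select_action(stats, c=1.414):
    """stats maps action -> (q, n_action); N(s) is the sum of child visits."""
    n_state = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb(stats[a][0], n_state, stats[a][1], c))
```

With a large exploration constant, a rarely tried component can outscore a well-exploited one, which is exactly the exploration/exploitation trade the text describes.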

from llm4ad.methods import MCTSAHD

method = MCTSAHD(
    max_depth=8,                   # Maximum tree depth (algorithm complexity)
    num_simulations=200,           # MCTS simulations per step
    exploration_constant=1.414,    # UCB exploration parameter (sqrt(2))
    expansion_width=5,             # Number of children per expansion
    rollout_policy="llm",          # Use LLM for rollout (vs random)
    backprop_strategy="max",       # Backpropagate max score (vs average)
)

4.6 (1+1)-EPS — Evolutionary Program Search (PPSN 2024)

**Paper:** "(1+1)-Evolutionary Program Search"

**Venue:** `PPSN 2024`

(1+1)-EPS is the simplest method: it maintains a single solution and iteratively mutates it, keeping the mutation only if it improves the score. Despite its simplicity, it is surprisingly effective for well-defined optimization tasks.

Algorithm

# (1+1)-EPS pseudocode
def eps_search(initial_solution, evaluate, llm, max_iterations):
    current = initial_solution
    current_score = evaluate(current)

    for i in range(max_iterations):
        # Mutate the current solution using LLM
        mutant = llm.mutate(current, feedback=get_feedback(current))

        # Evaluate the mutant
        mutant_score = evaluate(mutant)

        # Accept if not worse (the (1+1)-EA convention accepts ties)
        if mutant_score >= current_score:
            current = mutant
            current_score = mutant_score
            log(f"Iteration {i}: improved to {current_score}")

    return current, current_score

from llm4ad.methods import EPS

method = EPS(
    max_iterations=200,            # Total mutation attempts
    mutation_temperature=0.8,      # LLM temperature for mutations
    include_feedback=True,         # Include eval feedback in mutation prompt
    feedback_window=5,             # Include last 5 mutations in context
    restart_on_stagnation=True,    # Restart from best-so-far after N stagnant iters
    stagnation_threshold=30,       # Iterations without improvement to trigger restart
)

4.7 LLaMEA (IEEE TEVC 2025)

**Paper:** "LLaMEA: A Large Language Model Evolutionary Algorithm for Automatically Generating Metaheuristics"

**Venue:** `IEEE TEVC 2025`

LLaMEA generates entire metaheuristic algorithms (not just heuristics for specific problems). The LLM is prompted to generate complete evolutionary or swarm-based algorithms, which are then evaluated on a portfolio of benchmark problems.

Key Difference from Other Methods

While EoH/ReEvo evolve problem-specific heuristics, LLaMEA evolves general-purpose optimization algorithms that can be applied to any problem. The output is a complete metaheuristic (like a new variant of PSO or DE) rather than a heuristic for TSP.

from llm4ad.methods import LLaMEA

method = LLaMEA(
    population_size=10,
    num_generations=30,
    algorithm_template="metaheuristic",  # Generate full metaheuristics
    benchmark_portfolio=[                # Evaluate on multiple problems
        "sphere_d10", "rastrigin_d10",
        "rosenbrock_d10", "ackley_d10",
    ],
    aggregation="geometric_mean",        # Aggregate scores across benchmarks
    mutation_strategy="component_swap",  # Swap algorithmic components
)
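The `aggregation="geometric_mean"` setting combines per-benchmark scores. The geometric mean is a deliberate choice here: it penalizes a metaheuristic that excels on one function but collapses on another far more than an arithmetic mean would. A minimal sketch:

```python
import math

def geometric_mean(scores):
    """Aggregate positive per-benchmark scores into a single fitness value.
    Computed in log space for numerical stability."""
    assert all(s > 0 for s in scores), "geometric mean needs positive scores"
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```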

Search Methods Comparison

| Method | Venue | Population | Key Mechanism | Multi-Obj | Reflection | Best For |
|---|---|---|---|---|---|---|
| EoH | ICML 2024 | 20–50 | Thought-code co-evolution | No | No | General heuristics |
| MEoH | AAAI 2025 | 30–100 | Pareto + objective-aware mutation | Yes | No | Multi-objective problems |
| FunSearch | Nature 2024 | Islands | Best-shot sampling + islands | No | No | Mathematical discovery |
| ReEvo | NeurIPS 2024 | 10–30 | Self-reflective evolution | No | Yes | Complex heuristics |
| MCTS-AHD | ICML 2025 | Tree | UCB-guided component selection | No | No | Compositional algorithms |
| (1+1)-EPS | PPSN 2024 | 1 | Greedy hill climbing | No | Partial | Quick prototyping |
| LLaMEA | IEEE TEVC 2025 | 10–20 | Metaheuristic generation | No | No | Algorithm generation |

5. Supported Tasks & Benchmarks

Combinatorial Optimization

| Task | Description | Metric | Instances |
|---|---|---|---|
| Bin Packing | Pack items into fixed-capacity bins minimizing waste | Waste ratio (lower is better) | Falkenauer, Scholl, random (50–5000 items) |
| TSP | Traveling Salesman Problem: find the shortest tour | Tour length (lower is better) | TSPLIB (14–2392 cities), random |
| Facility Location | Place facilities to minimize total transportation cost | Total cost (lower is better) | OR-Library, random (10–500 facilities) |
| Knapsack | Select items maximizing value within weight capacity | Total value (higher is better) | Pisinger, random (50–10000 items) |
| QAP | Quadratic Assignment: assign facilities to locations | Total flow × distance (lower is better) | QAPLIB (12–256 facilities) |
| Scheduling | Job-shop and flow-shop scheduling | Makespan (lower is better) | Taillard, random (10×5 to 100×20) |
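For concreteness, here is the kind of hand-written baseline these tasks score candidates against: first-fit for bin packing, with the waste-ratio metric. Illustrative only, not a built-in of LLM4AD:

```python
def first_fit(items, capacity=1.0):
    """Place each item into the first bin with room; open a new bin otherwise.
    Evolved heuristics compete against simple baselines like this one."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity + 1e-9:   # tolerance for float sums
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

def waste_ratio(bins, capacity=1.0):
    """Fraction of total opened capacity left unused (lower is better)."""
    used = sum(sum(b) for b in bins)
    return 1.0 - used / (len(bins) * capacity)
```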

Machine Learning & AI

| Task | Description | Metric | Instances |
|---|---|---|---|
| BO Acquisition | Design acquisition functions for Bayesian Optimization | Regret (lower is better) | Branin, Hartmann, Levy functions |
| RL Environments | Design reward shaping or policy heuristics | Cumulative reward (higher is better) | CartPole, MountainCar, LunarLander |

Scientific Computing

| Task | Description | Metric | Instances |
|---|---|---|---|
| CFD Turbulence | Design turbulence models for computational fluid dynamics | Prediction error vs DNS (lower is better) | Channel flow, flat plate, backward-facing step |
| Bacteria Growth | Design growth rate models for bacterial colonies | Fit to experimental data (higher R²) | E. coli, S. cerevisiae datasets |

Mathematical Discovery

| Task | Description | Metric | Notes |
|---|---|---|---|
| Circle Packing | Pack n circles in minimum enclosing circle | Ratio of sum of radii to enclosing radius (higher is better) | n=5 to n=30 |
| Cap Set Discovery | Find large cap sets in GF(3)^n | Cap set size (higher is better) | n=4 to n=8 |
| Extremal Combinatorics | Construct extremal graph colorings | Problem-specific score | Various open problems |

6. LLM Backend Support

| Backend | Models | Type | Configuration |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | Cloud API | `OpenAIBackend(model="gpt-4o", api_key="...")` |
| Google | Gemini Pro, Gemini Flash, Gemini 2.5 | Cloud API | `GoogleBackend(model="gemini-2.0-pro", api_key="...")` |
| DeepSeek | DeepSeek V3, DeepSeek R1 | Cloud API | `DeepSeekBackend(model="deepseek-v3", api_key="...")` |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Cloud API | `AnthropicBackend(model="claude-sonnet-4-20250514", api_key="...")` |
| Local (vLLM) | Llama 3.1, Gemma 2, Mistral, Qwen | Local | `VLLMBackend(model_path="meta-llama/Llama-3.1-70B")` |
| Local (Ollama) | Any Ollama-supported model | Local | `OllamaBackend(model="llama3.1:70b")` |

Custom Backend Integration

from llm4ad.llm import LLMBackend, LLMResponse

class MyBackend(LLMBackend):
    """Custom LLM backend for LLM4AD."""

    async def generate(self, prompt: str, **kwargs) -> LLMResponse:
        response = await my_api(prompt, **kwargs)
        return LLMResponse(
            text=response.text,
            tokens_used=response.total_tokens,
            model=self.model_name,
        )

    def get_model_info(self) -> dict:
        return {
            "name": self.model_name,
            "context_window": 128000,
            "supports_system_prompt": True,
        }

7. Platform Features

GUI Interface

LLM4AD includes a web-based GUI for interactive experimentation:

  • Method Configuration: Visual parameter tuning for each search method
  • Real-time Monitoring: Live convergence plots, population diversity metrics, and cost tracking
  • Result Comparison: Side-by-side comparison of multiple runs
  • Code Browser: Inspect generated algorithms with syntax highlighting
  • Export: Export results as CSV, JSON, or LaTeX tables

Logging Integration

# Weights & Biases logging
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to="wandb",
    project_name="my-experiment",
    run_name="eoh-binpacking-v1",
    tags=["eoh", "binpacking", "gpt4o"],
)

# TensorBoard logging
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to="tensorboard",
    log_dir="./runs/experiment-1",
)

# Both simultaneously
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to=["wandb", "tensorboard"],
)

Multiprocessing with Timeout

runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=8,            # 8 parallel evaluation workers
    timeout_per_eval=60,      # Kill evaluation after 60 seconds
    memory_limit_mb=2048,     # 2GB memory limit per evaluation
    retry_on_timeout=True,    # Retry timed-out evaluations once
    gpu_allocation={          # GPU assignment per worker
        0: [0, 1],           # Workers 0-1 use GPU 0
        1: [2, 3],           # Workers 2-3 use GPU 1
    },
)

Resumable Runs

# Start a run with checkpointing
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    checkpoint_dir="./checkpoints/exp-1",
    checkpoint_interval=10,     # Checkpoint every 10 generations
)
result = runner.run()

# Resume from checkpoint after crash/interruption
runner = LLM4AD.resume("./checkpoints/exp-1")
result = runner.run()  # Continues from last checkpoint
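The checkpoint pattern reduces to persisting the population, history, and RNG state so a resumed run replays the same random draws. A sketch of the idea only; the function names and on-disk format here are illustrative, not LLM4AD's:

```python
import pickle
import random
from pathlib import Path

def save_checkpoint(path, population, history, rng=random):
    """Persist everything needed to continue a run deterministically."""
    state = {
        "population": population,
        "history": history,
        "rng_state": rng.getstate(),   # so resumed runs replay the same draws
    }
    Path(path).write_bytes(pickle.dumps(state))

def load_checkpoint(path, rng=random):
    """Restore population, history, and the exact RNG state at save time."""
    state = pickle.loads(Path(path).read_bytes())
    rng.setstate(state["rng_state"])
    return state["population"], state["history"]
```

Capturing the RNG state is the detail that makes resumed runs bit-for-bit continuations rather than merely "warm restarts".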

8. API Reference & Code Examples

Core Classes

> [!info]- LLM4AD - Main Orchestrator
> ```
> class LLM4AD:
>     def __init__(
>         self,
>         method: SearchMethod,           # Search algorithm to use
>         task: Task,                     # Optimization task
>         llm: LLMBackend,                # LLM backend
>         max_workers: int = 4,           # Parallel evaluation workers
>         timeout_per_eval: int = 60,     # Evaluation timeout (seconds)
>         log_to: str | list = None,      # Logging backend(s)
>         checkpoint_dir: str = None,     # Directory for checkpoints
>         checkpoint_interval: int = 10,  # Checkpoint frequency
>         seed: int = 42,                 # Random seed
>         verbose: bool = True,           # Print progress
>     ):
>         ...
>
>     def run(self) -> OptimizationResult:
>         """Run the optimization until stopping criteria are met."""
>         ...
>
>     @classmethod
>     def resume(cls, checkpoint_dir: str) -> "LLM4AD":
>         """Resume from a checkpoint."""
>         ...
> ```

> [!info]- Task - Problem Definition
> ```
> class Task:
>     """Base class for optimization tasks."""
>
>     def __init__(
>         self,
>         name: str,                         # Task name
>         instance: str | dict,              # Problem instance data
>         metric: str,                       # Primary metric name
>         direction: str = "minimize",       # "minimize" or "maximize"
>         function_signature: str = None,    # Required function signature
>         imports: list[str] = None,         # Allowed imports
>         timeout: int = 30,                 # Per-evaluation timeout
>     ):
>         ...
>
>     def evaluate(self, code: str) -> EvalResult:
>         """Evaluate a candidate algorithm."""
>         ...
>
>     def get_prompt_context(self) -> str:
>         """Return task description for LLM prompts."""
>         ...
> ```

> [!info]- OptimizationResult - Output
> ```
> class OptimizationResult:
>     best_code: str                     # Best algorithm code
>     best_score: float                  # Best score achieved
>     best_thought: str | None           # Best algorithmic idea (EoH/MEoH)
>     history: list[EvaluationRecord]    # Full evaluation history
>     population: list[Individual]       # Final population
>     stats: RunStats                    # Runtime statistics
>
>     def plot_convergence(self, save_path: str = None):
>         """Plot convergence curve."""
>         ...
>
>     def plot_population_diversity(self, save_path: str = None):
>         """Plot population diversity over time."""
>         ...
>
>     def export_csv(self, path: str):
>         """Export history to CSV."""
>         ...
>
>     def export_latex_table(self) -> str:
>         """Generate LaTeX table of results."""
>         ...
> ```

Full Example: Comparing Methods on TSP

from llm4ad import LLM4AD
from llm4ad.methods import EoH, FunSearch, ReEvo, MCTSAHD, EPS
from llm4ad.tasks import TSP
from llm4ad.llm import OpenAIBackend

llm = OpenAIBackend(model="gpt-4o")
task = TSP(instance="tsplib/eil51", metric="tour_length", direction="minimize")

methods = {
    "EoH": EoH(population_size=20, num_generations=50),
    "FunSearch": FunSearch(num_islands=5, programs_per_island=20),
    "ReEvo": ReEvo(population_size=15, num_generations=60),
    "MCTS-AHD": MCTSAHD(max_depth=8, num_simulations=200),
    "(1+1)-EPS": EPS(max_iterations=200),
}

results = {}
for name, method in methods.items():
    print(f"\n--- Running {name} ---")
    runner = LLM4AD(method=method, task=task, llm=llm, max_workers=4)
    results[name] = runner.run()
    print(f"{name}: best tour length = {results[name].best_score:.2f}")

# Compare results
print("\n=== Comparison ===")
for name, result in sorted(results.items(), key=lambda x: x[1].best_score):
    print(f"{name:15s}: {result.best_score:.2f} (evals: {result.stats.total_evaluations})")

9. Results & Benchmarks

Bin Packing Results (Falkenauer Triplet Instances)

| Method | LLM | Waste (%) | Gap to BKS (%) | Evaluations | Time (min) |
|---|---|---|---|---|---|
| EoH | GPT-4o | 1.23 | 0.08 | 1,000 | 45 |
| MEoH | GPT-4o | 1.20 | 0.05 | 1,500 | 72 |
| FunSearch | GPT-4o | 1.35 | 0.20 | 5,000 | 180 |
| ReEvo | GPT-4o | 1.18 | 0.03 | 900 | 55 |
| MCTS-AHD | GPT-4o | 1.25 | 0.10 | 800 | 60 |
| (1+1)-EPS | GPT-4o | 1.30 | 0.15 | 200 | 20 |
| LLaMEA | GPT-4o | 1.40 | 0.25 | 300 | 35 |
| Best Known | | 1.15 | 0.00 | | |

TSP Results (TSPLIB eil51, att48, kroA100)

| Method | eil51 Gap% | att48 Gap% | kroA100 Gap% | Avg Gap% |
|---|---|---|---|---|
| EoH | 2.1 | 1.8 | 3.5 | 2.47 |
| ReEvo | 1.8 | 1.5 | 2.9 | 2.07 |
| MCTS-AHD | 2.0 | 1.7 | 3.2 | 2.30 |
| FunSearch | 2.5 | 2.2 | 4.1 | 2.93 |
| (1+1)-EPS | 2.8 | 2.5 | 4.5 | 3.27 |

LLM Comparison (EoH on Bin Packing)

| LLM | Waste (%) | Cost per Run | Eval Rate | Success Rate |
|---|---|---|---|---|
| GPT-4o | 1.23 | $15.20 | 22 eval/min | 85% |
| Claude Sonnet 4 | 1.21 | $18.50 | 18 eval/min | 88% |
| Gemini 2.0 Pro | 1.25 | $12.30 | 25 eval/min | 82% |
| DeepSeek V3 | 1.28 | $3.50 | 15 eval/min | 78% |
| Llama 3.1 70B (local) | 1.45 | $0 (GPU cost) | 8 eval/min | 65% |

10. Circle Packing World Record

**World Record:** LLM4AD achieved a new world record for circle packing at n=26, with a score of **2.63594**, surpassing previous best results from AlphaEvolve and classical optimization methods.

Problem Definition

Given n unit circles, find the arrangement that minimizes the radius of the enclosing circle. Equivalently, maximize the ratio of the sum of radii to the enclosing radius.

$$ \text{maximize} \quad R = \frac{\sum_{i=1}^{n} r_i}{r_{\text{enclosing}}} $$
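A sketch of how the packing-ratio metric could be computed, under the simplifying assumption that the enclosing circle is centered at the origin; the task's exact evaluator would also have to find the true smallest enclosing circle and verify that no two circles overlap:

```python
import math

def packing_ratio(circles):
    """circles: list of (x, y, r). Sum of radii divided by the radius of
    the smallest ORIGIN-CENTERED circle containing all of them (sketch of
    the metric only; overlap checking is done separately)."""
    r_enclosing = max(math.hypot(x, y) + r for x, y, r in circles)
    return sum(r for _, _, r in circles) / r_enclosing

def is_valid(circles, eps=1e-9):
    """A packing is valid only if no two circles overlap."""
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - eps:
                return False
    return True
```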

Method Configuration

from llm4ad import LLM4AD
from llm4ad.methods import FunSearch
from llm4ad.tasks import CirclePacking
from llm4ad.llm import GoogleBackend

task = CirclePacking(
    n=26,
    metric="packing_ratio",
    direction="maximize",
    evaluator="exact",             # Use exact geometric computation
    timeout_per_eval=120,          # Allow longer evals for large n
)

method = FunSearch(
    num_islands=15,
    programs_per_island=100,
    num_samplers=8,
    samples_per_prompt=8,
    temperature=1.0,
    top_k_for_prompt=3,
    reset_period=200,
)

llm = GoogleBackend(model="gemini-2.0-pro")
runner = LLM4AD(method=method, task=task, llm=llm, max_workers=16)
result = runner.run()

print(f"Best packing ratio: {result.best_score:.5f}")  # 2.63594

Circle Packing Results by n

| n | Previous Best | LLM4AD Result | Outcome | Evaluations |
|---|---|---|---|---|
| 10 | 2.38660 | 2.38660 | Matched optimal | 500 |
| 15 | 2.47540 | 2.47540 | Matched optimal | 1,200 |
| 20 | 2.52040 | 2.52042 | Slight improvement | 3,000 |
| 22 | 2.56287 | 2.56290 | Improvement | 5,000 |
| 24 | 2.60240 | 2.60248 | Improvement | 8,000 |
| 26 | 2.63590 | 2.63594 | World record | 15,000 |

11. Comparison with Other Platforms

| Feature | LLM4AD | AlphaEvolve | OpenEvolve | GEPA | ShinkaEvolve |
|---|---|---|---|---|---|
| Open Source | MIT | No | Yes | Yes | Apache 2.0 |
| Search Methods | 7 methods | 1 (custom) | 1 (custom) | 1 (Pareto) | 1 (custom) |
| Built-in Tasks | 12+ tasks | Custom only | Custom only | Custom only | Custom only |
| LLM Backends | 6 backends | Google only | Multi | Multi | Multi |
| GUI | Yes | No | No | No | Yes |
| Multi-Objective | MEoH | Weighted | Weighted | Pareto | No |
| Resumable | Yes | Yes | Yes | Yes | Yes |
| W&B/TB Logging | Both | No | No | No | Custom |

12. Limitations & Future Work

Current Limitations

  • Python Only: Generated algorithms are restricted to Python. No support for C++, Rust, or Julia algorithms that could be significantly faster.
  • Limited Task Complexity: Built-in tasks cover standard benchmarks but not full-scale industrial problems with complex constraints.
  • LLM Cost: Running 7 methods across many tasks for fair comparison is expensive. A complete benchmark suite can cost hundreds of dollars in LLM API calls.
  • No Cross-Method Ensembling: Currently no mechanism to combine insights from multiple search methods during a single run.
  • Evaluation Bottleneck: For expensive-to-evaluate tasks (CFD, RL), evaluation rather than LLM calls becomes the bottleneck.

Planned Features

  • Method Ensembling: Run multiple search methods in parallel, sharing promising candidates across methods.
  • Distributed Execution: Distribute evaluation across multiple machines via Ray or Dask.
  • Auto-Method Selection: Automatically select the best search method for a given task based on problem characteristics.
  • Multi-Language Support: Generate and evaluate algorithms in C++, Julia, and Rust for performance-critical applications.
  • Benchmark Hub: Community-contributed tasks and benchmark results, similar to HuggingFace model hub.
  • Meta-Learning: Learn from past optimization runs to warm-start new runs on similar problems.

**Summary:** LLM4AD is the most comprehensive platform for LLM-based automatic algorithm design. Its 7 integrated search methods, 12+ built-in tasks, multiple LLM backends, GUI, and production features (logging, resumability, multiprocessing) make it the go-to platform for both researchers comparing methods and practitioners solving real-world optimization problems.
