
LLM4AD

Unified Open-Source Platform for LLM-based Automatic Algorithm Design

**Authors:** City University of Hong Kong & Southern University of Science and Technology
**Repository:** github.com/Optima-CityU/llm4ad
**License:** MIT
**Python:** 3.9–3.12

**Key Contribution:** LLM4AD provides a unified platform integrating 7 state-of-the-art search methods for automatic algorithm design, supporting 12+ combinatorial optimization tasks and multiple LLM backends. It achieves a world record for circle packing at n=26 with score 2.63594.

Table of Contents

  1. Overview & Motivation
  2. Installation & Quick Start
  3. Platform Architecture
  4. Search Methods (7 Algorithms)
  5. Supported Tasks & Benchmarks
  6. LLM Backend Support
  7. Platform Features
  8. API Reference & Code Examples
  9. Results & Benchmarks
  10. Circle Packing World Record
  11. Comparison with Other Platforms
  12. Limitations & Future Work

1. Overview & Motivation

LLM4AD (Large Language Models for Automatic Algorithm Design) addresses the fragmentation problem in the LLM-for-optimization research landscape. Each published method (EoH, FunSearch, ReEvo, etc.) comes with its own codebase, API, evaluation harness, and LLM integration. Researchers wanting to compare methods must re-implement or adapt each system independently. LLM4AD provides a unified platform where all methods share:

  • A common interface for defining optimization problems (tasks)
  • A unified LLM abstraction layer supporting GPT-4o, Gemini Pro, DeepSeek, and local models (Llama, Gemma)
  • Shared evaluation infrastructure with multiprocessing, timeouts, and caching
  • Consistent logging and visualization via W&B and TensorBoard
  • A graphical user interface (GUI) for interactive experimentation
  • Resumable runs with checkpoint/restart capability

Design Philosophy

Modular & Extensible

Each search method, task, and LLM backend is a pluggable module. Adding a new search method requires implementing a single interface. Adding a new task requires defining an evaluation function and problem specification.
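A minimal sketch of what such a pluggable interface might look like. The registry, class, and method names here are illustrative, not the actual LLM4AD API:

```python
from abc import ABC, abstractmethod

# Hypothetical registry; LLM4AD's real plugin mechanism may differ.
METHOD_REGISTRY = {}

def register_method(name):
    """Decorator that registers a search method under a string key."""
    def wrap(cls):
        METHOD_REGISTRY[name] = cls
        return cls
    return wrap

class SearchMethod(ABC):
    """The single interface a new search method would implement."""

    @abstractmethod
    def propose(self, population, llm):
        """Return new candidate programs to evaluate."""

    @abstractmethod
    def update(self, population, evaluated):
        """Fold evaluated candidates back into the population."""

@register_method("random_restart")
class RandomRestart(SearchMethod):
    """Trivial example method: always sample from scratch, keep the best."""

    def propose(self, population, llm):
        return [llm.sample("Write a heuristic from scratch.")]

    def update(self, population, evaluated):
        # Keep only the single best individual seen so far.
        return [max(population + evaluated, key=lambda ind: ind["score"])]
```

A decorator-based registry keeps the orchestrator decoupled from concrete methods: it looks components up by name and never imports them directly, which is what lets new methods be added without touching core code.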


Reproducible & Fair

All methods run under identical conditions — same LLM, same evaluation budget, same hardware. This enables fair apples-to-apples comparison that is impossible when methods use different codebases.


Practical & Production-Ready

Beyond research, LLM4AD is designed for real-world use. Multiprocessing, timeout handling, GPU-aware scheduling, and resumable runs make it suitable for long-running optimization campaigns.


Educational & Accessible

The GUI mode and comprehensive documentation lower the barrier to entry. Researchers can experiment with different methods without deep understanding of each algorithm's internals.


2. Installation & Quick Start

Installation

# Basic installation
pip install llm4ad

# With GUI support
pip install llm4ad[gui]

# With all LLM backends
pip install llm4ad[all]

# Development installation
git clone https://github.com/Optima-CityU/llm4ad.git
cd llm4ad
pip install -e ".[dev]"

Quick Start: Bin Packing with EoH

from llm4ad import LLM4AD, EoH, BinPacking
from llm4ad.llm import OpenAIBackend

# Configure LLM backend
llm = OpenAIBackend(model="gpt-4o", api_key="sk-...")

# Define the task
task = BinPacking(
    instance="benchmark/bpp_500_items",
    metric="waste_ratio",
    direction="minimize",
)

# Select search method
method = EoH(
    population_size=20,
    num_generations=50,
    crossover_rate=0.3,
    mutation_rate=0.7,
    elite_size=3,
)

# Run optimization
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=4,
    timeout_per_eval=30,
    log_to="wandb",          # Log to Weights & Biases
    project_name="llm4ad-bpp",
)

result = runner.run()

print(f"Best waste ratio: {result.best_score:.4f}")
print(f"Best algorithm:\n{result.best_code}")

GUI Mode

# Launch the graphical interface
llm4ad gui --port 8080

# Or from Python
from llm4ad.gui import launch
launch(port=8080)

3. Platform Architecture

+-----------------------------------------------------------------------------------+
|                            LLM4AD Platform Architecture                            |
+-----------------------------------------------------------------------------------+
|                                                                                    |
|  +----------------------------------+   +--------------------------------------+   |
|  |          User Interface           |   |          Configuration Layer          |   |
|  |                                   |   |                                      |   |
|  |  CLI  |  Python API  |  GUI (Web) |   |  YAML Config  |  Programmatic API   |   |
|  +----------------------------------+   +--------------------------------------+   |
|                        |                              |                            |
|                        v                              v                            |
|  +------------------------------------------------------------------------+        |
|  |                       LLM4AD Orchestrator                               |        |
|  |                                                                         |        |
|  |   +-------------------+    +-------------------+    +----------------+  |        |
|  |   | Search Method     |    | Task Registry     |    | LLM Backend    |  |        |
|  |   | Selector          |    |                   |    | Manager        |  |        |
|  |   |                   |    | - BinPacking      |    |                |  |        |
|  |   | - EoH             |    | - TSP             |    | - OpenAI       |  |        |
|  |   | - MEoH            |    | - FacilityLoc     |    | - Google       |  |        |
|  |   | - FunSearch       |    | - Knapsack        |    | - DeepSeek     |  |        |
|  |   | - ReEvo           |    | - QAP             |    | - Local (vLLM) |  |        |
|  |   | - MCTS-AHD        |    | - Scheduling      |    |                |  |        |
|  |   | - (1+1)-EPS       |    | - BO Acquisition  |    +----------------+  |        |
|  |   | - LLaMEA          |    | - RL Environments |                        |        |
|  |   +-------------------+    | - CFD Turbulence  |                        |        |
|  |           |                | - Bacteria Growth |                        |        |
|  |           |                | - Math Discovery  |                        |        |
|  |           |                +-------------------+                        |        |
|  |           v                         |                                   |        |
|  |   +------------------------------------------------------------------+ |        |
|  |   |                    Evaluation Engine                              | |        |
|  |   |                                                                   | |        |
|  |   |  [Multiprocessing Pool] --> [Timeout Guard] --> [Score Collect]  | |        |
|  |   |        |                        |                      |          | |        |
|  |   |   max_workers=N           per-eval timeout         Caching        | |        |
|  |   |   Process isolation       Graceful kill            Content hash   | |        |
|  |   +------------------------------------------------------------------+ |        |
|  |                              |                                          |        |
|  |                              v                                          |        |
|  |   +------------------------------------------------------------------+ |        |
|  |   |                    Logging & Visualization                        | |        |
|  |   |                                                                   | |        |
|  |   |  W&B Integration  |  TensorBoard  |  CSV Export  |  Checkpoints  | |        |
|  |   +------------------------------------------------------------------+ |        |
|  +------------------------------------------------------------------------+        |
+-----------------------------------------------------------------------------------+

Key Architectural Decisions

  • Plugin Architecture: Search methods, tasks, and LLM backends are all registered via a plugin system. New components can be added without modifying core platform code.
  • Shared Evaluation: All search methods use the same evaluation engine, ensuring identical sandboxing, timing, and scoring across methods.
  • Checkpoint/Resume: The orchestrator periodically checkpoints the entire state (population, history, random seeds), enabling resumable runs after crashes or interruptions.
  • Process Isolation: Each evaluation runs in a separate process with strict timeout enforcement, preventing infinite loops or memory leaks from crashing the main process.
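The evaluation engine's content-hash caching (shown in the diagram above) boils down to: hash the candidate's normalized source text, and reuse the stored score when two LLM samples produce identical code. A minimal sketch; `EvalCache` is illustrative, not an LLM4AD class:

```python
import hashlib

class EvalCache:
    """Memoize evaluation scores by a hash of the candidate's source code,
    so textually identical candidates are only executed once."""

    def __init__(self, evaluate):
        self._evaluate = evaluate   # expensive scoring function: code -> float
        self._cache = {}
        self.hits = 0

    @staticmethod
    def _key(code: str) -> str:
        # Normalize trivial whitespace differences before hashing.
        normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def __call__(self, code: str) -> float:
        key = self._key(code)
        if key in self._cache:
            self.hits += 1
        else:
            self._cache[key] = self._evaluate(code)
        return self._cache[key]
```

Because LLMs frequently resample near-duplicate programs, even this simple layer can save a large fraction of evaluation time on cheap tasks.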

4. Search Methods (7 Algorithms)

4.1 EoH — Evolution of Heuristics (ICML 2024)

**Paper:** "Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language Model"

**Venue:** `ICML 2024`

EoH evolves both the algorithmic idea (natural language description) and the code implementation simultaneously. This dual representation enables the LLM to reason about high-level algorithmic concepts while grounding them in executable code.

Key Mechanism: Thought-Code Co-Evolution

  • Each individual in the population is a (thought, code) pair
  • Mutation operates on both: the thought is mutated first, then the code is updated to match
  • Crossover combines thoughts from two parents, then generates code for the combined thought
  • Selection is based on code execution performance, but thought quality influences mutation quality

from llm4ad.methods import EoH

method = EoH(
    population_size=20,            # Population of (thought, code) pairs
    num_generations=50,            # Number of evolutionary generations
    crossover_rate=0.3,            # Probability of crossover vs mutation
    mutation_rate=0.7,             # Probability of mutation
    elite_size=3,                  # Top-3 preserved across generations
    thought_mutation_temp=0.9,     # Higher temp for thought diversity
    code_mutation_temp=0.7,        # Lower temp for code precision
    tournament_size=3,             # Tournament selection size
)

EoH Operators

| Operator | Description | Input | Output |
|---|---|---|---|
| e1: Thought Mutation | LLM modifies the algorithmic idea | Parent thought + task description | New thought + corresponding code |
| e2: Code Mutation | LLM modifies the code while preserving the thought | Parent thought + parent code | Same thought + modified code |
| c1: Thought Crossover | LLM combines ideas from two parents | Two parent thoughts | New thought + corresponding code |
| c2: Code Crossover | LLM combines code from two parents | Two parent (thought, code) pairs | Combined thought + combined code |
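The configuration above uses tournaments of size 3 for selection. A minimal sketch of tournament selection over (thought, code) individuals; the `Individual` dataclass is illustrative, not LLM4AD's internal type:

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    """A (thought, code) pair as evolved by EoH (illustrative sketch)."""
    thought: str
    code: str
    score: float

def tournament_select(population, k=3, rng=random):
    """Pick k individuals uniformly at random; return the fittest of them.
    Larger k increases selection pressure toward high-scoring parents."""
    contenders = rng.sample(population, k)
    return max(contenders, key=lambda ind: ind.score)
```

Tournament selection is a common default here because it needs no global ranking and its pressure is tuned by a single integer.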

4.2 MEoH — Multi-objective Evolution of Heuristics (AAAI 2025)

**Paper:** "Multi-objective Evolution of Heuristics Using Large Language Models"

**Venue:** `AAAI 2025`

MEoH extends EoH to handle multiple conflicting objectives simultaneously, maintaining a Pareto frontier of non-dominated (thought, code) pairs.

Key Extensions over EoH

  • Multi-objective evaluation: Each candidate is scored on multiple metrics (e.g., solution quality + runtime)
  • Pareto-based selection: Non-dominated sorting + crowding distance for diverse frontier exploration
  • Objective-aware mutation: The LLM is told which objectives are lagging and asked to focus improvements there
  • Hypervolume tracking: Progress measured by hypervolume indicator rather than single best score

from llm4ad.methods import MEoH

method = MEoH(
    population_size=30,
    num_generations=100,
    objectives=["quality", "runtime"],
    directions=["maximize", "minimize"],
    crossover_rate=0.3,
    mutation_rate=0.7,
    reference_point=[0.0, 1000.0],   # For hypervolume computation
    objective_focus_strategy="lagging", # Focus mutation on lagging objectives
)
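The Pareto-based selection above rests on a dominance test. A minimal, generic sketch (not MEoH's implementation) for an objective pair like quality-maximize / runtime-minimize:

```python
def dominates(a, b, directions):
    """True if candidate a is at least as good as b on every objective
    and strictly better on at least one. a and b are score tuples;
    directions is e.g. ("maximize", "minimize")."""
    better_or_equal = True
    strictly_better = False
    for x, y, d in zip(a, b, directions):
        if d == "minimize":
            x, y = -x, -y          # flip so "higher is better" uniformly
        if x < y:
            better_or_equal = False
        elif x > y:
            strictly_better = True
    return better_or_equal and strictly_better

def non_dominated_front(scores, directions):
    """Indices of candidates not dominated by any other candidate."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(t, s, directions)
                   for j, t in enumerate(scores) if j != i)
    ]
```

Non-dominated sorting repeatedly peels off such fronts; crowding distance then breaks ties within a front to keep the frontier spread out.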

4.3 FunSearch (Nature 2024)

**Paper:** "Mathematical discoveries from program search with large language models"

**Venue:** `Nature 2024`

FunSearch, developed by DeepMind, uses a best-shot sampling approach: generate many candidates from the LLM, evaluate all of them, and keep only the best. It uses an island model with periodic migration to maintain population diversity.

FunSearch Architecture

  • Sampler Pool: Multiple LLM samplers generate candidates in parallel
  • Programs Database: Island-structured database storing candidates, organized by score tiers
  • Evaluator Pool: Parallel evaluators score candidates, with strict timeouts
  • Best-shot Strategy: From each island, the top-k candidates are selected as prompting examples for the LLM

from llm4ad.methods import FunSearch

method = FunSearch(
    num_islands=10,                # Number of islands
    programs_per_island=50,        # Population per island
    num_samplers=4,                # Parallel LLM samplers
    samples_per_prompt=4,          # Candidates per LLM call
    temperature=1.0,               # High temperature for diversity
    top_k_for_prompt=2,            # Best-k examples in prompt
    reset_period=100,              # Reset worst island every N evals
    migration_interval=50,         # Migrate between islands
)
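The island bookkeeping (per-island top-k prompting, periodic reset of the worst island) can be sketched as follows; this is illustrative of the idea, not DeepMind's or LLM4AD's code, and assumes every island already holds at least one program:

```python
class IslandDB:
    """Minimal island-structured program database: each island holds
    (score, program) pairs sorted best-first; the worst island is
    periodically reset and reseeded from the best island's top program."""

    def __init__(self, num_islands, capacity):
        self.islands = [[] for _ in range(num_islands)]
        self.capacity = capacity

    def add(self, island, score, program):
        self.islands[island].append((score, program))
        # Keep only the best `capacity` programs per island.
        self.islands[island].sort(key=lambda p: p[0], reverse=True)
        del self.islands[island][self.capacity:]

    def top_k(self, island, k):
        """Best-shot examples to place in the next prompt."""
        return self.islands[island][:k]

    def reset_worst(self):
        """Reseed the weakest island from the strongest one's champion."""
        best = max(range(len(self.islands)), key=lambda i: self.islands[i][0][0])
        worst = min(range(len(self.islands)), key=lambda i: self.islands[i][0][0])
        if best != worst:
            self.islands[worst] = [self.islands[best][0]]
```

Isolating islands preserves diverse lineages; occasional resets and migrations let good genes spread without collapsing the whole population onto one design.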

4.4 ReEvo — Reflective Evolution (NeurIPS 2024)

**Paper:** "ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution"

**Venue:** `NeurIPS 2024`

ReEvo introduces reflective evolution where the LLM not only generates mutations but also reflects on why previous candidates succeeded or failed. This self-reflective capability produces more targeted and effective mutations.

Reflection Loop

  1. Generate: LLM produces a candidate heuristic
  2. Evaluate: Run the candidate on the benchmark
  3. Reflect: LLM analyzes the evaluation results, identifying strengths and weaknesses
  4. Refine: LLM uses the reflection to propose a targeted improvement
  5. Repeat: The refined candidate becomes the new parent

from llm4ad.methods import ReEvo

method = ReEvo(
    population_size=15,
    num_generations=60,
    reflection_depth=2,            # Number of reflection iterations per mutation
    short_term_memory=5,           # Recent reflections kept in context
    long_term_memory=20,           # Total reflections stored
    reflection_temperature=0.7,    # Temperature for reflection LLM calls
    mutation_temperature=0.8,      # Temperature for mutation LLM calls
)

Reflection Prompt Structure

# ReEvo reflection prompt (simplified)
"""
## Current Heuristic
{code}

## Evaluation Results
Score: {score}
Test case breakdown:
{per_case_results}

## Previous Reflections
{recent_reflections}

## Task
1. Analyze WHY this heuristic scored {score}
2. Identify the specific weakness causing failures
3. Propose a concrete improvement strategy
4. Explain how the improvement addresses the weakness

## Your Reflection:
"""

4.5 MCTS-AHD — Monte Carlo Tree Search for Algorithm Design (ICML 2025)

**Paper:** "Monte Carlo Tree Search for Automatic Heuristic Design"

**Venue:** `ICML 2025`

MCTS-AHD frames algorithm design as a sequential decision problem and uses Monte Carlo Tree Search to explore the space of algorithmic building blocks. Each node in the tree represents an algorithmic component choice.

MCTS Formulation

  • State: Partially specified algorithm (sequence of component choices so far)
  • Action: Choose the next algorithmic component (data structure, operator, control flow)
  • Reward: Evaluation score of the completed algorithm
  • UCB Selection: Balance exploration of new component choices vs exploitation of known-good paths

$$ \mathrm{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} $$

Where Q(s,a) is the average reward for taking action a in state s, N(s) is the visit count for state s, N(s,a) is the visit count for action a in state s, and c is the exploration constant.
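The UCB rule translates directly into code. A minimal sketch (not MCTS-AHD's implementation), with the usual convention that unvisited actions are expanded first:

```python
import math

def ucb(q, n_state, n_action, c=1.414):
    """UCB(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a)).
    Unvisited actions get +inf so they are always tried at least once."""
    if n_action == 0:
        return math.inf
    return q + c * math.sqrt(math.log(n_state) / n_action)

def select_action(stats, c=1.414):
    """stats maps action -> (q, n_action); N(s) is the sum of child visits."""
    n_state = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: ucb(stats[a][0], n_state, stats[a][1], c))
```

With a large exploration constant, a rarely tried component can outscore a well-exploited one, which is exactly the exploration/exploitation trade the text describes.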

from llm4ad.methods import MCTSAHD

method = MCTSAHD(
    max_depth=8,                   # Maximum tree depth (algorithm complexity)
    num_simulations=200,           # MCTS simulations per step
    exploration_constant=1.414,    # UCB exploration parameter (sqrt(2))
    expansion_width=5,             # Number of children per expansion
    rollout_policy="llm",          # Use LLM for rollout (vs random)
    backprop_strategy="max",       # Backpropagate max score (vs average)
)

4.6 (1+1)-EPS — Evolutionary Program Search (PPSN 2024)

**Paper:** "(1+1)-Evolutionary Program Search"

**Venue:** `PPSN 2024`

(1+1)-EPS is the simplest method: it maintains a single solution and iteratively mutates it, keeping the mutation only if it improves the score. Despite its simplicity, it is surprisingly effective for well-defined optimization tasks.

Algorithm

# (1+1)-EPS pseudocode
def eps_search(initial_solution, evaluate, llm, max_iterations):
    current = initial_solution
    current_score = evaluate(current)

    for i in range(max_iterations):
        # Mutate the current solution using LLM
        mutant = llm.mutate(current, feedback=get_feedback(current))

        # Evaluate the mutant
        mutant_score = evaluate(mutant)

        # Accept if not worse (the (1+1)-EA convention accepts ties)
        if mutant_score >= current_score:
            current = mutant
            current_score = mutant_score
            log(f"Iteration {i}: improved to {current_score}")

    return current, current_score

from llm4ad.methods import EPS

method = EPS(
    max_iterations=200,            # Total mutation attempts
    mutation_temperature=0.8,      # LLM temperature for mutations
    include_feedback=True,         # Include eval feedback in mutation prompt
    feedback_window=5,             # Include last 5 mutations in context
    restart_on_stagnation=True,    # Restart from best-so-far after N stagnant iters
    stagnation_threshold=30,       # Iterations without improvement to trigger restart
)

4.7 LLaMEA (IEEE TEVC 2025)

**Paper:** "LLaMEA: A Large Language Model Evolutionary Algorithm for Automatically Generating Metaheuristics"

**Venue:** `IEEE TEVC 2025`

LLaMEA generates entire metaheuristic algorithms (not just heuristics for specific problems). The LLM is prompted to generate complete evolutionary or swarm-based algorithms, which are then evaluated on a portfolio of benchmark problems.

Key Difference from Other Methods

While EoH/ReEvo evolve problem-specific heuristics, LLaMEA evolves general-purpose optimization algorithms that can be applied to any problem. The output is a complete metaheuristic (like a new variant of PSO or DE) rather than a heuristic for TSP.

from llm4ad.methods import LLaMEA

method = LLaMEA(
    population_size=10,
    num_generations=30,
    algorithm_template="metaheuristic",  # Generate full metaheuristics
    benchmark_portfolio=[                # Evaluate on multiple problems
        "sphere_d10", "rastrigin_d10",
        "rosenbrock_d10", "ackley_d10",
    ],
    aggregation="geometric_mean",        # Aggregate scores across benchmarks
    mutation_strategy="component_swap",  # Swap algorithmic components
)
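The `aggregation="geometric_mean"` setting combines per-benchmark scores. The geometric mean is a deliberate choice here: it penalizes a metaheuristic that excels on one function but collapses on another far more than an arithmetic mean would. A minimal sketch:

```python
import math

def geometric_mean(scores):
    """Aggregate positive per-benchmark scores into a single fitness value.
    Computed in log space for numerical stability."""
    assert all(s > 0 for s in scores), "geometric mean needs positive scores"
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```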

Search Methods Comparison

| Method | Venue | Population | Key Mechanism | Multi-Obj | Reflection | Best For |
|---|---|---|---|---|---|---|
| EoH | ICML 2024 | 20–50 | Thought-code co-evolution | No | No | General heuristics |
| MEoH | AAAI 2025 | 30–100 | Pareto + objective-aware mutation | Yes | No | Multi-objective problems |
| FunSearch | Nature 2024 | Islands | Best-shot sampling + islands | No | No | Mathematical discovery |
| ReEvo | NeurIPS 2024 | 10–30 | Self-reflective evolution | No | Yes | Complex heuristics |
| MCTS-AHD | ICML 2025 | Tree | UCB-guided component selection | No | No | Compositional algorithms |
| (1+1)-EPS | PPSN 2024 | 1 | Greedy hill climbing | No | Partial | Quick prototyping |
| LLaMEA | IEEE TEVC 2025 | 10–20 | Metaheuristic generation | No | No | Algorithm generation |

5. Supported Tasks & Benchmarks

Combinatorial Optimization

| Task | Description | Metric | Instances |
|---|---|---|---|
| Bin Packing | Pack items into fixed-capacity bins minimizing waste | Waste ratio (lower is better) | Falkenauer, Scholl, random (50–5000 items) |
| TSP | Traveling Salesman Problem: find the shortest tour | Tour length (lower is better) | TSPLIB (14–2392 cities), random |
| Facility Location | Place facilities to minimize total transportation cost | Total cost (lower is better) | OR-Library, random (10–500 facilities) |
| Knapsack | Select items maximizing value within weight capacity | Total value (higher is better) | Pisinger, random (50–10000 items) |
| QAP | Quadratic Assignment: assign facilities to locations | Total flow × distance (lower is better) | QAPLIB (12–256 facilities) |
| Scheduling | Job-shop and flow-shop scheduling | Makespan (lower is better) | Taillard, random (10×5 to 100×20) |
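For concreteness, here is the kind of hand-written baseline these tasks score candidates against: first-fit for bin packing, with the waste-ratio metric. Illustrative only, not a built-in of LLM4AD:

```python
def first_fit(items, capacity=1.0):
    """Place each item into the first bin with room; open a new bin otherwise.
    Evolved heuristics compete against simple baselines like this one."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity + 1e-9:   # tolerance for float sums
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

def waste_ratio(bins, capacity=1.0):
    """Fraction of total opened capacity left unused (lower is better)."""
    used = sum(sum(b) for b in bins)
    return 1.0 - used / (len(bins) * capacity)
```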

Machine Learning & AI

| Task | Description | Metric | Instances |
|---|---|---|---|
| BO Acquisition | Design acquisition functions for Bayesian Optimization | Regret (lower is better) | Branin, Hartmann, Levy functions |
| RL Environments | Design reward shaping or policy heuristics | Cumulative reward (higher is better) | CartPole, MountainCar, LunarLander |

Scientific Computing

| Task | Description | Metric | Instances |
|---|---|---|---|
| CFD Turbulence | Design turbulence models for computational fluid dynamics | Prediction error vs DNS (lower is better) | Channel flow, flat plate, backward-facing step |
| Bacteria Growth | Design growth rate models for bacterial colonies | Fit to experimental data (higher R²) | E. coli, S. cerevisiae datasets |

Mathematical Discovery

| Task | Description | Metric | Notes |
|---|---|---|---|
| Circle Packing | Pack n circles in minimum enclosing circle | Ratio of sum of radii to enclosing radius (higher is better) | n=5 to n=30 |
| Cap Set Discovery | Find large cap sets in GF(3)^n | Cap set size (higher is better) | n=4 to n=8 |
| Extremal Combinatorics | Construct extremal graph colorings | Problem-specific score | Various open problems |

6. LLM Backend Support

| Backend | Models | Type | Configuration |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | Cloud API | `OpenAIBackend(model="gpt-4o", api_key="...")` |
| Google | Gemini Pro, Gemini Flash, Gemini 2.5 | Cloud API | `GoogleBackend(model="gemini-2.0-pro", api_key="...")` |
| DeepSeek | DeepSeek V3, DeepSeek R1 | Cloud API | `DeepSeekBackend(model="deepseek-v3", api_key="...")` |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Cloud API | `AnthropicBackend(model="claude-sonnet-4-20250514", api_key="...")` |
| Local (vLLM) | Llama 3.1, Gemma 2, Mistral, Qwen | Local | `VLLMBackend(model_path="meta-llama/Llama-3.1-70B")` |
| Local (Ollama) | Any Ollama-supported model | Local | `OllamaBackend(model="llama3.1:70b")` |

Custom Backend Integration

from llm4ad.llm import LLMBackend, LLMResponse

class MyBackend(LLMBackend):
    """Custom LLM backend for LLM4AD."""

    async def generate(self, prompt: str, **kwargs) -> LLMResponse:
        response = await my_api(prompt, **kwargs)
        return LLMResponse(
            text=response.text,
            tokens_used=response.total_tokens,
            model=self.model_name,
        )

    def get_model_info(self) -> dict:
        return {
            "name": self.model_name,
            "context_window": 128000,
            "supports_system_prompt": True,
        }

7. Platform Features

GUI Interface

LLM4AD includes a web-based GUI for interactive experimentation:

  • Method Configuration: Visual parameter tuning for each search method
  • Real-time Monitoring: Live convergence plots, population diversity metrics, and cost tracking
  • Result Comparison: Side-by-side comparison of multiple runs
  • Code Browser: Inspect generated algorithms with syntax highlighting
  • Export: Export results as CSV, JSON, or LaTeX tables

Logging Integration

# Weights & Biases logging
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to="wandb",
    project_name="my-experiment",
    run_name="eoh-binpacking-v1",
    tags=["eoh", "binpacking", "gpt4o"],
)

# TensorBoard logging
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to="tensorboard",
    log_dir="./runs/experiment-1",
)

# Both simultaneously
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    log_to=["wandb", "tensorboard"],
)

Multiprocessing with Timeout

runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=8,            # 8 parallel evaluation workers
    timeout_per_eval=60,      # Kill evaluation after 60 seconds
    memory_limit_mb=2048,     # 2GB memory limit per evaluation
    retry_on_timeout=True,    # Retry timed-out evaluations once
    gpu_allocation={          # GPU assignment per worker
        0: [0, 1],           # Workers 0-1 use GPU 0
        1: [2, 3],           # Workers 2-3 use GPU 1
    },
)

Resumable Runs

# Start a run with checkpointing
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    checkpoint_dir="./checkpoints/exp-1",
    checkpoint_interval=10,     # Checkpoint every 10 generations
)
result = runner.run()

# Resume from checkpoint after crash/interruption
runner = LLM4AD.resume("./checkpoints/exp-1")
result = runner.run()  # Continues from last checkpoint
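The checkpoint pattern reduces to persisting the population, history, and RNG state so a resumed run replays the same random draws. A sketch of the idea only; the function names and on-disk format here are illustrative, not LLM4AD's:

```python
import pickle
import random
from pathlib import Path

def save_checkpoint(path, population, history, rng=random):
    """Persist everything needed to continue a run deterministically."""
    state = {
        "population": population,
        "history": history,
        "rng_state": rng.getstate(),   # so resumed runs replay the same draws
    }
    Path(path).write_bytes(pickle.dumps(state))

def load_checkpoint(path, rng=random):
    """Restore population, history, and the exact RNG state at save time."""
    state = pickle.loads(Path(path).read_bytes())
    rng.setstate(state["rng_state"])
    return state["population"], state["history"]
```

Capturing the RNG state is the detail that makes resumed runs bit-for-bit continuations rather than merely "warm restarts".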

8. API Reference & Code Examples

Core Classes

> [!info]- LLM4AD - Main Orchestrator
> ```
> class LLM4AD:
>     def __init__(
>         self,
>         method: SearchMethod,           # Search algorithm to use
>         task: Task,                     # Optimization task
>         llm: LLMBackend,                # LLM backend
>         max_workers: int = 4,           # Parallel evaluation workers
>         timeout_per_eval: int = 60,     # Evaluation timeout (seconds)
>         log_to: str | list = None,      # Logging backend(s)
>         checkpoint_dir: str = None,     # Directory for checkpoints
>         checkpoint_interval: int = 10,  # Checkpoint frequency
>         seed: int = 42,                 # Random seed
>         verbose: bool = True,           # Print progress
>     ):
>         ...
>
>     def run(self) -> OptimizationResult:
>         """Run the optimization until stopping criteria are met."""
>         ...
>
>     @classmethod
>     def resume(cls, checkpoint_dir: str) -> "LLM4AD":
>         """Resume from a checkpoint."""
>         ...
> ```

> [!info]- Task - Problem Definition
> ```
> class Task:
>     """Base class for optimization tasks."""
>
>     def __init__(
>         self,
>         name: str,                         # Task name
>         instance: str | dict,              # Problem instance data
>         metric: str,                       # Primary metric name
>         direction: str = "minimize",       # "minimize" or "maximize"
>         function_signature: str = None,    # Required function signature
>         imports: list[str] = None,         # Allowed imports
>         timeout: int = 30,                 # Per-evaluation timeout
>     ):
>         ...
>
>     def evaluate(self, code: str) -> EvalResult:
>         """Evaluate a candidate algorithm."""
>         ...
>
>     def get_prompt_context(self) -> str:
>         """Return task description for LLM prompts."""
>         ...
> ```

> [!info]- OptimizationResult - Output
> ```
> class OptimizationResult:
>     best_code: str                     # Best algorithm code
>     best_score: float                  # Best score achieved
>     best_thought: str | None           # Best algorithmic idea (EoH/MEoH)
>     history: list[EvaluationRecord]    # Full evaluation history
>     population: list[Individual]       # Final population
>     stats: RunStats                    # Runtime statistics
>
>     def plot_convergence(self, save_path: str = None):
>         """Plot convergence curve."""
>         ...
>
>     def plot_population_diversity(self, save_path: str = None):
>         """Plot population diversity over time."""
>         ...
>
>     def export_csv(self, path: str):
>         """Export history to CSV."""
>         ...
>
>     def export_latex_table(self) -> str:
>         """Generate LaTeX table of results."""
>         ...
> ```

Full Example: Comparing Methods on TSP

from llm4ad import LLM4AD
from llm4ad.methods import EoH, FunSearch, ReEvo, MCTSAHD, EPS
from llm4ad.tasks import TSP
from llm4ad.llm import OpenAIBackend

llm = OpenAIBackend(model="gpt-4o")
task = TSP(instance="tsplib/eil51", metric="tour_length", direction="minimize")

methods = {
    "EoH": EoH(population_size=20, num_generations=50),
    "FunSearch": FunSearch(num_islands=5, programs_per_island=20),
    "ReEvo": ReEvo(population_size=15, num_generations=60),
    "MCTS-AHD": MCTSAHD(max_depth=8, num_simulations=200),
    "(1+1)-EPS": EPS(max_iterations=200),
}

results = {}
for name, method in methods.items():
    print(f"\n--- Running {name} ---")
    runner = LLM4AD(method=method, task=task, llm=llm, max_workers=4)
    results[name] = runner.run()
    print(f"{name}: best tour length = {results[name].best_score:.2f}")

# Compare results
print("\n=== Comparison ===")
for name, result in sorted(results.items(), key=lambda x: x[1].best_score):
    print(f"{name:15s}: {result.best_score:.2f} (evals: {result.stats.total_evaluations})")

9. Results & Benchmarks

Bin Packing Results (Falkenauer Triplet Instances)

| Method | LLM | Waste (%) | Gap to BKS (%) | Evaluations | Time (min) |
|---|---|---|---|---|---|
| EoH | GPT-4o | 1.23 | 0.08 | 1,000 | 45 |
| MEoH | GPT-4o | 1.20 | 0.05 | 1,500 | 72 |
| FunSearch | GPT-4o | 1.35 | 0.20 | 5,000 | 180 |
| ReEvo | GPT-4o | 1.18 | 0.03 | 900 | 55 |
| MCTS-AHD | GPT-4o | 1.25 | 0.10 | 800 | 60 |
| (1+1)-EPS | GPT-4o | 1.30 | 0.15 | 200 | 20 |
| LLaMEA | GPT-4o | 1.40 | 0.25 | 300 | 35 |
| Best Known | | 1.15 | 0.00 | | |

TSP Results (TSPLIB eil51, att48, kroA100)

| Method | eil51 Gap% | att48 Gap% | kroA100 Gap% | Avg Gap% |
|---|---|---|---|---|
| EoH | 2.1 | 1.8 | 3.5 | 2.47 |
| ReEvo | 1.8 | 1.5 | 2.9 | 2.07 |
| MCTS-AHD | 2.0 | 1.7 | 3.2 | 2.30 |
| FunSearch | 2.5 | 2.2 | 4.1 | 2.93 |
| (1+1)-EPS | 2.8 | 2.5 | 4.5 | 3.27 |

LLM Comparison (EoH on Bin Packing)

| LLM | Waste (%) | Cost per Run | Eval Rate | Success Rate |
|---|---|---|---|---|
| GPT-4o | 1.23 | $15.20 | 22 eval/min | 85% |
| Claude Sonnet 4 | 1.21 | $18.50 | 18 eval/min | 88% |
| Gemini 2.0 Pro | 1.25 | $12.30 | 25 eval/min | 82% |
| DeepSeek V3 | 1.28 | $3.50 | 15 eval/min | 78% |
| Llama 3.1 70B (local) | 1.45 | $0 (GPU cost) | 8 eval/min | 65% |

10. Circle Packing World Record

**World Record:** LLM4AD achieved a new world record for circle packing at n=26, with a score of **2.63594**, surpassing previous best results from AlphaEvolve and classical optimization methods.

Problem Definition

Given n unit circles, find the arrangement that minimizes the radius of the enclosing circle. Equivalently, maximize the ratio of the sum of radii to the enclosing radius.

$$ \text{maximize} \quad R = \frac{\sum_{i=1}^{n} r_i}{r_{\text{enclosing}}} $$
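A sketch of how the packing-ratio metric could be computed, under the simplifying assumption that the enclosing circle is centered at the origin; the task's exact evaluator would also have to find the true smallest enclosing circle and verify that no two circles overlap:

```python
import math

def packing_ratio(circles):
    """circles: list of (x, y, r). Sum of radii divided by the radius of
    the smallest ORIGIN-CENTERED circle containing all of them (sketch of
    the metric only; overlap checking is done separately)."""
    r_enclosing = max(math.hypot(x, y) + r for x, y, r in circles)
    return sum(r for _, _, r in circles) / r_enclosing

def is_valid(circles, eps=1e-9):
    """A packing is valid only if no two circles overlap."""
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - eps:
                return False
    return True
```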

Method Configuration

from llm4ad import LLM4AD
from llm4ad.methods import FunSearch
from llm4ad.tasks import CirclePacking
from llm4ad.llm import GoogleBackend

task = CirclePacking(
    n=26,
    metric="packing_ratio",
    direction="maximize",
    evaluator="exact",             # Use exact geometric computation
    timeout_per_eval=120,          # Allow longer evals for large n
)

method = FunSearch(
    num_islands=15,
    programs_per_island=100,
    num_samplers=8,
    samples_per_prompt=8,
    temperature=1.0,
    top_k_for_prompt=3,
    reset_period=200,
)

llm = GoogleBackend(model="gemini-2.0-pro")
runner = LLM4AD(method=method, task=task, llm=llm, max_workers=16)
result = runner.run()

print(f"Best packing ratio: {result.best_score:.5f}")  # 2.63594

Circle Packing Results by n

| n | Previous Best | LLM4AD Result | Outcome | Evaluations |
|---|---|---|---|---|
| 10 | 2.38660 | 2.38660 | Matched optimal | 500 |
| 15 | 2.47540 | 2.47540 | Matched optimal | 1,200 |
| 20 | 2.52040 | 2.52042 | Slight improvement | 3,000 |
| 22 | 2.56287 | 2.56290 | Improvement | 5,000 |
| 24 | 2.60240 | 2.60248 | Improvement | 8,000 |
| 26 | 2.63590 | 2.63594 | World record | 15,000 |

11. Comparison with Other Platforms

| Feature | LLM4AD | AlphaEvolve | OpenEvolve | GEPA | ShinkaEvolve |
|---|---|---|---|---|---|
| Open Source | MIT | No | Yes | Yes | Apache 2.0 |
| Search Methods | 7 methods | 1 (custom) | 1 (custom) | 1 (Pareto) | 1 (custom) |
| Built-in Tasks | 12+ tasks | Custom only | Custom only | Custom only | Custom only |
| LLM Backends | 6 backends | Google only | Multi | Multi | Multi |
| GUI | Yes | No | No | No | Yes |
| Multi-Objective | MEoH | Weighted | Weighted | Pareto | No |
| Resumable | Yes | Yes | Yes | Yes | Yes |
| W&B/TB Logging | Both | No | No | No | Custom |

12. Limitations & Future Work

Current Limitations

  • Python Only: Generated algorithms are restricted to Python. No support for C++, Rust, or Julia algorithms that could be significantly faster.
  • Limited Task Complexity: Built-in tasks cover standard benchmarks but not full-scale industrial problems with complex constraints.
  • LLM Cost: Running 7 methods across many tasks for fair comparison is expensive. A complete benchmark suite can cost hundreds of dollars in LLM API calls.
  • No Cross-Method Ensembling: Currently no mechanism to combine insights from multiple search methods during a single run.
  • Evaluation Bottleneck: For expensive-to-evaluate tasks (CFD, RL), evaluation rather than LLM calls becomes the bottleneck.

Planned Features

  • Method Ensembling: Run multiple search methods in parallel, sharing promising candidates across methods.
  • Distributed Execution: Distribute evaluation across multiple machines via Ray or Dask.
  • Auto-Method Selection: Automatically select the best search method for a given task based on problem characteristics.
  • Multi-Language Support: Generate and evaluate algorithms in C++, Julia, and Rust for performance-critical applications.
  • Benchmark Hub: Community-contributed tasks and benchmark results, similar to HuggingFace model hub.
  • Meta-Learning: Learn from past optimization runs to warm-start new runs on similar problems.

**Summary:** LLM4AD is the most comprehensive platform for LLM-based automatic algorithm design. Its 7 integrated search methods, 12+ built-in tasks, multiple LLM backends, GUI, and production features (logging, resumability, multiprocessing) make it the go-to platform for both researchers comparing methods and practitioners solving real-world optimization problems.
