LLM4AD
Unified Open-Source Platform for LLM-based Automatic Algorithm Design
Authors: City University of Hong Kong & Southern University of Science and Technology
Repository: github.com/Optima-CityU/llm4ad
License: MIT
Python: 3.9–3.12
**Key Contribution:** LLM4AD provides a unified platform integrating 7 state-of-the-art search methods for automatic algorithm design, supporting 12+ built-in tasks (combinatorial optimization, machine learning, and scientific computing) and multiple LLM backends. It achieved a world record for circle packing at n=26 with a score of 2.63594.
Table of Contents
- Overview & Motivation
- Installation & Quick Start
- Platform Architecture
- Search Methods (7 Algorithms)
- EoH — Evolution of Heuristics (ICML 2024)
- MEoH — Multi-objective EoH (AAAI 2025)
- FunSearch (Nature 2024)
- ReEvo — Reflective Evolution (NeurIPS 2024)
- MCTS-AHD (ICML 2025)
- (1+1)-EPS (PPSN 2024)
- LLaMEA (IEEE TEVC 2025)
- Supported Tasks & Benchmarks
- LLM Backend Support
- Platform Features
- API Reference & Code Examples
- Results & Benchmarks
- Circle Packing World Record
- Comparison with Other Platforms
- Limitations & Future Work
1. Overview & Motivation
LLM4AD (Large Language Models for Automatic Algorithm Design) addresses the fragmentation problem in the LLM-for-optimization research landscape. Each published method (EoH, FunSearch, ReEvo, etc.) comes with its own codebase, API, evaluation harness, and LLM integration. Researchers wanting to compare methods must re-implement or adapt each system independently. LLM4AD provides a unified platform where all methods share:
- A common interface for defining optimization problems (tasks)
- A unified LLM abstraction layer supporting GPT-4o, Gemini Pro, DeepSeek, and local models (Llama, Gemma)
- Shared evaluation infrastructure with multiprocessing, timeouts, and caching
- Consistent logging and visualization via W&B and TensorBoard
- A graphical user interface (GUI) for interactive experimentation
- Resumable runs with checkpoint/restart capability
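The caching in the shared evaluation layer can be illustrated with a small sketch: scores are keyed by a hash of the candidate's normalized source, so syntactically identical candidates are never re-evaluated. The `EvalCache` class below is illustrative only, not the platform's actual implementation.

```python
import hashlib

class EvalCache:
    """Illustrative content-hash cache for candidate evaluation scores."""

    def __init__(self):
        self._scores = {}

    @staticmethod
    def key(code: str) -> str:
        # Normalize trailing whitespace so trivially reformatted
        # duplicates still hit the cache.
        normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_eval(self, code, evaluate):
        """Return the cached score for `code`, evaluating it only once."""
        k = self.key(code)
        if k not in self._scores:
            self._scores[k] = evaluate(code)
        return self._scores[k]
```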
Design Philosophy
Modular & Extensible
Each search method, task, and LLM backend is a pluggable module. Adding a new search method requires implementing a single interface. Adding a new task requires defining an evaluation function and problem specification.
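For instance, a new task reduces to subclassing the platform's `Task` base class (see the API reference) with an `evaluate` and a `get_prompt_context` method. The toy vertex-cover task below is a hedged sketch: the minimal base-class stub and the candidate function name `select_cover` are illustrative assumptions, not the platform's actual API surface.

```python
class Task:
    """Minimal stand-in for the platform's Task base class (illustrative)."""

    def __init__(self, name, metric, direction="minimize"):
        self.name, self.metric, self.direction = name, metric, direction

    def evaluate(self, code: str) -> float:
        raise NotImplementedError

    def get_prompt_context(self) -> str:
        raise NotImplementedError


class VertexCover(Task):
    """Toy task: score a candidate cover heuristic on one small graph."""

    EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]

    def __init__(self):
        super().__init__(name="vertex_cover", metric="cover_size")

    def evaluate(self, code: str) -> float:
        # Execute the candidate, which must define select_cover(edges).
        namespace = {}
        exec(code, namespace)
        cover = namespace["select_cover"](self.EDGES)
        # Infeasible covers (an uncovered edge) get a penalty score.
        if any(u not in cover and v not in cover for u, v in self.EDGES):
            return float("inf")
        return float(len(cover))

    def get_prompt_context(self) -> str:
        return ("Write select_cover(edges) returning a set of vertices "
                "covering every edge; smaller sets score better.")
```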
Reproducible & Fair
All methods run under identical conditions — same LLM, same evaluation budget, same hardware. This enables fair apples-to-apples comparison that is impossible when methods use different codebases.
Practical & Production-Ready
Beyond research, LLM4AD is designed for real-world use. Multiprocessing, timeout handling, GPU-aware scheduling, and resumable runs make it suitable for long-running optimization campaigns.
Educational & Accessible
The GUI mode and comprehensive documentation lower the barrier to entry. Researchers can experiment with different methods without deep understanding of each algorithm's internals.
2. Installation & Quick Start
Installation
# Basic installation
pip install llm4ad
# With GUI support
pip install llm4ad[gui]
# With all LLM backends
pip install llm4ad[all]
# Development installation
git clone https://github.com/Optima-CityU/llm4ad.git
cd llm4ad
pip install -e ".[dev]"
Quick Start: Bin Packing with EoH
from llm4ad import LLM4AD, EoH, BinPacking
from llm4ad.llm import OpenAIBackend
# Configure LLM backend
llm = OpenAIBackend(model="gpt-4o", api_key="sk-...")
# Define the task
task = BinPacking(
    instance="benchmark/bpp_500_items",
    metric="waste_ratio",
    direction="minimize",
)
# Select search method
method = EoH(
    population_size=20,
    num_generations=50,
    crossover_rate=0.3,
    mutation_rate=0.7,
    elite_size=3,
)
# Run optimization
runner = LLM4AD(
    method=method,
    task=task,
    llm=llm,
    max_workers=4,
    timeout_per_eval=30,
    log_to="wandb",  # Log to Weights & Biases
    project_name="llm4ad-bpp",
)
result = runner.run()
print(f"Best waste ratio: {result.best_score:.4f}")
print(f"Best algorithm:\n{result.best_code}")
GUI Mode
# Launch the graphical interface
llm4ad gui --port 8080
# Or from Python
from llm4ad.gui import launch
launch(port=8080)
3. Platform Architecture
+-----------------------------------------------------------------------------------+
| LLM4AD Platform Architecture |
+-----------------------------------------------------------------------------------+
| |
| +----------------------------------+ +--------------------------------------+ |
| | User Interface | | Configuration Layer | |
| | | | | |
| | CLI | Python API | GUI (Web) | | YAML Config | Programmatic API | |
| +----------------------------------+ +--------------------------------------+ |
| | | |
| v v |
| +------------------------------------------------------------------------+ |
| | LLM4AD Orchestrator | |
| | | |
| | +-------------------+ +-------------------+ +----------------+ | |
| | | Search Method | | Task Registry | | LLM Backend | | |
| | | Selector | | | | Manager | | |
| | | | | - BinPacking | | | | |
| | | - EoH | | - TSP | | - OpenAI | | |
| | | - MEoH | | - FacilityLoc | | - Google | | |
| | | - FunSearch | | - Knapsack | | - DeepSeek | | |
| | | - ReEvo | | - QAP | | - Local (vLLM) | | |
| | | - MCTS-AHD | | - Scheduling | | | | |
| | | - (1+1)-EPS | | - BO Acquisition | +----------------+ | |
| | | - LLaMEA | | - RL Environments | | |
| | +-------------------+ | - CFD Turbulence | | |
| | | | - Bacteria Growth | | |
| | | | - Math Discovery | | |
| | | +-------------------+ | |
| | v | | |
| | +------------------------------------------------------------------+ | |
| | | Evaluation Engine | | |
| | | | | |
| | | [Multiprocessing Pool] --> [Timeout Guard] --> [Score Collect] | | |
| | | | | | | | |
| | | max_workers=N per-eval timeout Caching | | |
| | | Process isolation Graceful kill Content hash | | |
| | +------------------------------------------------------------------+ | |
| | | | |
| | v | |
| | +------------------------------------------------------------------+ | |
| | | Logging & Visualization | | |
| | | | | |
| | | W&B Integration | TensorBoard | CSV Export | Checkpoints | | |
| | +------------------------------------------------------------------+ | |
| +------------------------------------------------------------------------+ |
+-----------------------------------------------------------------------------------+
Key Architectural Decisions
- Plugin Architecture: Search methods, tasks, and LLM backends are all registered via a plugin system. New components can be added without modifying core platform code.
- Shared Evaluation: All search methods use the same evaluation engine, ensuring identical sandboxing, timing, and scoring across methods.
- Checkpoint/Resume: The orchestrator periodically checkpoints the entire state (population, history, random seeds), enabling resumable runs after crashes or interruptions.
- Process Isolation: Each evaluation runs in a separate process with strict timeout enforcement, preventing infinite loops or memory leaks from crashing the main process.
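The isolation pattern above can be shown with a stripped-down sketch: one process per evaluation, a hard timeout, and a graceful kill. This is illustrative, not the platform's engine: it assumes a POSIX fork start method and a candidate that defines a `solve()` function, and it omits memory limits and worker pooling.

```python
import multiprocessing as mp

# Use fork so the exec'd candidate needs no pickling (POSIX-only assumption).
_ctx = mp.get_context("fork")

def _worker(code, conn):
    # Child process: run the candidate and ship its result back over a pipe.
    try:
        namespace = {}
        exec(code, namespace)
        conn.send(("ok", namespace["solve"]()))
    except Exception as exc:
        conn.send(("error", repr(exc)))
    finally:
        conn.close()

def evaluate_isolated(code, timeout=5.0):
    """Evaluate untrusted candidate code in its own process with a hard timeout."""
    parent, child = _ctx.Pipe(duplex=False)
    proc = _ctx.Process(target=_worker, args=(code, child))
    proc.start()
    if parent.poll(timeout):   # wait up to `timeout` seconds for a result
        status, payload = parent.recv()
    else:
        status, payload = "timeout", None
    if proc.is_alive():
        proc.terminate()       # kill hung or runaway candidates
    proc.join()
    return status, payload
```

An infinite loop in the candidate can never take down the orchestrator: the child is simply terminated when the timeout elapses.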
4. Search Methods (7 Algorithms)
4.1 EoH — Evolution of Heuristics (ICML 2024)
**Paper:** "Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language Model"
**Venue:** `ICML 2024`
EoH evolves both the algorithmic idea (natural language description) and the code implementation simultaneously. This dual representation enables the LLM to reason about high-level algorithmic concepts while grounding them in executable code.
Key Mechanism: Thought-Code Co-Evolution
- Each individual in the population is a (thought, code) pair
- Mutation operates on both: the thought is mutated first, then the code is updated to match
- Crossover combines thoughts from two parents, then generates code for the combined thought
- Selection is based on code execution performance, but thought quality influences mutation quality
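In code, the dual representation amounts to individuals carrying both fields, with every operator routing through the LLM at most twice: idea first, then implementation. The sketch below is illustrative, treating the LLM as a plain prompt-to-text callable; the prompt strings abbreviate the real templates.

```python
import random
from dataclasses import dataclass

@dataclass
class Individual:
    """One EoH individual: a natural-language idea plus its code."""
    thought: str
    code: str
    score: float = float("-inf")

def produce_offspring(parent, llm, mate=None, crossover_rate=0.3):
    # c1: combine two parents' ideas; otherwise e1: mutate one parent's idea.
    if mate is not None and random.random() < crossover_rate:
        thought = llm(f"Combine these two ideas into one:\n"
                      f"1. {parent.thought}\n2. {mate.thought}")
    else:
        thought = llm(f"Propose a variation of this idea:\n{parent.thought}")
    # The code is always regenerated to match the (possibly new) thought.
    code = llm(f"Implement this idea as a Python function:\n{thought}")
    return Individual(thought=thought, code=code)
```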
from llm4ad.methods import EoH
method = EoH(
population_size=20, # Population of (thought, code) pairs
num_generations=50, # Number of evolutionary generations
crossover_rate=0.3, # Probability of crossover vs mutation
mutation_rate=0.7, # Probability of mutation
elite_size=3, # Top-3 preserved across generations
thought_mutation_temp=0.9, # Higher temp for thought diversity
code_mutation_temp=0.7, # Lower temp for code precision
tournament_size=3, # Tournament selection size
)
EoH Operators
| Operator | Description | Input | Output |
|---|---|---|---|
| e1: Thought Mutation | LLM modifies the algorithmic idea | Parent thought + task description | New thought + corresponding code |
| e2: Code Mutation | LLM modifies the code while preserving the thought | Parent thought + parent code | Same thought + modified code |
| c1: Thought Crossover | LLM combines ideas from two parents | Two parent thoughts | New thought + corresponding code |
| c2: Code Crossover | LLM combines code from two parents | Two parent (thought, code) pairs | Combined thought + combined code |
4.2 MEoH — Multi-objective Evolution of Heuristics (AAAI 2025)
**Paper:** "Multi-objective Evolution of Heuristics Using Large Language Models"
**Venue:** `AAAI 2025`
MEoH extends EoH to handle multiple conflicting objectives simultaneously, maintaining a Pareto frontier of non-dominated (thought, code) pairs.
Key Extensions over EoH
- Multi-objective evaluation: Each candidate is scored on multiple metrics (e.g., solution quality + runtime)
- Pareto-based selection: Non-dominated sorting + crowding distance for diverse frontier exploration
- Objective-aware mutation: The LLM is told which objectives are lagging and asked to focus improvements there
- Hypervolume tracking: Progress measured by hypervolume indicator rather than single best score
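The non-dominated comparison underlying Pareto-based selection is compact enough to sketch directly. This is an illustrative, direction-aware implementation, not MEoH's actual selection code; signs are flipped so every objective reads as "higher is better".

```python
def dominates(a, b, directions):
    """True if score vector `a` Pareto-dominates `b`."""
    sign = [1 if d == "maximize" else -1 for d in directions]
    a = [s * x for s, x in zip(sign, a)]
    b = [s * x for s, x in zip(sign, b)]
    # At least as good on every objective, strictly better on one.
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores, directions):
    """Indices of non-dominated candidates in `scores`."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s, directions)
                       for j, t in enumerate(scores) if j != i)]
```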
from llm4ad.methods import MEoH
method = MEoH(
population_size=30,
num_generations=100,
objectives=["quality", "runtime"],
directions=["maximize", "minimize"],
crossover_rate=0.3,
mutation_rate=0.7,
reference_point=[0.0, 1000.0], # For hypervolume computation
objective_focus_strategy="lagging", # Focus mutation on lagging objectives
)
4.3 FunSearch (Nature 2024)
**Paper:** "Mathematical discoveries from program search with large language models"
**Venue:** `Nature 2024`
FunSearch, developed by DeepMind, uses a best-shot sampling approach: generate many candidates from the LLM, evaluate all of them, and keep only the best. It uses an island model with periodic migration to maintain population diversity.
FunSearch Architecture
- Sampler Pool: Multiple LLM samplers generate candidates in parallel
- Programs Database: Island-structured database storing candidates, organized by score tiers
- Evaluator Pool: Parallel evaluators score candidates, with strict timeouts
- Best-shot Strategy: From each island, the top-k candidates are selected as prompting examples for the LLM
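The island bookkeeping behind these components can be sketched in a few lines. This is illustrative rather than FunSearch's actual data structures: islands are plain lists of (score, code) pairs with higher scores better, and migration runs on a ring.

```python
def migrate(islands, k=1):
    """Ring migration: copy each island's top-k programs to the next island."""
    migrants = [sorted(isl, reverse=True)[:k] for isl in islands]
    for i, m in enumerate(migrants):
        islands[(i + 1) % len(islands)].extend(m)
    return islands

def best_shot_examples(island, top_k=2):
    """Top-k programs used as few-shot examples in the next LLM prompt."""
    return [code for _, code in sorted(island, reverse=True)[:top_k]]
```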
from llm4ad.methods import FunSearch
method = FunSearch(
num_islands=10, # Number of islands
programs_per_island=50, # Population per island
num_samplers=4, # Parallel LLM samplers
samples_per_prompt=4, # Candidates per LLM call
temperature=1.0, # High temperature for diversity
top_k_for_prompt=2, # Best-k examples in prompt
reset_period=100, # Reset worst island every N evals
migration_interval=50, # Migrate between islands
)
4.4 ReEvo — Reflective Evolution (NeurIPS 2024)
**Paper:** "ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution"
**Venue:** `NeurIPS 2024`
ReEvo introduces reflective evolution where the LLM not only generates mutations but also reflects on why previous candidates succeeded or failed. This self-reflective capability produces more targeted and effective mutations.
Reflection Loop
- Generate: LLM produces a candidate heuristic
- Evaluate: Run the candidate on the benchmark
- Reflect: LLM analyzes the evaluation results, identifying strengths and weaknesses
- Refine: LLM uses the reflection to propose a targeted improvement
- Repeat: The refined candidate becomes the new parent
from llm4ad.methods import ReEvo
method = ReEvo(
population_size=15,
num_generations=60,
reflection_depth=2, # Number of reflection iterations per mutation
short_term_memory=5, # Recent reflections kept in context
long_term_memory=20, # Total reflections stored
reflection_temperature=0.7, # Temperature for reflection LLM calls
mutation_temperature=0.8, # Temperature for mutation LLM calls
)
Reflection Prompt Structure
# ReEvo reflection prompt (simplified)
"""
## Current Heuristic
{code}
## Evaluation Results
Score: {score}
Test case breakdown:
{per_case_results}
## Previous Reflections
{recent_reflections}
## Task
1. Analyze WHY this heuristic scored {score}
2. Identify the specific weakness causing failures
3. Propose a concrete improvement strategy
4. Explain how the improvement addresses the weakness
## Your Reflection:
"""
4.5 MCTS-AHD — Monte Carlo Tree Search for Algorithm Design (ICML 2025)
**Paper:** "Monte Carlo Tree Search for Automatic Heuristic Design"
**Venue:** `ICML 2025`
MCTS-AHD frames algorithm design as a sequential decision problem and uses Monte Carlo Tree Search to explore the space of algorithmic building blocks. Each node in the tree represents an algorithmic component choice.
MCTS Formulation
- State: Partially specified algorithm (sequence of component choices so far)
- Action: Choose the next algorithmic component (data structure, operator, control flow)
- Reward: Evaluation score of the completed algorithm
- UCB Selection: Balance exploration of new component choices vs exploitation of known-good paths
$$ \mathrm{UCB}(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} $$
Where Q(s,a) is the average reward for taking action a in state s, N(s) is the visit count for state s, N(s,a) is the visit count for action a in state s, and c is the exploration constant.
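As a concrete check of the UCB formula above, a minimal UCB1 selection sketch (illustrative; unvisited actions get infinite priority so each child is tried at least once):

```python
import math

def ucb(q, n_parent, n_action, c=math.sqrt(2)):
    """UCB1 score for one child, matching the formula above."""
    if n_action == 0:
        return math.inf  # force at least one visit per child
    return q + c * math.sqrt(math.log(n_parent) / n_action)

def select_child(children):
    """children: list of (avg_reward, visit_count) pairs; parent visits = sum."""
    n_parent = sum(n for _, n in children)
    scores = [ucb(q, n_parent, n) for q, n in children]
    return scores.index(max(scores))
```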
from llm4ad.methods import MCTSAHD
method = MCTSAHD(
max_depth=8, # Maximum tree depth (algorithm complexity)
num_simulations=200, # MCTS simulations per step
exploration_constant=1.414, # UCB exploration parameter (sqrt(2))
expansion_width=5, # Number of children per expansion
rollout_policy="llm", # Use LLM for rollout (vs random)
backprop_strategy="max", # Backpropagate max score (vs average)
)
4.6 (1+1)-EPS — Evolutionary Program Search (PPSN 2024)
**Paper:** "(1+1)-Evolutionary Program Search"
**Venue:** `PPSN 2024`
(1+1)-EPS is the simplest method: it maintains a single solution and iteratively mutates it, keeping the mutation only if it improves the score. Despite its simplicity, it is surprisingly effective for well-defined optimization tasks.
Algorithm
# (1+1)-EPS pseudocode
def eps_search(initial_solution, evaluate, llm, max_iterations):
    current = initial_solution
    current_score = evaluate(current)
    for i in range(max_iterations):
        # Mutate the current solution using the LLM
        mutant = llm.mutate(current, feedback=get_feedback(current))
        # Evaluate the mutant
        mutant_score = evaluate(mutant)
        # Accept if at least as good (greedy; assumes higher is better)
        if mutant_score >= current_score:
            current = mutant
            current_score = mutant_score
            log(f"Iteration {i}: improved to {current_score}")
    return current, current_score
from llm4ad.methods import EPS
method = EPS(
max_iterations=200, # Total mutation attempts
mutation_temperature=0.8, # LLM temperature for mutations
include_feedback=True, # Include eval feedback in mutation prompt
feedback_window=5, # Include last 5 mutations in context
restart_on_stagnation=True, # Restart from best-so-far after N stagnant iters
stagnation_threshold=30, # Iterations without improvement to trigger restart
)
4.7 LLaMEA (IEEE TEVC 2025)
**Paper:** "LLaMEA: A Large Language Model Evolutionary Algorithm for Automatically Generating Metaheuristics"
**Venue:** `IEEE TEVC 2025`
LLaMEA generates entire metaheuristic algorithms (not just heuristics for specific problems). The LLM is prompted to generate complete evolutionary or swarm-based algorithms, which are then evaluated on a portfolio of benchmark problems.
Key Difference from Other Methods
While EoH/ReEvo evolve problem-specific heuristics, LLaMEA evolves general-purpose optimization algorithms that can be applied to any problem. The output is a complete metaheuristic (like a new variant of PSO or DE) rather than a heuristic for TSP.
from llm4ad.methods import LLaMEA
method = LLaMEA(
population_size=10,
num_generations=30,
algorithm_template="metaheuristic", # Generate full metaheuristics
benchmark_portfolio=[ # Evaluate on multiple problems
"sphere_d10", "rastrigin_d10",
"rosenbrock_d10", "ackley_d10",
],
aggregation="geometric_mean", # Aggregate scores across benchmarks
mutation_strategy="component_swap", # Swap algorithmic components
)
Search Methods Comparison
| Method | Venue | Population | Key Mechanism | Multi-Obj | Reflection | Best For |
|---|---|---|---|---|---|---|
| EoH | ICML 2024 | 20–50 | Thought-code co-evolution | No | No | General heuristics |
| MEoH | AAAI 2025 | 30–100 | Pareto + objective-aware mutation | Yes | No | Multi-objective problems |
| FunSearch | Nature 2024 | Islands | Best-shot sampling + islands | No | No | Mathematical discovery |
| ReEvo | NeurIPS 2024 | 10–30 | Self-reflective evolution | No | Yes | Complex heuristics |
| MCTS-AHD | ICML 2025 | Tree | UCB-guided component selection | No | No | Compositional algorithms |
| (1+1)-EPS | PPSN 2024 | 1 | Greedy hill climbing | No | Partial | Quick prototyping |
| LLaMEA | IEEE TEVC 2025 | 10–20 | Metaheuristic generation | No | No | Algorithm generation |
5. Supported Tasks & Benchmarks
Combinatorial Optimization
| Task | Description | Metric | Instances |
|---|---|---|---|
| Bin Packing | Pack items into fixed-capacity bins minimizing waste | Waste ratio (lower is better) | Falkenauer, Scholl, random (50–5000 items) |
| TSP | Traveling Salesman Problem — find shortest tour | Tour length (lower is better) | TSPLIB (14–2392 cities), random |
| Facility Location | Place facilities to minimize total transportation cost | Total cost (lower is better) | OR-Library, random (10–500 facilities) |
| Knapsack | Select items maximizing value within weight capacity | Total value (higher is better) | Pisinger, random (50–10000 items) |
| QAP | Quadratic Assignment — assign facilities to locations | Total flow × distance (lower is better) | QAPLIB (12–256 facilities) |
| Scheduling | Job-shop and flow-shop scheduling | Makespan (lower is better) | Taillard, random (10x5 to 100x20) |
Machine Learning & AI
| Task | Description | Metric | Instances |
|---|---|---|---|
| BO Acquisition | Design acquisition functions for Bayesian Optimization | Regret (lower is better) | Branin, Hartmann, Levy functions |
| RL Environments | Design reward shaping or policy heuristics | Cumulative reward (higher is better) | CartPole, MountainCar, LunarLander |
Scientific Computing
| Task | Description | Metric | Instances |
|---|---|---|---|
| CFD Turbulence | Design turbulence models for computational fluid dynamics | Prediction error vs DNS (lower is better) | Channel flow, flat plate, backward-facing step |
| Bacteria Growth | Design growth rate models for bacterial colonies | Fit to experimental data (higher R²) | E. coli, S. cerevisiae datasets |
Mathematical Discovery
| Task | Description | Metric | Notes |
|---|---|---|---|
| Circle Packing | Pack n circles in minimum enclosing circle | Ratio of sum of radii to enclosing radius (higher is better) | n=5 to n=30 |
| Cap Set Discovery | Find large cap sets in GF(3)^n | Cap set size (higher is better) | n=4 to n=8 |
| Extremal Combinatorics | Construct extremal graph colorings | Problem-specific score | Various open problems |
6. LLM Backend Support
| Backend | Models | Type | Configuration |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1, o3 | Cloud API | OpenAIBackend(model="gpt-4o", api_key="...") |
| Google | Gemini Pro, Gemini Flash, Gemini 2.5 | Cloud API | GoogleBackend(model="gemini-2.0-pro", api_key="...") |
| DeepSeek | DeepSeek V3, DeepSeek R1 | Cloud API | DeepSeekBackend(model="deepseek-v3", api_key="...") |
| Anthropic | Claude Sonnet 4, Claude Opus 4 | Cloud API | AnthropicBackend(model="claude-sonnet-4-20250514", api_key="...") |
| Local (vLLM) | Llama 3.1, Gemma 2, Mistral, Qwen | Local | VLLMBackend(model_path="meta-llama/Llama-3.1-70B") |
| Local (Ollama) | Any Ollama-supported model | Local | OllamaBackend(model="llama3.1:70b") |
Custom Backend Integration
from llm4ad.llm import LLMBackend, LLMResponse
class MyBackend(LLMBackend):
    """Custom LLM backend for LLM4AD."""

    async def generate(self, prompt: str, **kwargs) -> LLMResponse:
        response = await my_api(prompt, **kwargs)
        return LLMResponse(
            text=response.text,
            tokens_used=response.total_tokens,
            model=self.model_name,
        )

    def get_model_info(self) -> dict:
        return {
            "name": self.model_name,
            "context_window": 128000,
            "supports_system_prompt": True,
        }
7. Platform Features
GUI Interface
LLM4AD includes a web-based GUI for interactive experimentation:
- Method Configuration: Visual parameter tuning for each search method
- Real-time Monitoring: Live convergence plots, population diversity metrics, and cost tracking
- Result Comparison: Side-by-side comparison of multiple runs
- Code Browser: Inspect generated algorithms with syntax highlighting
- Export: Export results as CSV, JSON, or LaTeX tables
Logging Integration
# Weights & Biases logging
runner = LLM4AD(
method=method,
task=task,
llm=llm,
log_to="wandb",
project_name="my-experiment",
run_name="eoh-binpacking-v1",
tags=["eoh", "binpacking", "gpt4o"],
)
# TensorBoard logging
runner = LLM4AD(
method=method,
task=task,
llm=llm,
log_to="tensorboard",
log_dir="./runs/experiment-1",
)
# Both simultaneously
runner = LLM4AD(
method=method,
task=task,
llm=llm,
log_to=["wandb", "tensorboard"],
)
Multiprocessing with Timeout
runner = LLM4AD(
method=method,
task=task,
llm=llm,
max_workers=8, # 8 parallel evaluation workers
timeout_per_eval=60, # Kill evaluation after 60 seconds
memory_limit_mb=2048, # 2GB memory limit per evaluation
retry_on_timeout=True, # Retry timed-out evaluations once
gpu_allocation={ # GPU assignment per worker
0: [0, 1], # Workers 0-1 use GPU 0
1: [2, 3], # Workers 2-3 use GPU 1
},
)
Resumable Runs
# Start a run with checkpointing
runner = LLM4AD(
method=method,
task=task,
llm=llm,
checkpoint_dir="./checkpoints/exp-1",
checkpoint_interval=10, # Checkpoint every 10 generations
)
result = runner.run()
# Resume from checkpoint after crash/interruption
runner = LLM4AD.resume("./checkpoints/exp-1")
result = runner.run() # Continues from last checkpoint
8. API Reference & Code Examples
Core Classes
LLM4AD — Main Orchestrator

```python
class LLM4AD:
    def __init__(
        self,
        method: SearchMethod,            # Search algorithm to use
        task: Task,                      # Optimization task
        llm: LLMBackend,                 # LLM backend
        max_workers: int = 4,            # Parallel evaluation workers
        timeout_per_eval: int = 60,      # Evaluation timeout (seconds)
        log_to: str | list = None,       # Logging backend(s)
        checkpoint_dir: str = None,      # Directory for checkpoints
        checkpoint_interval: int = 10,   # Checkpoint frequency
        seed: int = 42,                  # Random seed
        verbose: bool = True,            # Print progress
    ): ...

    def run(self) -> OptimizationResult:
        """Run the optimization until stopping criteria are met."""
        ...

    @classmethod
    def resume(cls, checkpoint_dir: str) -> "LLM4AD":
        """Resume from a checkpoint."""
        ...
```

Task — Problem Definition

```python
class Task:
    """Base class for optimization tasks."""

    def __init__(
        self,
        name: str,                        # Task name
        instance: str | dict,             # Problem instance data
        metric: str,                      # Primary metric name
        direction: str = "minimize",      # "minimize" or "maximize"
        function_signature: str = None,   # Required function signature
        imports: list[str] = None,        # Allowed imports
        timeout: int = 30,                # Per-evaluation timeout
    ): ...

    def evaluate(self, code: str) -> EvalResult:
        """Evaluate a candidate algorithm."""
        ...

    def get_prompt_context(self) -> str:
        """Return task description for LLM prompts."""
        ...
```

OptimizationResult — Output

```python
class OptimizationResult:
    best_code: str                       # Best algorithm code
    best_score: float                    # Best score achieved
    best_thought: str | None             # Best algorithmic idea (EoH/MEoH)
    history: list[EvaluationRecord]      # Full evaluation history
    population: list[Individual]         # Final population
    stats: RunStats                      # Runtime statistics

    def plot_convergence(self, save_path: str = None):
        """Plot convergence curve."""
        ...

    def plot_population_diversity(self, save_path: str = None):
        """Plot population diversity over time."""
        ...

    def export_csv(self, path: str):
        """Export history to CSV."""
        ...

    def export_latex_table(self) -> str:
        """Generate LaTeX table of results."""
        ...
```
Full Example: Comparing Methods on TSP
from llm4ad import LLM4AD
from llm4ad.methods import EoH, FunSearch, ReEvo, MCTSAHD, EPS
from llm4ad.tasks import TSP
from llm4ad.llm import OpenAIBackend
llm = OpenAIBackend(model="gpt-4o")
task = TSP(instance="tsplib/eil51", metric="tour_length", direction="minimize")
methods = {
"EoH": EoH(population_size=20, num_generations=50),
"FunSearch": FunSearch(num_islands=5, programs_per_island=20),
"ReEvo": ReEvo(population_size=15, num_generations=60),
"MCTS-AHD": MCTSAHD(max_depth=8, num_simulations=200),
"(1+1)-EPS": EPS(max_iterations=200),
}
results = {}
for name, method in methods.items():
    print(f"\n--- Running {name} ---")
    runner = LLM4AD(method=method, task=task, llm=llm, max_workers=4)
    results[name] = runner.run()
    print(f"{name}: best tour length = {results[name].best_score:.2f}")
# Compare results
print("\n=== Comparison ===")
for name, result in sorted(results.items(), key=lambda x: x[1].best_score):
    print(f"{name:15s}: {result.best_score:.2f} (evals: {result.stats.total_evaluations})")
9. Results & Benchmarks
Bin Packing Results (Falkenauer Triplet Instances)
| Method | LLM | Waste (%) | Gap to BKS (%) | Evaluations | Time (min) |
|---|---|---|---|---|---|
| EoH | GPT-4o | 1.23 | 0.08 | 1,000 | 45 |
| MEoH | GPT-4o | 1.20 | 0.05 | 1,500 | 72 |
| FunSearch | GPT-4o | 1.35 | 0.20 | 5,000 | 180 |
| ReEvo | GPT-4o | 1.18 | 0.03 | 900 | 55 |
| MCTS-AHD | GPT-4o | 1.25 | 0.10 | 800 | 60 |
| (1+1)-EPS | GPT-4o | 1.30 | 0.15 | 200 | 20 |
| LLaMEA | GPT-4o | 1.40 | 0.25 | 300 | 35 |
| Best Known | — | 1.15 | 0.00 | — | — |
TSP Results (TSPLIB eil51, att48, kroA100)
| Method | eil51 Gap% | att48 Gap% | kroA100 Gap% | Avg Gap% |
|---|---|---|---|---|
| EoH | 2.1 | 1.8 | 3.5 | 2.47 |
| ReEvo | 1.8 | 1.5 | 2.9 | 2.07 |
| MCTS-AHD | 2.0 | 1.7 | 3.2 | 2.30 |
| FunSearch | 2.5 | 2.2 | 4.1 | 2.93 |
| (1+1)-EPS | 2.8 | 2.5 | 4.5 | 3.27 |
LLM Comparison (EoH on Bin Packing)
| LLM | Waste (%) | Cost per Run | Eval Rate | Success Rate |
|---|---|---|---|---|
| GPT-4o | 1.23 | $15.20 | 22 eval/min | 85% |
| Claude Sonnet 4 | 1.21 | $18.50 | 18 eval/min | 88% |
| Gemini 2.0 Pro | 1.25 | $12.30 | 25 eval/min | 82% |
| DeepSeek V3 | 1.28 | $3.50 | 15 eval/min | 78% |
| Llama 3.1 70B (local) | 1.45 | $0 (GPU cost) | 8 eval/min | 65% |
10. Circle Packing World Record
**World Record:** LLM4AD achieved a new world record for circle packing at n=26, with a score of **2.63594**, surpassing previous best results from AlphaEvolve and classical optimization methods.
Problem Definition
Given n circles, find the arrangement that minimizes the radius of the enclosing circle relative to the circles' total size. Equivalently, maximize the ratio of the sum of the radii to the enclosing radius.
$$ \text{maximize} \quad R = \frac{\sum_{i=1}^{n} r_i}{r_{\text{enclosing}}} $$
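A simplified scorer for this objective might look like the following. It is an illustrative sketch, not the platform's exact evaluator: it checks pairwise overlap with a small tolerance and measures the enclosing circle from the origin, whereas the real evaluator solves the true minimum enclosing circle.

```python
import math

def packing_ratio(circles):
    """Score an arrangement given as a list of (x, y, r) triples.

    Returns sum(r_i) / r_enclosing, the ratio maximized above,
    or -inf if any two circles overlap.
    """
    # Overlap check: centers must be at least r_i + r_j apart.
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - 1e-12:
                return float("-inf")
    # Enclosing circle centered at the origin (simplification).
    r_enclosing = max(math.hypot(x, y) + r for x, y, r in circles)
    return sum(r for _, _, r in circles) / r_enclosing
```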
Method Configuration
from llm4ad import LLM4AD
from llm4ad.methods import FunSearch
from llm4ad.tasks import CirclePacking
from llm4ad.llm import GoogleBackend
task = CirclePacking(
n=26,
metric="packing_ratio",
direction="maximize",
evaluator="exact", # Use exact geometric computation
timeout_per_eval=120, # Allow longer evals for large n
)
method = FunSearch(
num_islands=15,
programs_per_island=100,
num_samplers=8,
samples_per_prompt=8,
temperature=1.0,
top_k_for_prompt=3,
reset_period=200,
)
llm = GoogleBackend(model="gemini-2.0-pro")
runner = LLM4AD(method=method, task=task, llm=llm, max_workers=16)
result = runner.run()
print(f"Best packing ratio: {result.best_score:.5f}") # 2.63594
Circle Packing Results by n
| n | Previous Best | LLM4AD Result | Method | Evaluations |
|---|---|---|---|---|
| 10 | 2.38660 | 2.38660 | Matched optimal | 500 |
| 15 | 2.47540 | 2.47540 | Matched optimal | 1,200 |
| 20 | 2.52040 | 2.52042 | Slight improvement | 3,000 |
| 22 | 2.56287 | 2.56290 | Improvement | 5,000 |
| 24 | 2.60240 | 2.60248 | Improvement | 8,000 |
| 26 | 2.63590 | 2.63594 | World record | 15,000 |
11. Comparison with Other Platforms
| Feature | LLM4AD | AlphaEvolve | OpenEvolve | GEPA | ShinkaEvolve |
|---|---|---|---|---|---|
| Open Source | MIT | No | Yes | Yes | Apache 2.0 |
| Search Methods | 7 methods | 1 (custom) | 1 (custom) | 1 (Pareto) | 1 (custom) |
| Built-in Tasks | 12+ tasks | Custom only | Custom only | Custom only | Custom only |
| LLM Backends | 6 backends | Google only | Multi | Multi | Multi |
| GUI | Yes | No | No | No | Yes |
| Multi-Objective | MEoH | Weighted | Weighted | Pareto | No |
| Resumable | Yes | Yes | Yes | Yes | Yes |
| W&B/TB Logging | Both | No | No | No | Custom |
12. Limitations & Future Work
Current Limitations
- Python Only: Generated algorithms are restricted to Python. No support for C++, Rust, or Julia algorithms that could be significantly faster.
- Limited Task Complexity: Built-in tasks cover standard benchmarks but not full-scale industrial problems with complex constraints.
- LLM Cost: Running 7 methods across many tasks for fair comparison is expensive. A complete benchmark suite can cost hundreds of dollars in LLM API calls.
- No Cross-Method Ensembling: Currently no mechanism to combine insights from multiple search methods during a single run.
- Evaluation Bottleneck: For expensive-to-evaluate tasks (CFD, RL), evaluation rather than LLM calls becomes the bottleneck.
Planned Features
- Method Ensembling: Run multiple search methods in parallel, sharing promising candidates across methods.
- Distributed Execution: Distribute evaluation across multiple machines via Ray or Dask.
- Auto-Method Selection: Automatically select the best search method for a given task based on problem characteristics.
- Multi-Language Support: Generate and evaluate algorithms in C++, Julia, and Rust for performance-critical applications.
- Benchmark Hub: Community-contributed tasks and benchmark results, similar to HuggingFace model hub.
- Meta-Learning: Learn from past optimization runs to warm-start new runs on similar problems.
**Summary:** LLM4AD is the most comprehensive platform for LLM-based automatic algorithm design. Its 7 integrated search methods, 12+ built-in tasks, multiple LLM backends, GUI, and production features (logging, resumability, multiprocessing) make it the go-to platform for both researchers comparing methods and practitioners solving real-world optimization problems.