
Bilevel Autoresearch

A bilevel framework in which an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime.

Organization: Independent Research (Yaonan Qu, Meng Lu)
Published: March 24, 2026
Type: paper/repo
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: Bilevel Autoresearch: Meta-Autoresearching Itself

ArXiv: arXiv:2603.23420 (cs.AI)

Repository: github.com/EdwardOptimization/Bilevel-Autoresearch

License: Open release (code and experiment logs)

Status: Research prototype with published experiment logs

Lineage: Extends the Karpathy autoresearch paradigm (2026) by adding a bilevel meta-optimization layer. Directly references AutoResearchClaw (AIMing Lab, 2026) and EvoScientist (2026) as prior systems that required human-designed improvements.

"If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop."


2 Authors and Team

Author Affiliation Contact
Yaonan Qu Independent Researcher EdwardOptimization@gmail.com
Meng Lu Independent Researcher menglu_16@connect.hku.hk

This is a notable contribution from independent researchers outside the major academic-industrial labs. The research demonstrates that fundamental advances in autoresearch methodology do not require institutional infrastructure—the experiments run on a single RTX 5090 GPU with a commodity LLM API (DeepSeek).

Meng Lu's HKU email suggests a connection to the University of Hong Kong, though the paper lists "Independent Researcher" as the affiliation for both authors.


3 Core Contribution

The Foundational Observation

Every existing autoresearch system was improved by a human who:

  1. Read the system's code
  2. Identified a bottleneck
  3. Wrote new code to address it

System Human-Designed Improvement Year
Karpathy autoresearch Single-track propose-evaluate-accept loop 2026
AutoResearchClaw Multi-batch parallel search (branching factor ↑) 2026
EvoScientist Persistent experience memory across runs 2026

The authors ask: can an LLM perform that same design step autonomously?

The Answer: Bilevel Optimization

The paper formalizes autoresearch as a bilevel optimization problem:

min_φ  F(φ, θ*(φ))           # Outer: optimize search mechanism φ
s.t.   θ*(φ) ∈ argmin_θ f(θ, φ)   # Inner: optimize task performance θ

where:

  • φ — the search mechanism (runner code), a discrete artifact produced by code generation
  • θ — the training configuration optimized by the inner loop
  • f — the inner objective (validation loss after training with θ)
  • F — the outer objective (best validation loss achieved by the inner loop)

The key departure from classical bilevel optimization: φ is a program, not a parameter vector. The outer level generates and replaces code rather than adjusting continuous parameters.
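
The code-as-search-space idea can be illustrated with a minimal toy sketch (my own illustration, not the paper's implementation): the outer level injects generated source code as the inner loop's proposal mechanism, then scores that mechanism by the best result the inner loop reaches.

```python
# Minimal toy sketch of the bilevel structure (illustrative only; the real
# system edits runner.py, uses an LLM to generate phi, and trains a GPT).

def inner_loop(propose, evaluate, steps=20):
    """Level 1: propose -> evaluate -> accept-if-improved, under mechanism `propose`."""
    theta, best = 0.0, float("inf")
    for _ in range(steps):
        candidate = propose(theta)
        loss = evaluate(candidate)
        if loss < best:                      # greedy acceptance, as in the inner loop
            best, theta = loss, candidate
    return best

def outer_step(mechanism_source, evaluate):
    """Level 2: turn generated source code into a live `propose` and score it."""
    namespace = {}
    exec(mechanism_source, namespace)        # runtime code injection
    return inner_loop(namespace["propose"], evaluate)

# A toy "generated mechanism": random local perturbation of theta.
mechanism = """
import random
def propose(theta):
    return theta + random.uniform(-1.0, 1.0)
"""
score = outer_step(mechanism, evaluate=lambda x: (x - 0.7) ** 2)  # F(phi, theta*(phi))
```

Here F(φ, θ*(φ)) is simply the best loss the injected mechanism reaches; swapping in a different `mechanism` string changes how the inner loop searches without touching its accept/reject skeleton.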

Three-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│  LEVEL 2: Mechanism Research (Meta-Autoresearch)            │
│  Generates new Python search mechanisms via 4-round         │
│  LLM dialogue. Injects code at runtime.                     │
│  Executes: every 2 outer cycles                             │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  LEVEL 1.5: Search Strategy Adjustment                │  │
│  │  Freezes/unfreezes parameters. Injects guidance.      │  │
│  │  Executes: every 5 inner iterations                   │  │
│  │                                                       │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │  LEVEL 1: Inner Autoresearch Loop               │  │  │
│  │  │  Standard propose → train → evaluate → accept   │  │  │
│  │  │  Executes: every iteration (30 per repeat)      │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Core Result

Level 2 (mechanism research) achieves a 5× improvement over Level 1 alone: −0.045 vs. −0.009 val_bpb on Karpathy's GPT pretraining benchmark. Level 1.5 (parameter adjustment) yields no reliable gain over Level 1 alone.

The generated mechanisms—Tabu Search, Multi-Scale Bandit, Orthogonal Exploration—are autonomously discovered from combinatorial optimization, online learning, and design of experiments, without human specification of which domains to explore.


4 Supported Solutions

Solution Type Support Level Details
Neural network hyperparameter optimization Primary target GPT pretraining configuration search
Autoresearch loop improvement Core mechanism Self-modification of search logic
Search mechanism generation Core mechanism Novel Python algorithms generated at runtime
Bilevel optimization Framework Any problem expressible as nested optimization
Algorithm discovery Demonstrated Tabu Search, Bandits, DOE discovered autonomously

Task Scope

The current implementation targets a specific but representative task:

Property Value
Task GPT pretraining optimization
Model size 50M parameters
Training budget 300 seconds per evaluation
Hardware Single RTX 5090
Metric Validation bits-per-byte (val_bpb)
Editable parameters LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, TOTAL_BATCH_SIZE, etc.
Frozen parameters DEPTH=8, ASPECT_RATIO=64 (architecture size locked)

Generality Claim

The paper's central thesis is that the bilevel framework is task-agnostic: any problem with a measurable objective can be meta-autoresearched. The authors explicitly state:

"The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective."

However, this generality is unproven beyond the single benchmark. The authors acknowledge this in their limitations section.


5 LLM Integration

Single-Model Architecture

A defining feature of Bilevel Autoresearch is that all three levels use the same LLM:

Property Value
Model DeepSeek deepseek-chat
Provider DeepSeek API
Used at Level 1 Proposes hyperparameter changes
Used at Level 1.5 Generates SearchConfig updates
Used at Level 2 4-round research dialogue generating Python code
Stronger model required? No — improvements come from architecture, not model capability

"Both loops use the same LLM—no stronger model is needed at the meta level."

This is a significant design choice. It means the outer loop cannot succeed by simply being "smarter"—it must succeed through structural advantage (bilevel architecture, code injection, mechanism diversity).

Level 1: LLM as Configuration Proposer

At each inner iteration, the LLM receives:

  1. Current train.py (frozen at best accepted configuration)
  2. List of active editable parameters
  3. List of frozen parameters
  4. Strategic guidance string (injected by Level 1.5)

The LLM proposes:

  • A set of parameter name–value pairs
  • A one-sentence hypothesis explaining the rationale

Input to LLM (Level 1):
  - Current train.py (best config)
  - Editable: [LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, ...]
  - Frozen: [DEPTH=8, ASPECT_RATIO=64]
  - Guidance: "Prioritize batch size exploration; LR has stalled"

Output from LLM:
  - Changes: {TOTAL_BATCH_SIZE: 128 → 256, LR: 3e-4 → 1e-4}
  - Hypothesis: "Larger batch with lower LR may stabilize training"
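
This exchange can be sketched as follows, assuming a JSON-replying LLM client. `level1_step`, `llm_call`, and the JSON schema are my illustrative assumptions, not the repository's API:

```python
import json

def level1_step(train_py, editable, frozen, guidance, llm_call):
    """One Level 1 proposal: build the prompt, parse the reply, filter frozen params."""
    prompt = (
        f"Current train.py:\n{train_py}\n"
        f"Editable: {editable}\nFrozen: {frozen}\n"
        f"Guidance: {guidance}\n"
        'Reply as JSON: {"changes": {...}, "hypothesis": "..."}'
    )
    proposal = json.loads(llm_call(prompt))
    # Defensively drop any edits to frozen parameters before applying them.
    proposal["changes"] = {
        k: v for k, v in proposal["changes"].items() if k not in frozen
    }
    return proposal

# Stand-in LLM for the sketch; a real call would hit the DeepSeek API.
fake_llm = lambda _: (
    '{"changes": {"TOTAL_BATCH_SIZE": 256, "DEPTH": 12},'
    ' "hypothesis": "larger batch may stabilize training"}'
)
proposal = level1_step("...", ["TOTAL_BATCH_SIZE", "LR"], ["DEPTH"],
                       "Prioritize batch size exploration", fake_llm)
# DEPTH is frozen, so only the TOTAL_BATCH_SIZE change survives the filter.
```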

Level 1.5: LLM as Search Strategist

Every 5 inner iterations, the LLM receives the full trace of proposals and outcomes and generates a SearchConfig update:

  • Freeze parameters proposed ≥k times (default k=3) with zero net improvement
  • Unfreeze parameters that were frozen but haven't been explored since the search moved to a new region
  • Inject guidance instructing the inner loop to prioritize under-explored parameters

Critical constraint: Level 1.5 cannot change the proposal generation logic, acceptance criterion, or loop structure. These require Level 2.
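
The freeze rule above can be sketched deterministically (field and function names are illustrative assumptions; in the real system the LLM itself produces the SearchConfig after reading the trace):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SearchConfig:
    freeze: list = field(default_factory=list)
    unfreeze: list = field(default_factory=list)
    guidance: str = ""

def update_search_config(trace, k=3):
    """Freeze any parameter proposed >= k times with zero net improvement."""
    counts = defaultdict(int)
    net_improvement = defaultdict(float)   # negative delta = improvement
    for entry in trace:
        for param in entry["params"]:
            counts[param] += 1
            net_improvement[param] += entry["delta"]
    cfg = SearchConfig(guidance="Prioritize under-explored parameters")
    cfg.freeze = [p for p, c in counts.items()
                  if c >= k and net_improvement[p] >= 0.0]
    return cfg

trace = [
    {"params": {"LR": 1e-4}, "delta": 0.0},
    {"params": {"LR": 2e-4}, "delta": 0.0},
    {"params": {"LR": 5e-4}, "delta": 0.0},
    {"params": {"HEAD_DIM": 64}, "delta": -0.004},
]
cfg = update_search_config(trace)
# LR stalled three times -> frozen; HEAD_DIM improved, so it stays editable.
```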

Level 2: LLM as Mechanism Researcher

Level 2 is the most novel LLM integration. It conducts a 4-round structured dialogue, each round a single LLM call:

Round 1: EXPLORE
  Input:  Full runner.py source code + search trace
  Output: Survey of mechanisms from adjacent fields
          (combinatorial optimization, online learning, DOE, 
           Bayesian optimization)

Round 2: CRITIQUE  
  Input:  Candidate mechanisms from Round 1
  Output: Evaluation against observed failure modes
          (e.g., repetitive proposals, parameter fixation)
          Selection of most promising mechanism

Round 3: SPECIFY
  Input:  Selected mechanism from Round 2
  Output: Precise interface specification:
          - Class name
          - Constructor arguments
          - Key methods with signatures
          - Integration points in runner.py

Round 4: GENERATE
  Input:  Specification from Round 3
  Output: Complete, runnable Python code implementing the
          mechanism + modifications to runner.py

Each Level 2 session takes approximately 3 minutes of wall time (four LLM calls). Code is validated via importlib dynamic loading before activation. If import fails, the original runner is restored from backup.
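
The session structure can be sketched as a simple chain of four calls, each round's output feeding the next round's input (prompt wording and the `llm_call` interface are assumptions for illustration):

```python
# The four rounds chained into one session; prompt templates are illustrative.
ROUNDS = [
    ("EXPLORE",  "Survey mechanisms from adjacent fields.\n{context}"),
    ("CRITIQUE", "Evaluate candidates against observed failure modes; pick one.\n{context}"),
    ("SPECIFY",  "Write a precise class/interface specification.\n{context}"),
    ("GENERATE", "Emit complete, runnable Python implementing the spec.\n{context}"),
]

def research_session(runner_source, search_trace, llm_call):
    context = f"runner.py:\n{runner_source}\n\ntrace:\n{search_trace}"
    for _name, template in ROUNDS:
        # One LLM call per round; ~4 calls (~3 minutes) per session.
        context = llm_call(template.format(context=context))
    return context  # Round 4 output: the mechanism code to validate and inject

# Stand-in LLM so the sketch runs without an API key.
echo = lambda prompt: "CODE:" + prompt[:40]
generated = research_session("def run(): pass", "[]", echo)
```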

LLM Capabilities Exercised

Capability Level 1 Level 1.5 Level 2
Hyperparameter reasoning ✓ ✓ —
Search trace analysis — ✓ ✓
Code comprehension — — ✓
Cross-domain knowledge — — ✓
Interface design — — ✓
Python code generation — — ✓
Self-critique — — ✓

6 Key Results

Four-Group Ablation Design

The experiment uses a controlled four-group ablation with 3 repeats per group:

Group Levels Active Description
A (Baseline) Level 1 only Standard inner autoresearch loop
B Level 1 + 1.5 Inner loop + parameter-level adjustment
C Level 1 + 1.5 + 2 Full bilevel system
D Level 1 + 2 Inner loop + mechanism research (no Level 1.5)

Primary Results

Group Mean Δval_bpb Std Dev Best Single Worst Single
A (L1 only) −0.009 ±0.002 −0.011 −0.007
B (L1 + L1.5) −0.006 ±0.006 −0.012 −0.000
C (L1 + L1.5 + L2) −0.045 ±0.030 −0.065 −0.011
D (L1 + L2) −0.034 ±0.031 −0.065 −0.001

Group C achieves a 5× improvement over Group A (−0.045 vs. −0.009). The improvement is driven entirely by Level 2 mechanism generation. Level 1.5 provides no reliable gain over Level 1 alone.

Hypothesis Testing

Hypothesis Result Evidence
H1: Group B > Group A Not supported B mean (−0.006) actually slightly worse than A (−0.009); B's R1 achieved −0.000
H2: Group C >> Group B Supported C mean (−0.045) is 7.5× B's (−0.006), with large separation
H3: L2 discovers novel mechanisms Supported Three distinct domains discovered (combinatorial opt, online learning, DOE)
H4: Group D ≈ Group C Supported (with caveats) D mean (−0.034) close to C (−0.045), but D has a zero-improvement repeat

Mechanisms Autonomously Discovered

Mechanism Domain Repeat Injection Success Effect
Tabu Search Manager Combinatorial Optimization C-R1 −0.065 (best overall)
Multi-Scale Bandit Proposer Online Learning (MAB) C-R2 −0.011 (modest)
Orthogonal Exploration Design of Experiments C-R3 −0.058 (strong)
GP Regressor Bayesian Optimization D-R2 ❌ (missing sklearn) Reverted
(Two additional mechanisms) Various D-R1, D-R3 Variable

Key insight: The discovered mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid. The LLM's default proposals exhibit strong mode-seeking behavior (repeatedly proposing similar changes); the meta-generated mechanisms inject structured diversity.

Per-Repeat Breakdown

Group A (Level 1 only):
  R1: −0.011  R2: −0.009  R3: −0.007
  Pattern: Consistent small improvements, tight variance

Group B (Level 1 + 1.5):
  R1: −0.000  R2: −0.012  R3: −0.005
  Pattern: High variance, L1.5 sometimes hurts (R1 = zero improvement)

Group C (Level 1 + 1.5 + 2):
  R1: −0.065  R2: −0.011  R3: −0.058
  Pattern: Two dramatic improvements, one modest; R2 underperformed

Group D (Level 1 + 2):
  R1: −0.065  R2: −0.001  R3: −0.036
  Pattern: Similar to C but slightly less robust; R2 near-zero

Why Group C-R2 Underperformed

The Multi-Scale Bandit Proposer generated for C-R2, while valid and correctly injected, was less effective at forcing exploration of the batch size dimension compared to Tabu Search (R1) or Orthogonal Exploration (R3). Additionally, each Level 2 session consumes ~3 minutes of wall time (four LLM calls), reducing effective inner iterations. With two sessions per repeat, Group C has ~6 minutes less search time.


7 Reproducibility

What Is Released

Artifact Location Description
Source code GitHub Full implementation including all three levels
Experiment logs GitHub Complete traces for all 12 runs (4 groups × 3 repeats)
Generated mechanisms In logs Python code for Tabu Search, Bandit, Orthogonal Exploration, etc.

Reproducibility Challenges

The paper is transparent about significant reproducibility limitations:

1. Small Sample Size (n=3)

"Three repeats per group is insufficient for rigorous statistical comparison. Group C's standard deviation (±0.030) is 67% of its absolute mean, indicating high variability. Reliable estimates would require n≥10 repeats per group."

With only 3 repeats, the results are suggestive but not statistically robust. The variance is large enough that Group C's superiority, while dramatic in magnitude, has wide confidence intervals.

2. Baseline Variance

Baseline val_bpb varies across repeats (1.094–1.114) due to training randomness from data ordering and weight initialization. Using Δ = best − baseline mitigates but does not eliminate this confound.

3. Single Benchmark

All results are on one task: GPT pretraining at 50M parameters with a 300-second budget on RTX 5090. Generalization to other model sizes, training budgets, or tasks is unproven.

4. Dynamic Loading Fragility

"A preliminary run was invalidated because the Level 2 dynamic loading pipeline contained a sys.modules registration bug, causing all mechanism injections to silently fall back to the original runner."

This "silent fallback" failure mode is particularly dangerous: the system appears to inject mechanisms but actually runs the original code. The bug was fixed before the reported results, but it highlights the fragility of runtime code injection.

5. LLM Nondeterminism

The DeepSeek API returns stochastic outputs. The same Level 2 dialogue may produce different mechanisms across runs, making exact reproduction impossible.

What Would Be Needed for Full Reproducibility

  • Fixed random seeds for training
  • n≥10 repeats per group
  • Multiple benchmarks (different model sizes, tasks)
  • Deterministic LLM outputs (fixed temperature=0, though this doesn't guarantee determinism with API-served models)

8 Compute and API Costs

Hardware

Component Value
GPU Single NVIDIA RTX 5090
Training budget per evaluation 300 seconds
Inner iterations per repeat 30
Total training time per repeat ~30 × 300s = ~2.5 hours (+ overhead)
Total experiment time 12 repeats × ~2.5 hours = ~30 hours GPU time

LLM API Costs

Level Calls per Repeat Call Type Estimated Cost
Level 1 30 Short proposal generation Low (small prompt + response)
Level 1.5 6 Search trace analysis Moderate (full trace in context)
Level 2 8 (2 sessions × 4 rounds) Long code generation Higher (full runner.py + trace + code output)
Total per repeat ~44 Mixed Estimated $1–5
Total experiment ~528 Mixed Estimated $12–60

The use of DeepSeek's API (among the cheapest frontier LLM APIs available) keeps costs remarkably low. The entire experiment likely cost less than $100 in API fees.

System Hardware LLM Provider Estimated Total Cost
Bilevel Autoresearch 1× RTX 5090 DeepSeek ~$100
Karpathy autoresearch 1× GPU Claude/GPT-4 ~$20–50 per run
AutoResearchClaw Multi-GPU Various Higher (parallel batches)
AlphaEvolve Google TPU pods Gemini (internal) Not disclosed (massive)
FunSearch Google infrastructure PaLM 2 (internal) Not disclosed (massive)

Bilevel Autoresearch is notable for achieving its results at consumer-hardware scale. The entire experiment runs on a single gaming GPU with a budget LLM API.


9 Architecture Solution

Bilevel Optimization Formalization

OUTER LEVEL (Level 2):
  Optimize φ (search mechanism = runner code)
  Objective: F(φ, θ*(φ)) = best val_bpb achieved by inner loop
  Search space: Python programs (discrete, combinatorial)
  Search method: LLM code generation (4-round dialogue)

INNER LEVEL (Level 1 + 1.5):
  Optimize θ (training configuration)
  Objective: f(θ, φ) = val_bpb after 300s training
  Search space: Hyperparameter configurations
  Search method: LLM-guided propose-evaluate-accept

Full System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                   BILEVEL AUTORESEARCH ARCHITECTURE                  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  LEVEL 2: MECHANISM RESEARCH                         [Green]   │  │
│  │                                                                │  │
│  │  Trigger: Every 2 outer cycles (Level 1.5 cycles)              │  │
│  │                                                                │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │  │
│  │  │ Round 1   │  │ Round 2   │  │ Round 3   │  │ Round 4   │      │  │
│  │  │ EXPLORE   │→│ CRITIQUE  │→│ SPECIFY   │→│ GENERATE  │      │  │
│  │  │           │  │           │  │           │  │           │      │  │
│  │  │ Survey    │  │ Evaluate  │  │ Interface │  │ Write     │      │  │
│  │  │ adjacent  │  │ against   │  │ spec:     │  │ complete  │      │  │
│  │  │ fields    │  │ observed  │  │ class,    │  │ runnable  │      │  │
│  │  │ for ideas │  │ failures  │  │ methods,  │  │ Python    │      │  │
│  │  │           │  │           │  │ hooks     │  │ code      │      │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └─────┬────┘      │  │
│  │                                                    │           │  │
│  │                                          ┌─────────▼────────┐ │  │
│  │                                          │ importlib.import  │ │  │
│  │                                          │ Validate → Inject │ │  │
│  │                                          │ or Revert backup  │ │  │
│  │                                          └─────────┬────────┘ │  │
│  └────────────────────────────────────────────────────┼──────────┘  │
│                                                       │              │
│  ┌────────────────────────────────────────────────────┼──────────┐  │
│  │  LEVEL 1.5: SEARCH STRATEGY LOOP           [Amber] │          │  │
│  │                                                     │          │  │
│  │  Trigger: Every 5 inner iterations                  │          │  │
│  │                                                     │          │  │
│  │  Reads: Full proposal/outcome trace                 │          │  │
│  │  Outputs: SearchConfig {                            │          │  │
│  │    freeze: [params stalled ≥3 times]                │          │  │
│  │    unfreeze: [params not explored in new region]    │          │  │
│  │    guidance: "Prioritize under-explored params"     │          │  │
│  │  }                                                  │          │  │
│  │                                                     ▼          │  │
│  │  CANNOT change: proposal logic, acceptance rule,  ┌────────┐  │  │
│  │                  loop structure                    │ Config │  │  │
│  │                                                   └───┬────┘  │  │
│  └───────────────────────────────────────────────────────┼───────┘  │
│                                                          │          │
│  ┌───────────────────────────────────────────────────────┼───────┐  │
│  │  LEVEL 1: INNER AUTORESEARCH LOOP              [Blue] │       │  │
│  │                                                       ▼       │  │
│  │  for t in range(30):  # iteration budget              │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  LLM reads   │  ← train.py (best config)        │       │  │
│  │    │  context +    │  ← editable/frozen param lists   │       │  │
│  │    │  guidance     │  ← L1.5 guidance string          │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Propose     │  → {param: value} + hypothesis    │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Train 300s  │  → val_bpb                        │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Accept if   │  val_bpb < current_best?          │       │  │
│  │    │  improved    │  YES → update best config         │       │  │
│  │    │              │  NO  → discard, keep current      │       │  │
│  │    └─────────────┘                                    │       │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  TASK: GPT Pretraining (50M params, RTX 5090, 300s budget)     │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Algorithm (Pseudocode)

Algorithm: Bilevel Autoresearch (Group C configuration)

Input: baseline train.py, runner φ₀, budgets T=30, K=5, M=2

θ ← baseline config
φ ← φ₀
best_val_bpb ← evaluate(θ)   # establish the baseline score
t ← 0
outer_cycle ← 0

while t < T:
    # Level 1: Inner autoresearch iteration
    proposal ← LLM.propose(train.py, editable_params, frozen_params, guidance)
    val_bpb ← train_and_evaluate(proposal, budget=300s)
    if val_bpb < best_val_bpb:
        θ ← proposal  # Accept
        best_val_bpb ← val_bpb

    t ← t + 1

    # Level 1.5: Outer search strategy (every K iterations)
    if t mod K == 0:
        search_config ← LLM.analyze_trace(proposal_history, outcome_history)
        update(frozen_params, editable_params, guidance, search_config)
        outer_cycle ← outer_cycle + 1

        # Level 2: Mechanism research (every M outer cycles)
        if outer_cycle mod M == 0:
            mechanism_code ← LLM.research_session(
                runner_source=φ,
                search_trace=trace,
                rounds=[EXPLORE, CRITIQUE, SPECIFY, GENERATE]
            )
            backup ← copy(φ)
            try:
                φ ← inject(mechanism_code)
                importlib.import_module(φ)  # Validate
            except:
                φ ← backup  # Revert on failure

return θ, best_val_bpb

Key Architectural Properties

  1. Same-model recursion. All levels use the same DeepSeek model. Improvements must come from structural advantage, not from a more capable meta-model.

  2. Code as search space. The outer level searches over programs, not parameters. This is categorically different from classical bilevel optimization where both levels optimize over continuous spaces.

  3. Validate-and-revert safety. Level 2 code injection uses importlib validation with automatic rollback. Failed injections are non-destructive.

  4. Separation of concerns. Level 1 cannot change its own structure. Level 1.5 can redirect search but not restructure it. Only Level 2 can perform structural modifications.


10 Component Breakdown

Component 1: Level 1 Inner Loop (Standard Autoresearch)

Function: Implements the canonical Karpathy autoresearch cycle

Configuration:

Parameter Value
Iteration budget 30 per repeat
Training budget 300 seconds per evaluation
Acceptance criterion val_bpb < current_best
Frozen architecture params DEPTH=8, ASPECT_RATIO=64
Editable params LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, TOTAL_BATCH_SIZE, etc.

Behavior: Hill climbing in configuration space where the LLM's world knowledge serves as an implicit prior over promising changes. Training outcomes provide gradient-free feedback.

Limitations: The LLM exhibits mode-seeking behavior—repeatedly proposing similar changes around local optima. This is the fundamental bottleneck that Level 2 addresses.

Component 2: Level 1.5 Search Strategy Loop

Function: Redirects inner loop search without changing its structure

Trigger: Every 5 inner iterations

Operations:

  1. Freeze parameters proposed ≥3 times with zero net improvement
  2. Unfreeze parameters frozen but unexplored in new search regions
  3. Inject a guidance string prioritizing under-explored parameters

What Level 1.5 CANNOT do:

  • Change proposal generation logic
  • Modify the acceptance criterion
  • Alter the loop structure
  • Add new tools or mechanisms

Empirical result: Level 1.5 increases search diversity (Group B explores more parameters than Group A) but this diversity does not translate into larger improvements. In one repeat (B-R1), Level 1.5 achieved zero improvement—the worst outcome of any repeat in any group.

Component 3: Level 2 Mechanism Research

Function: Generates and injects new search mechanisms as Python code

Trigger: Every 2 outer cycles (every 10 inner iterations)

4-Round Research Session:

Round Name Input Output LLM Capability
1 Explore runner.py source + search trace Survey of candidate mechanisms from adjacent fields Cross-domain knowledge
2 Critique Candidate mechanisms from Round 1 Evaluation against observed failure modes; selection of best Analytical reasoning
3 Specify Selected mechanism Interface spec: class name, constructor args, method signatures, integration points Software design
4 Generate Specification from Round 3 Complete, runnable Python code + runner.py modifications Code generation

Safety mechanisms:

  • Pre-injection backup of runner.py
  • importlib dynamic-loading validation
  • Automatic revert on import failure
  • No constraint on external imports (vulnerability: an sklearn dependency caused one revert)

Success rate: 5 of 6 mechanisms passed import validation and were activated (83.3%)

Component 4: Validate-and-Revert Pipeline

import importlib
import shutil

def inject_mechanism(patched_code, expected_class):
    shutil.copy("runner.py", "runner.py.bak")        # Pre-injection backup
    try:
        with open("runner.py", "w") as f:
            f.write(patched_code)
        module = importlib.import_module("runner")
        importlib.reload(module)                     # Force re-execution of new code
        assert hasattr(module, expected_class)       # Verify expected interface exists
        return "SUCCESS"
    except Exception:
        shutil.copy("runner.py.bak", "runner.py")    # Automatic revert
        return "REVERTED"

Known vulnerability: A preliminary run was invalidated by a sys.modules registration bug that caused silent fallback—mechanisms appeared injected but the original runner actually executed. This was fixed before reported results, but highlights the danger of silent failure modes in runtime code injection.
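
A plausible sketch of this failure class (an assumption about the bug's shape, not the actual repo code): if a stale entry lingers in sys.modules, importlib returns the old module object and the newly written code silently never runs. Evicting the cached module and invalidating finder caches are both needed for a reliable reload of a file rewritten at runtime:

```python
import importlib
import sys

def reload_runner():
    # Evict any cached module so the import below re-executes runner.py;
    # without this, import_module returns the stale module object.
    sys.modules.pop("runner", None)
    # Finder caches can also hide a file created or rewritten at runtime.
    importlib.invalidate_caches()
    return importlib.import_module("runner")
```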

Component 5: Search Trace Accumulator

All proposal–outcome pairs are recorded in a chronological trace:

trace = [
  {iter: 1, params: {LR: 3e-4}, hypothesis: "...", val_bpb: 1.105, accepted: True},
  {iter: 2, params: {LR: 1e-4}, hypothesis: "...", val_bpb: 1.110, accepted: False},
  ...
]

This trace serves as the primary input for both Level 1.5 (search strategy analysis) and Level 2 (mechanism research). The quality of the trace directly affects the quality of meta-optimization.


11 Core Mechanisms (Detailed)

Mechanism 1: Tabu Search Manager (Discovered by Level 2)

Source domain: Combinatorial optimization

Discovered in: Group C, Repeat 1

Result: −0.065 val_bpb improvement (best single outcome across all groups)

How it works: Maintains a list of recently-visited configurations (the "tabu list"). When the inner-loop LLM proposes a change, the Tabu Search Manager checks whether the proposed configuration is "too close" to any recently-visited one. If so, it either rejects the proposal or perturbs it to force movement away from the tabu region.

Why it helps: The LLM's default search behavior exhibits strong recency bias—it tends to propose changes similar to recent successful ones, leading to rapid convergence to local optima. Tabu search explicitly breaks this pattern by forbidding revisits, forcing the agent to explore structurally different configurations.

from collections import deque
import random

class TabuSearchManager:
    def __init__(self, tabu_tenure=5, threshold=0.1):
        self.tabu_list = deque(maxlen=tabu_tenure)   # recently visited configs
        self.threshold = threshold

    def _distance(self, a, b):
        # Normalized L1 distance over shared numeric parameters (illustrative choice)
        keys = set(a) & set(b)
        if not keys:
            return float("inf")
        return sum(abs(a[k] - b[k]) / (abs(b[k]) + 1e-9) for k in keys) / len(keys)

    def _perturb(self, config):
        # Nudge one parameter to push the proposal out of the tabu region
        key = random.choice(list(config))
        perturbed = dict(config)
        perturbed[key] = perturbed[key] * random.choice([0.5, 2.0])
        return perturbed

    def is_tabu(self, proposed_config):
        return any(self._distance(proposed_config, t) < self.threshold
                   for t in self.tabu_list)

    def update(self, accepted_config):
        self.tabu_list.append(accepted_config)

    def filter_proposal(self, proposal):
        if self.is_tabu(proposal):
            return self._perturb(proposal)  # Force exploration
        return proposal

Mechanism 2: Multi-Scale Bandit Proposer (Discovered by Level 2)

Source domain: Online learning / multi-armed bandits (MAB)

Discovered in: Group C, Repeat 2

Result: −0.011 val_bpb improvement (modest, comparable to Group A baseline)

How it works: Treats each editable parameter as a "bandit arm." Maintains per-parameter statistics (number of times proposed, cumulative improvement, UCB score). At each iteration, selects which parameters to modify based on Upper Confidence Bound exploration-exploitation tradeoff.

Why it underperformed: While theoretically sound, the Multi-Scale Bandit may have been less effective at forcing exploration of the batch size dimension specifically. The UCB exploration bonus may not have been calibrated aggressively enough for the small iteration budget (30 iterations).

from math import log, sqrt

class MultiScaleBanditProposer:
    def __init__(self, params):
        self.counts = {p: 0 for p in params}
        self.rewards = {p: 0.0 for p in params}

    def update(self, param, reward):
        # Record one proposal outcome (reward = val_bpb improvement achieved)
        self.counts[param] += 1
        self.rewards[param] += reward

    def select_params(self, n_select=2):
        ucb_scores = {}
        total = sum(self.counts.values()) + 1
        for p in self.counts:
            if self.counts[p] == 0:
                ucb_scores[p] = float('inf')  # Explore unseen parameters first
            else:
                exploit = self.rewards[p] / self.counts[p]
                explore = sqrt(2 * log(total) / self.counts[p])
                ucb_scores[p] = exploit + explore
        return sorted(ucb_scores, key=ucb_scores.get, reverse=True)[:n_select]

Mechanism 3: Orthogonal Exploration (Discovered by Level 2)

Source domain: Design of experiments (DOE)

Discovered in: Group C, Repeat 3

Result: −0.058 val_bpb improvement (strong, second-best overall)

How it works: Generates orthogonal arrays of parameter combinations, ensuring that each pair of parameters is explored independently. This prevents the confounding that occurs when the LLM changes multiple correlated parameters simultaneously.

Why it helps: LLMs tend to propose "package deals"—changing multiple parameters together based on folk wisdom (e.g., "if you increase batch size, decrease learning rate"). Orthogonal exploration breaks these correlations, enabling the system to discover that some "conventional wisdom" parameter couplings are suboptimal for the specific task.
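
As an illustration of the DOE idea (my sketch, not the generated mechanism's actual code), a two-level L4 orthogonal array covers every pair of three parameters with all four low/high combinations in only 4 runs, instead of the 2³ = 8 runs of a full factorial:

```python
# L4 orthogonal array: 4 runs x 3 two-level factors; every pair of columns
# contains each of the four (low, high) combinations exactly once.
L4 = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

def orthogonal_proposals(levels):
    """levels: {param: (low_value, high_value)} for exactly three parameters."""
    names = list(levels)
    return [
        {name: levels[name][bit] for name, bit in zip(names, row)}
        for row in L4
    ]

# Hypothetical low/high values for three of the editable parameters:
proposals = orthogonal_proposals({
    "LR": (1e-4, 3e-4),
    "WEIGHT_DECAY": (0.0, 0.1),
    "TOTAL_BATCH_SIZE": (128, 256),
})
```

Because the columns vary independently across runs, each parameter's effect can be estimated without the confounding introduced when correlated "package deal" changes move several parameters at once.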

Mechanism 4: GP Regressor (Attempted by Level 2)

Source domain: Bayesian optimization

Discovered in: Group D, Repeat 2

Result: Reverted (required sklearn, which was not installed)

Significance: Demonstrates that Level 2 has no constraint on external dependencies. The validate-and-revert mechanism correctly handled the failure, but the exposure to arbitrary import requirements is a reliability risk.

Meta-Pattern: Why Generated Mechanisms Work

The paper identifies a unifying explanation for why Level 2 mechanisms improve over Level 1:

The generated mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid.

The LLM's implicit prior—trained on vast amounts of ML literature—creates strong biases about which hyperparameter changes are "reasonable." These biases cause:

  1. Parameter fixation: The LLM repeatedly proposes changes to the same parameters
  2. Correlation assumptions: Changes are proposed in conventional bundles (LR↓ when batch↑)
  3. Narrow exploration radius: Changes are conservative, staying close to known-good values
  4. Mode collapse: After a few successful changes, the LLM converges on a single search direction

Level 2 mechanisms break each of these patterns:

- Tabu search prevents fixation (memory of visited configs)
- Orthogonal exploration breaks correlations (independent parameter variation)
- Bandit selection addresses fixation (UCB forces under-explored parameters)

The Mechanism-vs-Parameter Distinction

The paper's sharpest empirical finding is the contrast between Level 1.5 and Level 2:

| Adjustment Type | Level | What Changes | Mean Improvement |
|---|---|---|---|
| Parameter-level | 1.5 | Which params are explored, in what order | −0.006 (≈ no gain) |
| Mechanism-level | 2 | How proposals are generated and filtered | −0.045 (5× gain) |

Adjusting what the inner loop searches is insufficient. Changing how it searches is necessary.

This distinction has profound implications for autoresearch system design: the search mechanism is the bottleneck, not the search parameters.


12 Programming Language

| Component | Language | Framework/Library |
|---|---|---|
| Bilevel framework | Python | Custom |
| Level 1 inner loop | Python | Custom (extends Karpathy autoresearch) |
| Level 1.5 strategy loop | Python | Custom |
| Level 2 mechanism research | Python | Custom + importlib |
| Generated mechanisms | Python | Various (Tabu, Bandit, DOE, attempted sklearn) |
| Task (GPT training) | Python | PyTorch |
| LLM interface | Python | DeepSeek API |

The entire system is pure Python with no multi-language complexity. The code injection mechanism relies on Python's dynamic nature (importlib, sys.modules, runtime patching).

Code Structure (Inferred from Paper + Repository)

bilevel-autoresearch/
├── runner.py                # Level 1 inner loop (target of L2 injection)
├── train.py                 # GPT training script (target of L1 optimization)
├── level15_strategy.py      # Level 1.5 search strategy logic
├── level2_research.py       # Level 2 mechanism research (4-round dialogue)
├── code_injector.py         # Validate-and-revert injection pipeline
├── search_trace.py          # Trace accumulator for proposals/outcomes
├── config.py                # SearchConfig dataclass
├── experiments/             # Experiment logs (all 12 runs)
│   ├── group_a/             # Level 1 only
│   ├── group_b/             # Level 1 + 1.5
│   ├── group_c/             # Level 1 + 1.5 + 2
│   └── group_d/             # Level 1 + 2
├── generated_mechanisms/    # Python modules generated by Level 2
│   ├── tabu_search.py       # Tabu Search Manager
│   ├── bandit_proposer.py   # Multi-Scale Bandit
│   └── orthogonal.py        # Orthogonal Exploration
└── README.md

Dynamic Code Injection Pattern

The runtime code injection is the most technically interesting aspect of the implementation:

import importlib
import sys

# Level 2 generates mechanism_code as a string
mechanism_code = llm.generate_mechanism(runner_source, trace)

# Back up the current runner before overwriting it
with open("runner.py") as f:
    backup = f.read()

# Patch runner.py with generated code
with open("runner.py", "w") as f:
    f.write(mechanism_code)

# Validate via importlib
try:
    # Evict the cached module so the reimport reads the new source
    if "runner" in sys.modules:
        del sys.modules["runner"]
    module = importlib.import_module("runner")
    # Verify expected interface exists
    assert hasattr(module, "run_inner_loop")
    print("Mechanism injected successfully")
except Exception as e:
    # Revert on any failure
    with open("runner.py", "w") as f:
        f.write(backup)
    print(f"Mechanism injection failed: {e}, reverted")

Critical subtlety: The sys.modules cleanup is essential. Without it, Python's module caching returns the old module on reimport, causing the "silent fallback" bug that invalidated a preliminary run.
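The caching behavior is easy to reproduce in isolation. This standalone sketch (the module name `demo_runner` is hypothetical) shows both the stale import and the fix:

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale .pyc complications in this demo
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
path = os.path.join(workdir, "demo_runner.py")

with open(path, "w") as f:
    f.write("VERSION = 1\n")
old = importlib.import_module("demo_runner")

# Rewrite the module on disk, as the injection pipeline does.
with open(path, "w") as f:
    f.write("VERSION = 2\n")

# A plain re-import hits sys.modules and silently returns the old code.
cached = importlib.import_module("demo_runner")
print(cached.VERSION)  # still 1 -- the "silent fallback"

# Evicting the cache entry forces a genuine reload of the new source.
del sys.modules["demo_runner"]
fresh = importlib.import_module("demo_runner")
print(fresh.VERSION)  # 2
```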


13 Memory Management

Search Trace as Working Memory

The search trace serves as the system's primary working memory. It accumulates all proposals and outcomes across inner iterations:

Trace structure:
[
  {
    iteration: int,
    params_changed: dict[str, Any],
    hypothesis: str,
    val_bpb_before: float,
    val_bpb_after: float,
    accepted: bool,
    frozen_params: list[str],
    guidance: str
  },
  ...
]
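In code, each entry could be a small dataclass. Field names follow the schema above, but the class itself is a sketch, not the repo's actual `search_trace.py`:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceEntry:
    """One inner-loop iteration record (fields mirror the trace schema)."""
    iteration: int
    params_changed: dict[str, Any]
    hypothesis: str
    val_bpb_before: float
    val_bpb_after: float
    accepted: bool
    frozen_params: list[str] = field(default_factory=list)
    guidance: str = ""

    @property
    def improvement(self) -> float:
        # Negative = better (val_bpb decreased).
        return self.val_bpb_after - self.val_bpb_before

trace: list[TraceEntry] = []
trace.append(TraceEntry(
    iteration=1,
    params_changed={"lr": 6e-4},
    hypothesis="Higher LR converges faster within the 300s budget",
    val_bpb_before=1.234,
    val_bpb_after=1.221,
    accepted=True,
))
```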

Memory growth: Linear in the number of iterations. With 30 iterations per repeat, the trace remains manageable within the LLM's context window.

Level-Specific Memory

| Level | Memory Type | Content | Lifetime |
|---|---|---|---|
| Level 1 | Current best config | train.py snapshot | Persist across iterations |
| Level 1 | Proposal history | Implicit in LLM context | Per-call (no explicit memory) |
| Level 1.5 | Search trace | Full proposal/outcome history | Persist across cycles |
| Level 1.5 | SearchConfig | Frozen/unfrozen params + guidance | Updated every 5 iterations |
| Level 2 | Runner source code | Current runner.py | Persist until next injection |
| Level 2 | Research session | 4-round dialogue context | Per-session (not persistent) |

No Cross-Run Memory

Unlike EvoScientist (which maintains persistent experience memory across runs), Bilevel Autoresearch has no cross-run learning. Each repeat starts fresh. This is both a limitation and a feature:

Limitation: The system cannot learn from prior experiment outcomes. Each repeat must rediscover effective mechanisms from scratch.

Feature: Results are independent, enabling clean statistical comparison across groups.

Context Window Pressure

The main memory bottleneck is the LLM context window at Level 2. The 4-round research dialogue must fit:

  1. Full runner.py source code
  2. Complete search trace (up to 30 iterations)
  3. Round 1–3 outputs (survey, critique, specification)
  4. Generated code (Round 4)

With DeepSeek's large context window, this is feasible for 30-iteration traces. Scaling to hundreds of iterations would require trace summarization or selective inclusion.
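One plausible summarization scheme, sketched here (not taken from the paper), is selective inclusion: always keep the earliest entries, every accepted proposal, and a recent window, and report only a count for the rest:

```python
def summarize_trace(trace, keep_recent=5, keep_first=2):
    """Selective inclusion for long traces.

    Keeps early context, all accepted proposals, and the most recent
    entries; everything else is compressed to a drop count.
    `trace` is a list of dicts with at least 'iteration' and 'accepted'.
    """
    n = len(trace)
    keep = set(range(min(keep_first, n)))            # early context
    keep |= set(range(max(0, n - keep_recent), n))   # recent window
    keep |= {i for i, e in enumerate(trace) if e["accepted"]}
    kept = [trace[i] for i in sorted(keep)]
    return {"kept": kept, "dropped": n - len(kept)}
```

For a 300-iteration trace this keeps the informative minority of entries while bounding prompt size roughly by the number of accepted proposals plus the window sizes.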

GPU Memory

| Component | GPU Memory |
|---|---|
| GPT model (50M params) | ~200MB (BF16) |
| Training batch | Variable (batch size is editable) |
| RTX 5090 total | 32GB |
| Available headroom | >30GB |

GPU memory is not a bottleneck. The 50M-parameter model is tiny relative to the RTX 5090's capacity.


14 Continued Learning

Current State: No Continued Learning

Bilevel Autoresearch does not implement any form of continued learning. Each repeat is independent. The mechanisms generated by Level 2 are not accumulated across repeats or sessions.

Comparison with EvoScientist

| Feature | EvoScientist | Bilevel Autoresearch |
|---|---|---|
| Cross-run memory | ✅ Persistent experience memory | ❌ No cross-run learning |
| Mechanism accumulation | ❌ Fixed mechanisms | ❌ Per-run mechanisms |
| Learning what works | ✅ Summarized lessons | ❌ Fresh start each run |

Potential Continued Learning Extensions

1. Mechanism Library

The most natural extension: maintain a library of successfully generated mechanisms across runs.

mechanism_library = {
    "tabu_search": {"code": "...", "success_count": 3, "avg_improvement": -0.055},
    "bandit": {"code": "...", "success_count": 1, "avg_improvement": -0.011},
    "orthogonal": {"code": "...", "success_count": 2, "avg_improvement": -0.047},
}

# Level 2 could:
# 1. Select from library (exploitation)
# 2. Generate new mechanism (exploration)
# 3. Combine/modify library mechanisms (crossover)

This would convert Level 2 from pure generation to a bandit over mechanism portfolios—a form of meta-meta-optimization.
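A minimal sketch of that bandit layer (the library structure and `select_mechanism` are hypothetical; reward is taken as −avg_improvement, so larger is better since lower val_bpb is the goal):

```python
from math import log, sqrt

def select_mechanism(library, c=2.0):
    """UCB1 over a mechanism library.

    Exploits mechanisms with strongly negative avg_improvement (big
    val_bpb reductions) while forcing trials of under-explored ones.
    `library` maps name -> {"success_count": int, "avg_improvement": float}.
    """
    total = sum(m["success_count"] for m in library.values()) or 1

    def ucb(item):
        _, m = item
        n = m["success_count"]
        if n == 0:
            return float("inf")  # untried mechanisms are selected first
        return -m["avg_improvement"] + sqrt(c * log(total) / n)

    return max(library.items(), key=ucb)[0]

library = {
    "tabu_search": {"success_count": 3, "avg_improvement": -0.055},
    "bandit": {"success_count": 1, "avg_improvement": -0.011},
    "orthogonal": {"success_count": 2, "avg_improvement": -0.047},
}
print(select_mechanism(library))  # "bandit": least tried, largest exploration bonus
```

Note the exploration bonus dominates here: the weakest mechanism is picked because it has only one trial, which is exactly the behavior that prevents premature convergence on tabu_search.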

2. Cross-Run Search Trace Accumulation

Aggregate traces across runs to give Level 2 richer context about which search patterns succeed:

accumulated_traces = [
    {"run_id": 1, "mechanism": "tabu_search", "improvement": -0.065, ...},
    {"run_id": 2, "mechanism": "bandit", "improvement": -0.011, ...},
    ...
]

Level 2 could then condition its mechanism generation on historical successes.

3. Mechanism Evolution

Apply evolutionary strategies to the mechanism library:

- Mutation: Level 2 generates variants of successful mechanisms
- Crossover: Combine features from multiple mechanisms (e.g., Tabu + DOE)
- Selection: Retain mechanisms that consistently improve across diverse tasks
- Population: Maintain a diverse population of active mechanisms

This would make the outer loop itself evolutionary—a natural extension of the bilevel framework.

4. Multi-Task Transfer

The current system is validated on one task. Continued learning across multiple tasks would test whether mechanisms generalize:

- Train a GPT model at different scales (50M, 100M, 500M)
- Optimize different architectures (Transformer, Mamba, RWKV)
- Apply to non-ML tasks with measurable objectives

Mechanisms that work across tasks would be "meta-transferable"—a strong form of learned optimization.

5. Bootstrapping from Prior Sessions

A practical extension for iterative research:

Session 1: Generate mechanisms, select best → persist
Session 2: Start from best mechanism of Session 1 → refine
Session 3: Start from best mechanism of Session 2 → refine
...

This sequential refinement could compound improvements across sessions, though diminishing returns are likely.

The Self-Improvement Horizon

The paper's central thesis—"autoresearch can meta-autoresearch itself"—implies a natural question: can the outer loop meta-optimize the outer loop? A Level 3 that generates improvements to Level 2's research methodology would be a third level of recursion.

The authors do not explore this direction, but the principle extends: if Level 2 can generate search mechanisms, a Level 3 could generate Level 2 research session structures (e.g., changing the 4-round dialogue to a different format, adding constraint-checking rounds, etc.).

The practical limit is likely context window exhaustion—each additional level adds overhead that compresses the effective budget for the inner level.


15 Applications

Direct Applications

1. Neural Architecture Search (NAS)

The bilevel framework naturally extends to NAS:

- Inner loop: Evaluate candidate architectures
- Outer loop: Generate search strategies for the architecture space
- The batch size discovery in the paper (where Level 2 mechanisms unlocked batch size as a critical parameter) demonstrates this potential

2. Hyperparameter Optimization at Scale

For organizations running many training jobs:

- Deploy Level 2 to discover task-specific optimization strategies
- Accumulate a mechanism library across projects
- Select or generate mechanisms based on task similarity

3. AutoML Pipeline Design

The bilevel principle extends beyond hyperparameters:

- Inner loop: Run an AutoML pipeline (feature selection, model selection, ensembling)
- Outer loop: Generate improvements to the pipeline logic itself

4. Reinforcement Learning Algorithm Discovery

RL algorithms are essentially search mechanisms over policy space:

- Inner loop: Standard RL training (PPO, SAC, etc.)
- Outer loop: Generate modifications to the RL algorithm (reward shaping, exploration strategies, curriculum design)

5. Scientific Experiment Design

For automated laboratories:

- Inner loop: Run experiments according to a design strategy
- Outer loop: Generate better experimental design strategies based on outcomes

Methodological Applications

6. Understanding LLM Search Biases

The paper provides empirical evidence of specific LLM search biases:

- Recency bias: Proposals cluster near recent successful configs
- Parameter fixation: Some parameters are over-explored while others are ignored
- Correlation assumptions: Parameters are changed in "conventional" bundles

These findings inform the design of any LLM-guided optimization system: the LLM's priors are both its strength (domain knowledge) and its weakness (systematic blind spots).

7. Mechanism Design for Evolutionary Systems

The generated mechanisms (Tabu, Bandit, DOE) provide a vocabulary of diversity-injection strategies that apply broadly to evolutionary and LLM-guided optimization:

| Mechanism | Diversity Injection Pattern | Applicable To |
|---|---|---|
| Tabu Search | Memory-based visit avoidance | Any iterative search |
| UCB Bandit | Exploration-exploitation tradeoff | Multi-objective optimization |
| Orthogonal Exploration | Decorrelation of parameter changes | High-dimensional search |
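The memory-based visit avoidance pattern is simple enough to state generically. This `TabuList` is an illustrative sketch of the idea, not the system's generated Tabu Search Manager:

```python
from collections import deque

class TabuList:
    """Memory-based visit avoidance: reject configurations explored in
    the last `tenure` iterations, breaking parameter fixation."""

    def __init__(self, tenure=10):
        self.recent = deque(maxlen=tenure)  # old entries age out automatically

    def is_tabu(self, config):
        return self._key(config) in self.recent

    def visit(self, config):
        self.recent.append(self._key(config))

    @staticmethod
    def _key(config):
        # Order-insensitive, hashable key for a config dict.
        return tuple(sorted(config.items()))

tabu = TabuList(tenure=2)
tabu.visit({"lr": 3e-4, "batch": 32})
tabu.visit({"lr": 6e-4, "batch": 32})
print(tabu.is_tabu({"batch": 32, "lr": 3e-4}))  # True (key order-insensitive)
tabu.visit({"lr": 1e-3, "batch": 64})           # evicts the oldest entry
print(tabu.is_tabu({"lr": 3e-4, "batch": 32}))  # False (aged out)
```

The bounded tenure matters: a permanent blacklist would eventually forbid the whole search space, while a sliding window only discourages short-term revisiting.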

Limitations for Applications

  1. Single benchmark: Generalization beyond GPT-50M pretraining is unproven
  2. Small sample size: n=3 is insufficient for production deployment decisions
  3. Code injection fragility: Silent fallback failures are dangerous in production
  4. No dependency management: Generated code can import arbitrary libraries
  5. Prompt-induced domain bias: Level 2 prompt suggests specific domains, limiting discovery scope
  6. No safety guarantees: Generated mechanisms could cause training instability, resource exhaustion, or other unintended effects
  7. LLM cost at scale: Each Level 2 session requires 4 LLM calls; at high outer-loop frequency, API costs accumulate

Broader Significance for the Autoresearch Field

Bilevel Autoresearch makes a conceptual contribution that extends beyond its specific results:

The fundamental bottleneck in autoresearch is not the inner loop's search parameters but its search mechanism. Improving how the system searches is categorically more valuable than improving what it searches. This insight—that mechanism design trumps parameter tuning—aligns with decades of optimization theory but had not been empirically demonstrated in the context of LLM-guided autoresearch.

The paper establishes a new axis of variation for autoresearch systems:

| Axis | Examples | Level of Innovation |
|---|---|---|
| Task parameters | Hyperparameters, architecture choices | Low (standard optimization) |
| Search parameters | Temperature, exploration rate, batch size | Medium (meta-parameter tuning) |
| Search mechanisms | Proposal logic, acceptance criteria, memory | High (algorithm design) |

Prior systems operate on the first two axes. Bilevel Autoresearch opens the third axis to autonomous optimization.


| System | Inner Loop | Outer Loop | Meta-Optimization Target | LLM at Meta Level | Code Injection |
|---|---|---|---|---|---|
| Karpathy autoresearch | Propose-evaluate-accept | ❌ None | N/A | N/A | N/A |
| AutoResearchClaw | Multi-batch parallel | ❌ None (human-designed) | N/A | N/A | N/A |
| EvoScientist | With experience memory | ❌ None (human-designed) | N/A | N/A | N/A |
| FunSearch | LLM program generation | Evolutionary selection | Task programs | Same LLM | ❌ (evolves task code) |
| Bilevel Autoresearch | Standard autoresearch | ✅ LLM-driven | Search mechanism code | Same LLM | ✅ Runtime injection |
| AlphaEvolve | LLM mutation + eval | MAP-Elites database | Task programs | Ensemble | ❌ (evolves task code) |
| OpenEvolve | LLM mutation + eval | Program database | Task programs | Ensemble | ❌ (evolves task code) |

The critical distinction: FunSearch and AlphaEvolve evolve task-level programs. Bilevel Autoresearch evolves the search mechanism itself. The target of evolution is one level of abstraction higher—it optimizes the optimizer rather than the solution.


Appendix: Detailed Experimental Configuration

Training Configuration

| Parameter | Value | Editable? |
|---|---|---|
| DEPTH | 8 | ❌ Frozen |
| ASPECT_RATIO | 64 | ❌ Frozen |
| LR | Variable | ✅ |
| WEIGHT_DECAY | Variable | ✅ |
| WINDOW_PATTERN | Variable | ✅ |
| HEAD_DIM | Variable | ✅ |
| TOTAL_BATCH_SIZE | Variable | ✅ |
| Training budget | 300 seconds | Fixed |
| Model size | 50M parameters | Fixed |
| Hardware | RTX 5090 | Fixed |

Ablation Group Configuration

| Group | Level 1 | Level 1.5 | Level 2 | Inner Iters | Outer Period | L2 Period |
|---|---|---|---|---|---|---|
| A | ✅ | ❌ | ❌ | 30 | N/A | N/A |
| B | ✅ | ✅ | ❌ | 30 | K=5 | N/A |
| C | ✅ | ✅ | ✅ | 30 | K=5 | M=2 |
| D | ✅ | ❌ | ✅ | 30 | N/A | M=2 (triggered by iteration count) |

Complete Results Table

| Group | Repeat | Δval_bpb | Mechanism Injected | Notes |
|---|---|---|---|---|
| A | R1 | −0.011 | N/A | Consistent small improvement |
| A | R2 | −0.009 | N/A | Consistent small improvement |
| A | R3 | −0.007 | N/A | Consistent small improvement |
| B | R1 | −0.000 | N/A | L1.5 froze stalled params; LLM found nothing better |
| B | R2 | −0.012 | N/A | Best in Group B |
| B | R3 | −0.005 | N/A | Moderate |
| C | R1 | −0.065 | Tabu Search Manager | Best overall; mechanism broke fixation |
| C | R2 | −0.011 | Multi-Scale Bandit | Modest; bandit less effective for this task |
| C | R3 | −0.058 | Orthogonal Exploration | Strong; decorrelated parameter search |
| D | R1 | −0.065 | (Mechanism not named) | Matched C-R1's best |
| D | R2 | −0.001 | GP Regressor (reverted) | sklearn missing → fallback to L1 |
| D | R3 | −0.036 | (Mechanism not named) | Moderate improvement |