
Bilevel Autoresearch

A bilevel framework in which an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime.

Organization: Independent Research (Yaonan Qu, Meng Lu)
Published: March 24, 2026
Type: paper/repo
Report Type: PhD-Level Technical Analysis
Report Date: April 2026

1 Full Title and Attribution

Full Title: Bilevel Autoresearch: Meta-Autoresearching Itself

ArXiv: arXiv:2603.23420 (cs.AI)

Repository: github.com/EdwardOptimization/Bilevel-Autoresearch

License: Open release (code and experiment logs)

Status: Research prototype with published experiment logs

Lineage: Extends the Karpathy autoresearch paradigm (2026) by adding a bilevel meta-optimization layer. Directly references AutoResearchClaw (AIMing Lab, 2026) and EvoScientist (2026) as prior systems that required human-designed improvements.

"If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop."


2 Authors and Team

Author Affiliation Contact
Yaonan Qu Independent Researcher EdwardOptimization@gmail.com
Meng Lu Independent Researcher menglu_16@connect.hku.hk

This is a notable contribution from independent researchers outside the major academic-industrial labs. The research demonstrates that fundamental advances in autoresearch methodology do not require institutional infrastructure—the experiments run on a single RTX 5090 GPU with a commodity LLM API (DeepSeek).

Meng Lu's HKU email suggests a connection to the University of Hong Kong, though the paper lists "Independent Researcher" as the affiliation for both authors.


3 Core Contribution

The Foundational Observation

Every existing autoresearch system was improved by a human who:

  1. Read the system's code
  2. Identified a bottleneck
  3. Wrote new code to address it

System Human-Designed Improvement Year
Karpathy autoresearch Single-track propose-evaluate-accept loop 2026
AutoResearchClaw Multi-batch parallel search (branching factor ↑) 2026
EvoScientist Persistent experience memory across runs 2026

The authors ask: can an LLM perform that same design step autonomously?

The Answer: Bilevel Optimization

The paper formalizes autoresearch as a bilevel optimization problem:

min_φ  F(φ, θ*(φ))           # Outer: optimize search mechanism φ
s.t.   θ*(φ) ∈ argmin_θ f(θ, φ)   # Inner: optimize task performance θ

where:

  • φ — the search mechanism (runner code), a discrete artifact produced by code generation
  • θ — the training configuration optimized by the inner loop
  • f — the inner objective (validation loss after training with θ)
  • F — the outer objective (best validation loss achieved by the inner loop)

The key departure from classical bilevel optimization: φ is a program, not a parameter vector. The outer level generates and replaces code rather than adjusting continuous parameters.
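
The code-as-search-space idea can be illustrated with a minimal toy sketch (my own illustration, not the paper's implementation): the outer level injects generated source code as the inner loop's proposal mechanism, then scores that mechanism by the best result the inner loop reaches.

```python
# Minimal toy sketch of the bilevel structure (illustrative only; the real
# system edits runner.py, uses an LLM to generate phi, and trains a GPT).

def inner_loop(propose, evaluate, steps=20):
    """Level 1: propose -> evaluate -> accept-if-improved, under mechanism `propose`."""
    theta, best = 0.0, float("inf")
    for _ in range(steps):
        candidate = propose(theta)
        loss = evaluate(candidate)
        if loss < best:                      # greedy acceptance, as in the inner loop
            best, theta = loss, candidate
    return best

def outer_step(mechanism_source, evaluate):
    """Level 2: turn generated source code into a live `propose` and score it."""
    namespace = {}
    exec(mechanism_source, namespace)        # runtime code injection
    return inner_loop(namespace["propose"], evaluate)

# A toy "generated mechanism": random local perturbation of theta.
mechanism = """
import random
def propose(theta):
    return theta + random.uniform(-1.0, 1.0)
"""
score = outer_step(mechanism, evaluate=lambda x: (x - 0.7) ** 2)  # F(phi, theta*(phi))
```

Here F(φ, θ*(φ)) is simply the best loss the injected mechanism reaches; swapping in a different `mechanism` string changes how the inner loop searches without touching its accept/reject skeleton.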

Three-Level Architecture

┌─────────────────────────────────────────────────────────────┐
│  LEVEL 2: Mechanism Research (Meta-Autoresearch)            │
│  Generates new Python search mechanisms via 4-round         │
│  LLM dialogue. Injects code at runtime.                     │
│  Executes: every 2 outer cycles                             │
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  LEVEL 1.5: Search Strategy Adjustment                │  │
│  │  Freezes/unfreezes parameters. Injects guidance.      │  │
│  │  Executes: every 5 inner iterations                   │  │
│  │                                                       │  │
│  │  ┌─────────────────────────────────────────────────┐  │  │
│  │  │  LEVEL 1: Inner Autoresearch Loop               │  │  │
│  │  │  Standard propose → train → evaluate → accept   │  │  │
│  │  │  Executes: every iteration (30 per repeat)      │  │  │
│  │  └─────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Core Result

Level 2 (mechanism research) achieves a 5× improvement over Level 1 alone: −0.045 vs. −0.009 val_bpb on Karpathy's GPT pretraining benchmark. Level 1.5 (parameter adjustment) yields no reliable gain over Level 1 alone.

The generated mechanisms—Tabu Search, Multi-Scale Bandit, Orthogonal Exploration—are autonomously discovered from combinatorial optimization, online learning, and design of experiments, without human specification of which domains to explore.


4 Supported Solutions

Solution Type Support Level Details
Neural network hyperparameter optimization Primary target GPT pretraining configuration search
Autoresearch loop improvement Core mechanism Self-modification of search logic
Search mechanism generation Core mechanism Novel Python algorithms generated at runtime
Bilevel optimization Framework Any problem expressible as nested optimization
Algorithm discovery Demonstrated Tabu Search, Bandits, DOE discovered autonomously

Task Scope

The current implementation targets a specific but representative task:

Property Value
Task GPT pretraining optimization
Model size 50M parameters
Training budget 300 seconds per evaluation
Hardware Single RTX 5090
Metric Validation bits-per-byte (val_bpb)
Editable parameters LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, TOTAL_BATCH_SIZE, etc.
Frozen parameters DEPTH=8, ASPECT_RATIO=64 (architecture size locked)

Generality Claim

The paper's central thesis is that the bilevel framework is task-agnostic: any problem with a measurable objective can be meta-autoresearched. The authors explicitly state:

"The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective."

However, this generality is unproven beyond the single benchmark. The authors acknowledge this in their limitations section.


5 LLM Integration

Single-Model Architecture

A defining feature of Bilevel Autoresearch is that all three levels use the same LLM:

Property Value
Model DeepSeek deepseek-chat
Provider DeepSeek API
Used at Level 1 Proposes hyperparameter changes
Used at Level 1.5 Generates SearchConfig updates
Used at Level 2 4-round research dialogue generating Python code
Stronger model required? No — improvements come from architecture, not model capability

"Both loops use the same LLM—no stronger model is needed at the meta level."

This is a significant design choice. It means the outer loop cannot succeed by simply being "smarter"—it must succeed through structural advantage (bilevel architecture, code injection, mechanism diversity).

Level 1: LLM as Configuration Proposer

At each inner iteration, the LLM receives:

  1. Current train.py (frozen at best accepted configuration)
  2. List of active editable parameters
  3. List of frozen parameters
  4. Strategic guidance string (injected by Level 1.5)

The LLM proposes:

  • A set of parameter name–value pairs
  • A one-sentence hypothesis explaining the rationale

Input to LLM (Level 1):
  - Current train.py (best config)
  - Editable: [LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, ...]
  - Frozen: [DEPTH=8, ASPECT_RATIO=64]
  - Guidance: "Prioritize batch size exploration; LR has stalled"

Output from LLM:
  - Changes: {TOTAL_BATCH_SIZE: 128 → 256, LR: 3e-4 → 1e-4}
  - Hypothesis: "Larger batch with lower LR may stabilize training"
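
This exchange can be sketched as follows, assuming a JSON-replying LLM client. `level1_step`, `llm_call`, and the JSON schema are my illustrative assumptions, not the repository's API:

```python
import json

def level1_step(train_py, editable, frozen, guidance, llm_call):
    """One Level 1 proposal: build the prompt, parse the reply, filter frozen params."""
    prompt = (
        f"Current train.py:\n{train_py}\n"
        f"Editable: {editable}\nFrozen: {frozen}\n"
        f"Guidance: {guidance}\n"
        'Reply as JSON: {"changes": {...}, "hypothesis": "..."}'
    )
    proposal = json.loads(llm_call(prompt))
    # Defensively drop any edits to frozen parameters before applying them.
    proposal["changes"] = {
        k: v for k, v in proposal["changes"].items() if k not in frozen
    }
    return proposal

# Stand-in LLM for the sketch; a real call would hit the DeepSeek API.
fake_llm = lambda _: (
    '{"changes": {"TOTAL_BATCH_SIZE": 256, "DEPTH": 12},'
    ' "hypothesis": "larger batch may stabilize training"}'
)
proposal = level1_step("...", ["TOTAL_BATCH_SIZE", "LR"], ["DEPTH"],
                       "Prioritize batch size exploration", fake_llm)
# DEPTH is frozen, so only the TOTAL_BATCH_SIZE change survives the filter.
```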

Level 1.5: LLM as Search Strategist

Every 5 inner iterations, the LLM receives the full trace of proposals and outcomes and generates a SearchConfig update:

  • Freeze parameters proposed ≥k times (default k=3) with zero net improvement
  • Unfreeze parameters that were frozen but haven't been explored since the search moved to a new region
  • Inject guidance instructing the inner loop to prioritize under-explored parameters

Critical constraint: Level 1.5 cannot change the proposal generation logic, acceptance criterion, or loop structure. These require Level 2.
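
The freeze rule above can be sketched deterministically (field and function names are illustrative assumptions; in the real system the LLM itself produces the SearchConfig after reading the trace):

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SearchConfig:
    freeze: list = field(default_factory=list)
    unfreeze: list = field(default_factory=list)
    guidance: str = ""

def update_search_config(trace, k=3):
    """Freeze any parameter proposed >= k times with zero net improvement."""
    counts = defaultdict(int)
    net_improvement = defaultdict(float)   # negative delta = improvement
    for entry in trace:
        for param in entry["params"]:
            counts[param] += 1
            net_improvement[param] += entry["delta"]
    cfg = SearchConfig(guidance="Prioritize under-explored parameters")
    cfg.freeze = [p for p, c in counts.items()
                  if c >= k and net_improvement[p] >= 0.0]
    return cfg

trace = [
    {"params": {"LR": 1e-4}, "delta": 0.0},
    {"params": {"LR": 2e-4}, "delta": 0.0},
    {"params": {"LR": 5e-4}, "delta": 0.0},
    {"params": {"HEAD_DIM": 64}, "delta": -0.004},
]
cfg = update_search_config(trace)
# LR stalled three times -> frozen; HEAD_DIM improved, so it stays editable.
```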

Level 2: LLM as Mechanism Researcher

Level 2 is the most novel LLM integration. It conducts a 4-round structured dialogue, each round a single LLM call:

Round 1: EXPLORE
  Input:  Full runner.py source code + search trace
  Output: Survey of mechanisms from adjacent fields
          (combinatorial optimization, online learning, DOE, 
           Bayesian optimization)

Round 2: CRITIQUE  
  Input:  Candidate mechanisms from Round 1
  Output: Evaluation against observed failure modes
          (e.g., repetitive proposals, parameter fixation)
          Selection of most promising mechanism

Round 3: SPECIFY
  Input:  Selected mechanism from Round 2
  Output: Precise interface specification:
          - Class name
          - Constructor arguments
          - Key methods with signatures
          - Integration points in runner.py

Round 4: GENERATE
  Input:  Specification from Round 3
  Output: Complete, runnable Python code implementing the
          mechanism + modifications to runner.py

Each Level 2 session takes approximately 3 minutes of wall time (four LLM calls). Code is validated via importlib dynamic loading before activation. If import fails, the original runner is restored from backup.
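
The session structure can be sketched as a simple chain of four calls, each round's output feeding the next round's input (prompt wording and the `llm_call` interface are assumptions for illustration):

```python
# The four rounds chained into one session; prompt templates are illustrative.
ROUNDS = [
    ("EXPLORE",  "Survey mechanisms from adjacent fields.\n{context}"),
    ("CRITIQUE", "Evaluate candidates against observed failure modes; pick one.\n{context}"),
    ("SPECIFY",  "Write a precise class/interface specification.\n{context}"),
    ("GENERATE", "Emit complete, runnable Python implementing the spec.\n{context}"),
]

def research_session(runner_source, search_trace, llm_call):
    context = f"runner.py:\n{runner_source}\n\ntrace:\n{search_trace}"
    for _name, template in ROUNDS:
        # One LLM call per round; ~4 calls (~3 minutes) per session.
        context = llm_call(template.format(context=context))
    return context  # Round 4 output: the mechanism code to validate and inject

# Stand-in LLM so the sketch runs without an API key.
echo = lambda prompt: "CODE:" + prompt[:40]
generated = research_session("def run(): pass", "[]", echo)
```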

LLM Capabilities Exercised

Capability Level 1 Level 1.5 Level 2
Hyperparameter reasoning ✓ ✓ —
Search trace analysis — ✓ ✓
Code comprehension — — ✓
Cross-domain knowledge — — ✓
Interface design — — ✓
Python code generation — — ✓
Self-critique — — ✓

6 Key Results

Four-Group Ablation Design

The experiment uses a controlled four-group ablation with 3 repeats per group:

Group Levels Active Description
A (Baseline) Level 1 only Standard inner autoresearch loop
B Level 1 + 1.5 Inner loop + parameter-level adjustment
C Level 1 + 1.5 + 2 Full bilevel system
D Level 1 + 2 Inner loop + mechanism research (no Level 1.5)

Primary Results

Group Mean Δval_bpb Std Dev Best Single Worst Single
A (L1 only) −0.009 ±0.002 −0.011 −0.007
B (L1 + L1.5) −0.006 ±0.006 −0.012 −0.000
C (L1 + L1.5 + L2) −0.045 ±0.030 −0.065 −0.011
D (L1 + L2) −0.034 ±0.031 −0.065 −0.001

Group C achieves a 5× improvement over Group A (−0.045 vs. −0.009). The improvement is driven entirely by Level 2 mechanism generation. Level 1.5 provides no reliable gain over Level 1 alone.

Hypothesis Testing

Hypothesis Result Evidence
H1: Group B > Group A Not supported B mean (−0.006) actually slightly worse than A (−0.009); B's R1 achieved −0.000
H2: Group C >> Group B Supported C mean (−0.045) is 7.5× B's (−0.006), with large separation
H3: L2 discovers novel mechanisms Supported Three distinct domains discovered (combinatorial opt, online learning, DOE)
H4: Group D ≈ Group C Supported (with caveats) D mean (−0.034) close to C (−0.045), but D has a zero-improvement repeat

Mechanisms Autonomously Discovered

Mechanism Domain Repeat Injection Success Effect
Tabu Search Manager Combinatorial Optimization C-R1 −0.065 (best overall)
Multi-Scale Bandit Proposer Online Learning (MAB) C-R2 −0.011 (modest)
Orthogonal Exploration Design of Experiments C-R3 −0.058 (strong)
GP Regressor Bayesian Optimization D-R2 ❌ (missing sklearn) Reverted
(Two additional mechanisms) Various D-R1, D-R3 Variable

Key insight: The discovered mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid. The LLM's default proposals exhibit strong mode-seeking behavior (repeatedly proposing similar changes); the meta-generated mechanisms inject structured diversity.

Per-Repeat Breakdown

Group A (Level 1 only):
  R1: −0.011  R2: −0.009  R3: −0.007
  Pattern: Consistent small improvements, tight variance

Group B (Level 1 + 1.5):
  R1: −0.000  R2: −0.012  R3: −0.005
  Pattern: High variance, L1.5 sometimes hurts (R1 = zero improvement)

Group C (Level 1 + 1.5 + 2):
  R1: −0.065  R2: −0.011  R3: −0.058
  Pattern: Two dramatic improvements, one modest; R2 underperformed

Group D (Level 1 + 2):
  R1: −0.065  R2: −0.001  R3: −0.036
  Pattern: Similar to C but slightly less robust; R2 near-zero

Why Group C-R2 Underperformed

The Multi-Scale Bandit Proposer generated for C-R2, while valid and correctly injected, was less effective at forcing exploration of the batch size dimension compared to Tabu Search (R1) or Orthogonal Exploration (R3). Additionally, each Level 2 session consumes ~3 minutes of wall time (four LLM calls), reducing effective inner iterations. With two sessions per repeat, Group C has ~6 minutes less search time.


7 Reproducibility

What Is Released

Artifact Location Description
Source code GitHub Full implementation including all three levels
Experiment logs GitHub Complete traces for all 12 runs (4 groups × 3 repeats)
Generated mechanisms In logs Python code for Tabu Search, Bandit, Orthogonal Exploration, etc.

Reproducibility Challenges

The paper is transparent about significant reproducibility limitations:

1. Small Sample Size (n=3)

"Three repeats per group is insufficient for rigorous statistical comparison. Group C's standard deviation (±0.030) is 67% of its absolute mean, indicating high variability. Reliable estimates would require n≥10 repeats per group."

With only 3 repeats, the results are suggestive but not statistically robust. The variance is large enough that Group C's superiority, while dramatic in magnitude, has wide confidence intervals.

2. Baseline Variance

Baseline val_bpb varies across repeats (1.094–1.114) due to training randomness from data ordering and weight initialization. Using Δ = best − baseline mitigates but does not eliminate this confound.

3. Single Benchmark

All results are on one task: GPT pretraining at 50M parameters with a 300-second budget on RTX 5090. Generalization to other model sizes, training budgets, or tasks is unproven.

4. Dynamic Loading Fragility

"A preliminary run was invalidated because the Level 2 dynamic loading pipeline contained a sys.modules registration bug, causing all mechanism injections to silently fall back to the original runner."

This "silent fallback" failure mode is particularly dangerous: the system appears to inject mechanisms but actually runs the original code. The bug was fixed before the reported results, but it highlights the fragility of runtime code injection.

5. LLM Nondeterminism

The DeepSeek API returns stochastic outputs. The same Level 2 dialogue may produce different mechanisms across runs, making exact reproduction impossible.

What Would Be Needed for Full Reproducibility

  • Fixed random seeds for training
  • n≥10 repeats per group
  • Multiple benchmarks (different model sizes, tasks)
  • Deterministic LLM outputs (fixed temperature=0, though this doesn't guarantee determinism with API-served models)

8 Compute and API Costs

Hardware

Component Value
GPU Single NVIDIA RTX 5090
Training budget per evaluation 300 seconds
Inner iterations per repeat 30
Total training time per repeat ~30 × 300s = ~2.5 hours (+ overhead)
Total experiment time 12 repeats × ~2.5 hours = ~30 hours GPU time

LLM API Costs

Level Calls per Repeat Call Type Estimated Cost
Level 1 30 Short proposal generation Low (small prompt + response)
Level 1.5 6 Search trace analysis Moderate (full trace in context)
Level 2 8 (2 sessions × 4 rounds) Long code generation Higher (full runner.py + trace + code output)
Total per repeat ~44 Mixed Estimated $1–5
Total experiment ~528 Mixed Estimated $12–60

The use of DeepSeek's API (among the cheapest frontier LLM APIs available) keeps costs remarkably low. The entire experiment likely cost less than $100 in API fees.

System Hardware LLM Provider Estimated Total Cost
Bilevel Autoresearch 1× RTX 5090 DeepSeek ~$100
Karpathy autoresearch 1× GPU Claude/GPT-4 ~$20–50 per run
AutoResearchClaw Multi-GPU Various Higher (parallel batches)
AlphaEvolve Google TPU pods Gemini (internal) Not disclosed (massive)
FunSearch Google infrastructure PaLM 2 (internal) Not disclosed (massive)

Bilevel Autoresearch is notable for achieving its results at consumer-hardware scale. The entire experiment runs on a single gaming GPU with a budget LLM API.


9 Architecture Solution

Bilevel Optimization Formalization

OUTER LEVEL (Level 2):
  Optimize φ (search mechanism = runner code)
  Objective: F(φ, θ*(φ)) = best val_bpb achieved by inner loop
  Search space: Python programs (discrete, combinatorial)
  Search method: LLM code generation (4-round dialogue)

INNER LEVEL (Level 1 + 1.5):
  Optimize θ (training configuration)
  Objective: f(θ, φ) = val_bpb after 300s training
  Search space: Hyperparameter configurations
  Search method: LLM-guided propose-evaluate-accept

Full System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                   BILEVEL AUTORESEARCH ARCHITECTURE                  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  LEVEL 2: MECHANISM RESEARCH                         [Green]   │  │
│  │                                                                │  │
│  │  Trigger: Every 2 outer cycles (Level 1.5 cycles)              │  │
│  │                                                                │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │  │
│  │  │ Round 1   │  │ Round 2   │  │ Round 3   │  │ Round 4   │      │  │
│  │  │ EXPLORE   │→│ CRITIQUE  │→│ SPECIFY   │→│ GENERATE  │      │  │
│  │  │           │  │           │  │           │  │           │      │  │
│  │  │ Survey    │  │ Evaluate  │  │ Interface │  │ Write     │      │  │
│  │  │ adjacent  │  │ against   │  │ spec:     │  │ complete  │      │  │
│  │  │ fields    │  │ observed  │  │ class,    │  │ runnable  │      │  │
│  │  │ for ideas │  │ failures  │  │ methods,  │  │ Python    │      │  │
│  │  │           │  │           │  │ hooks     │  │ code      │      │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └─────┬────┘      │  │
│  │                                                    │           │  │
│  │                                          ┌─────────▼────────┐ │  │
│  │                                          │ importlib.import  │ │  │
│  │                                          │ Validate → Inject │ │  │
│  │                                          │ or Revert backup  │ │  │
│  │                                          └─────────┬────────┘ │  │
│  └────────────────────────────────────────────────────┼──────────┘  │
│                                                       │              │
│  ┌────────────────────────────────────────────────────┼──────────┐  │
│  │  LEVEL 1.5: SEARCH STRATEGY LOOP           [Amber] │          │  │
│  │                                                     │          │  │
│  │  Trigger: Every 5 inner iterations                  │          │  │
│  │                                                     │          │  │
│  │  Reads: Full proposal/outcome trace                 │          │  │
│  │  Outputs: SearchConfig {                            │          │  │
│  │    freeze: [params stalled ≥3 times]                │          │  │
│  │    unfreeze: [params not explored in new region]    │          │  │
│  │    guidance: "Prioritize under-explored params"     │          │  │
│  │  }                                                  │          │  │
│  │                                                     ▼          │  │
│  │  CANNOT change: proposal logic, acceptance rule,  ┌────────┐  │  │
│  │                  loop structure                    │ Config │  │  │
│  │                                                   └───┬────┘  │  │
│  └───────────────────────────────────────────────────────┼───────┘  │
│                                                          │          │
│  ┌───────────────────────────────────────────────────────┼───────┐  │
│  │  LEVEL 1: INNER AUTORESEARCH LOOP              [Blue] │       │  │
│  │                                                       ▼       │  │
│  │  for t in range(30):  # iteration budget              │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  LLM reads   │  ← train.py (best config)        │       │  │
│  │    │  context +    │  ← editable/frozen param lists   │       │  │
│  │    │  guidance     │  ← L1.5 guidance string          │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Propose     │  → {param: value} + hypothesis    │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Train 300s  │  → val_bpb                        │       │  │
│  │    └──────┬──────┘                                    │       │  │
│  │           ▼                                           │       │  │
│  │    ┌─────────────┐                                    │       │  │
│  │    │  Accept if   │  val_bpb < current_best?          │       │  │
│  │    │  improved    │  YES → update best config         │       │  │
│  │    │              │  NO  → discard, keep current      │       │  │
│  │    └─────────────┘                                    │       │  │
│  └───────────────────────────────────────────────────────────────┘  │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────────┐  │
│  │  TASK: GPT Pretraining (50M params, RTX 5090, 300s budget)     │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────────┘

Algorithm (Pseudocode)

Algorithm: Bilevel Autoresearch (Group C configuration)

Input: baseline train.py, runner φ₀, budgets T=30, K=5, M=2

θ ← baseline config
φ ← φ₀
best_val_bpb ← evaluate(θ)   # establish the baseline score
t ← 0
outer_cycle ← 0

while t < T:
    # Level 1: Inner autoresearch iteration
    proposal ← LLM.propose(train.py, editable_params, frozen_params, guidance)
    val_bpb ← train_and_evaluate(proposal, budget=300s)
    if val_bpb < best_val_bpb:
        θ ← proposal  # Accept
        best_val_bpb ← val_bpb

    t ← t + 1

    # Level 1.5: Outer search strategy (every K iterations)
    if t mod K == 0:
        search_config ← LLM.analyze_trace(proposal_history, outcome_history)
        update(frozen_params, editable_params, guidance, search_config)
        outer_cycle ← outer_cycle + 1

        # Level 2: Mechanism research (every M outer cycles)
        if outer_cycle mod M == 0:
            mechanism_code ← LLM.research_session(
                runner_source=φ,
                search_trace=trace,
                rounds=[EXPLORE, CRITIQUE, SPECIFY, GENERATE]
            )
            backup ← copy(φ)
            try:
                φ ← inject(mechanism_code)
                importlib.import_module(φ)  # Validate
            except:
                φ ← backup  # Revert on failure

return θ, best_val_bpb

Key Architectural Properties

  1. Same-model recursion. All levels use the same DeepSeek model. Improvements must come from structural advantage, not from a more capable meta-model.

  2. Code as search space. The outer level searches over programs, not parameters. This is categorically different from classical bilevel optimization where both levels optimize over continuous spaces.

  3. Validate-and-revert safety. Level 2 code injection uses importlib validation with automatic rollback. Failed injections are non-destructive.

  4. Separation of concerns. Level 1 cannot change its own structure. Level 1.5 can redirect search but not restructure it. Only Level 2 can perform structural modifications.


10 Component Breakdown

Component 1: Level 1 Inner Loop (Standard Autoresearch)

Function: Implements the canonical Karpathy autoresearch cycle

Configuration:

Parameter Value
Iteration budget 30 per repeat
Training budget 300 seconds per evaluation
Acceptance criterion val_bpb < current_best
Frozen architecture params DEPTH=8, ASPECT_RATIO=64
Editable params LR, WEIGHT_DECAY, WINDOW_PATTERN, HEAD_DIM, TOTAL_BATCH_SIZE, etc.

Behavior: Hill climbing in configuration space where the LLM's world knowledge serves as an implicit prior over promising changes. Training outcomes provide gradient-free feedback.

Limitations: The LLM exhibits mode-seeking behavior—repeatedly proposing similar changes around local optima. This is the fundamental bottleneck that Level 2 addresses.

Component 2: Level 1.5 Search Strategy Loop

Function: Redirects inner loop search without changing its structure

Trigger: Every 5 inner iterations

Operations:

  1. Freeze parameters proposed ≥3 times with zero net improvement
  2. Unfreeze parameters frozen but unexplored in new search regions
  3. Inject a guidance string prioritizing under-explored parameters

What Level 1.5 CANNOT do:

  • Change proposal generation logic
  • Modify the acceptance criterion
  • Alter the loop structure
  • Add new tools or mechanisms

Empirical result: Level 1.5 increases search diversity (Group B explores more parameters than Group A) but this diversity does not translate into larger improvements. In one repeat (B-R1), Level 1.5 achieved zero improvement—the worst outcome of any repeat in any group.

Component 3: Level 2 Mechanism Research

Function: Generates and injects new search mechanisms as Python code

Trigger: Every 2 outer cycles (every 10 inner iterations)

4-Round Research Session:

Round Name Input Output LLM Capability
1 Explore runner.py source + search trace Survey of candidate mechanisms from adjacent fields Cross-domain knowledge
2 Critique Candidate mechanisms from Round 1 Evaluation against observed failure modes; selection of best Analytical reasoning
3 Specify Selected mechanism Interface spec: class name, constructor args, method signatures, integration points Software design
4 Generate Specification from Round 3 Complete, runnable Python code + runner.py modifications Code generation

Safety mechanisms:

  • Pre-injection backup of runner.py
  • importlib dynamic-loading validation
  • Automatic revert on import failure
  • No constraint on external imports (vulnerability: an sklearn dependency caused one revert)

Success rate: 5 of 6 mechanisms passed import validation and were activated (83.3%)

Component 4: Validate-and-Revert Pipeline

import importlib
import shutil

def inject_mechanism(patched_code, expected_class):
    shutil.copy("runner.py", "runner.py.bak")        # Pre-injection backup
    try:
        with open("runner.py", "w") as f:
            f.write(patched_code)
        module = importlib.import_module("runner")
        importlib.reload(module)                     # Force re-execution of new code
        assert hasattr(module, expected_class)       # Verify expected interface exists
        return "SUCCESS"
    except Exception:
        shutil.copy("runner.py.bak", "runner.py")    # Automatic revert
        return "REVERTED"

Known vulnerability: A preliminary run was invalidated by a sys.modules registration bug that caused silent fallback—mechanisms appeared injected but the original runner actually executed. This was fixed before reported results, but highlights the danger of silent failure modes in runtime code injection.
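
A plausible sketch of this failure class (an assumption about the bug's shape, not the actual repo code): if a stale entry lingers in sys.modules, importlib returns the old module object and the newly written code silently never runs. Evicting the cached module and invalidating finder caches are both needed for a reliable reload of a file rewritten at runtime:

```python
import importlib
import sys

def reload_runner():
    # Evict any cached module so the import below re-executes runner.py;
    # without this, import_module returns the stale module object.
    sys.modules.pop("runner", None)
    # Finder caches can also hide a file created or rewritten at runtime.
    importlib.invalidate_caches()
    return importlib.import_module("runner")
```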

Component 5: Search Trace Accumulator

All proposal–outcome pairs are recorded in a chronological trace:

trace = [
  {iter: 1, params: {LR: 3e-4}, hypothesis: "...", val_bpb: 1.105, accepted: True},
  {iter: 2, params: {LR: 1e-4}, hypothesis: "...", val_bpb: 1.110, accepted: False},
  ...
]

This trace serves as the primary input for both Level 1.5 (search strategy analysis) and Level 2 (mechanism research). The quality of the trace directly affects the quality of meta-optimization.


11 Core Mechanisms (Detailed)

Mechanism 1: Tabu Search Manager (Discovered by Level 2)

Source domain: Combinatorial optimization

Discovered in: Group C, Repeat 1

Result: −0.065 val_bpb improvement (best single outcome across all groups)

How it works: Maintains a list of recently-visited configurations (the "tabu list"). When the inner-loop LLM proposes a change, the Tabu Search Manager checks whether the proposed configuration is "too close" to any recently-visited one. If so, it either rejects the proposal or perturbs it to force movement away from the tabu region.

Why it helps: The LLM's default search behavior exhibits strong recency bias—it tends to propose changes similar to recent successful ones, leading to rapid convergence to local optima. Tabu search explicitly breaks this pattern by forbidding revisits, forcing the agent to explore structurally different configurations.

from collections import deque
import random

class TabuSearchManager:
    def __init__(self, tabu_tenure=5, threshold=0.1):
        self.tabu_list = deque(maxlen=tabu_tenure)   # recently visited configs
        self.threshold = threshold

    def _distance(self, a, b):
        # Normalized L1 distance over shared numeric parameters (illustrative choice)
        keys = set(a) & set(b)
        if not keys:
            return float("inf")
        return sum(abs(a[k] - b[k]) / (abs(b[k]) + 1e-9) for k in keys) / len(keys)

    def _perturb(self, config):
        # Nudge one parameter to push the proposal out of the tabu region
        key = random.choice(list(config))
        perturbed = dict(config)
        perturbed[key] = perturbed[key] * random.choice([0.5, 2.0])
        return perturbed

    def is_tabu(self, proposed_config):
        return any(self._distance(proposed_config, t) < self.threshold
                   for t in self.tabu_list)

    def update(self, accepted_config):
        self.tabu_list.append(accepted_config)

    def filter_proposal(self, proposal):
        if self.is_tabu(proposal):
            return self._perturb(proposal)  # Force exploration
        return proposal

Mechanism 2: Multi-Scale Bandit Proposer (Discovered by Level 2)

Source domain: Online learning / multi-armed bandits (MAB)

Discovered in: Group C, Repeat 2

Result: −0.011 val_bpb improvement (modest, comparable to Group A baseline)

How it works: Treats each editable parameter as a "bandit arm." Maintains per-parameter statistics (number of times proposed, cumulative improvement, UCB score). At each iteration, selects which parameters to modify based on Upper Confidence Bound exploration-exploitation tradeoff.

Why it underperformed: While theoretically sound, the Multi-Scale Bandit may have been less effective at forcing exploration of the batch size dimension specifically. The UCB exploration bonus may not have been calibrated aggressively enough for the small iteration budget (30 iterations).

from math import log, sqrt

class MultiScaleBanditProposer:
    def __init__(self, params):
        self.counts = {p: 0 for p in params}
        self.rewards = {p: 0.0 for p in params}

    def update(self, param, reward):
        # Record one proposal outcome (reward = val_bpb improvement achieved)
        self.counts[param] += 1
        self.rewards[param] += reward

    def select_params(self, n_select=2):
        ucb_scores = {}
        total = sum(self.counts.values()) + 1
        for p in self.counts:
            if self.counts[p] == 0:
                ucb_scores[p] = float('inf')  # Explore unseen parameters first
            else:
                exploit = self.rewards[p] / self.counts[p]
                explore = sqrt(2 * log(total) / self.counts[p])
                ucb_scores[p] = exploit + explore
        return sorted(ucb_scores, key=ucb_scores.get, reverse=True)[:n_select]

Mechanism 3: Orthogonal Exploration (Discovered by Level 2)

Source domain: Design of experiments (DOE)

Discovered in: Group C, Repeat 3

Result: −0.058 val_bpb improvement (strong, second-best overall)

How it works: Generates orthogonal arrays of parameter combinations, ensuring that each pair of parameters is explored independently. This prevents the confounding that occurs when the LLM changes multiple correlated parameters simultaneously.

Why it helps: LLMs tend to propose "package deals"—changing multiple parameters together based on folk wisdom (e.g., "if you increase batch size, decrease learning rate"). Orthogonal exploration breaks these correlations, enabling the system to discover that some "conventional wisdom" parameter couplings are suboptimal for the specific task.
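
As an illustration of the DOE idea (my sketch, not the generated mechanism's actual code), a two-level L4 orthogonal array covers every pair of three parameters with all four low/high combinations in only 4 runs, instead of the 2³ = 8 runs of a full factorial:

```python
# L4 orthogonal array: 4 runs x 3 two-level factors; every pair of columns
# contains each of the four (low, high) combinations exactly once.
L4 = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

def orthogonal_proposals(levels):
    """levels: {param: (low_value, high_value)} for exactly three parameters."""
    names = list(levels)
    return [
        {name: levels[name][bit] for name, bit in zip(names, row)}
        for row in L4
    ]

# Hypothetical low/high values for three of the editable parameters:
proposals = orthogonal_proposals({
    "LR": (1e-4, 3e-4),
    "WEIGHT_DECAY": (0.0, 0.1),
    "TOTAL_BATCH_SIZE": (128, 256),
})
```

Because the columns vary independently across runs, each parameter's effect can be estimated without the confounding introduced when correlated "package deal" changes move several parameters at once.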

Mechanism 4: GP Regressor (Attempted by Level 2)

Source domain: Bayesian optimization

Discovered in: Group D, Repeat 2

Result: Reverted (required sklearn, which was not installed)

Significance: Demonstrates that Level 2 has no constraint on external dependencies. The validate-and-revert mechanism correctly handled the failure, but the exposure to arbitrary import requirements is a reliability risk.

Meta-Pattern: Why Generated Mechanisms Work

The paper identifies a unifying explanation for why Level 2 mechanisms improve over Level 1:

The generated mechanisms succeed by breaking the inner loop's deterministic search patterns, forcing exploration of directions the LLM's priors systematically avoid.

The LLM's implicit prior—trained on vast amounts of ML literature—creates strong biases about which hyperparameter changes are "reasonable." These biases cause:

  1. Parameter fixation: The LLM repeatedly proposes changes to the same parameters
  2. Correlation assumptions: Changes are proposed in conventional bundles (LR↓ when batch↑)
  3. Narrow exploration radius: Changes are conservative, staying close to known-good values
  4. Mode collapse: After a few successful changes, the LLM converges on a single search direction

Level 2 mechanisms break each of these patterns:

- Tabu search prevents fixation (memory of visited configs)
- Orthogonal exploration breaks correlations (independent parameter variation)
- Bandit selection addresses fixation (UCB forces under-explored parameters)

The Mechanism-vs-Parameter Distinction

The paper's sharpest empirical finding is the contrast between Level 1.5 and Level 2:

| Adjustment Type | Level | What Changes | Mean Improvement |
|---|---|---|---|
| Parameter-level | 1.5 | Which params are explored, in what order | −0.006 (≈ no gain) |
| Mechanism-level | 2 | How proposals are generated and filtered | −0.045 (5× gain) |

Adjusting what the inner loop searches is insufficient. Changing how it searches is necessary.

This distinction has profound implications for autoresearch system design: the search mechanism is the bottleneck, not the search parameters.


12 Programming Language

| Component | Language | Framework/Library |
|---|---|---|
| Bilevel framework | Python | Custom |
| Level 1 inner loop | Python | Custom (extends Karpathy autoresearch) |
| Level 1.5 strategy loop | Python | Custom |
| Level 2 mechanism research | Python | Custom + importlib |
| Generated mechanisms | Python | Various (Tabu, Bandit, DOE, attempted sklearn) |
| Task (GPT training) | Python | PyTorch |
| LLM interface | Python | DeepSeek API |

The entire system is pure Python with no multi-language complexity. The code injection mechanism relies on Python's dynamic nature (importlib, sys.modules, runtime patching).

Code Structure (Inferred from Paper + Repository)

bilevel-autoresearch/
├── runner.py                # Level 1 inner loop (target of L2 injection)
├── train.py                 # GPT training script (target of L1 optimization)
├── level15_strategy.py      # Level 1.5 search strategy logic
├── level2_research.py       # Level 2 mechanism research (4-round dialogue)
├── code_injector.py         # Validate-and-revert injection pipeline
├── search_trace.py          # Trace accumulator for proposals/outcomes
├── config.py                # SearchConfig dataclass
├── experiments/             # Experiment logs (all 12 runs)
│   ├── group_a/             # Level 1 only
│   ├── group_b/             # Level 1 + 1.5
│   ├── group_c/             # Level 1 + 1.5 + 2
│   └── group_d/             # Level 1 + 2
├── generated_mechanisms/    # Python modules generated by Level 2
│   ├── tabu_search.py       # Tabu Search Manager
│   ├── bandit_proposer.py   # Multi-Scale Bandit
│   └── orthogonal.py        # Orthogonal Exploration
└── README.md

Dynamic Code Injection Pattern

The runtime code injection is the most technically interesting aspect of the implementation:

import importlib
import sys

# Level 2 generates mechanism_code as a string
mechanism_code = llm.generate_mechanism(runner_source, trace)

# Back up the current runner before overwriting it
with open("runner.py") as f:
    backup = f.read()

# Patch runner.py with generated code
with open("runner.py", "w") as f:
    f.write(mechanism_code)

# Validate via importlib
try:
    # Evict the cached module so the reimport reads the new source
    if "runner" in sys.modules:
        del sys.modules["runner"]
    module = importlib.import_module("runner")
    # Verify expected interface exists
    assert hasattr(module, "run_inner_loop")
    print("Mechanism injected successfully")
except Exception as e:
    # Revert on any failure
    with open("runner.py", "w") as f:
        f.write(backup)
    print(f"Mechanism injection failed: {e}, reverted")

Critical subtlety: The sys.modules cleanup is essential. Without it, Python's module caching returns the old module on reimport, causing the "silent fallback" bug that invalidated a preliminary run.
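The caching behavior is easy to reproduce in isolation. This standalone sketch (the module name `demo_runner` is hypothetical) shows both the stale import and the fix:

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid stale .pyc complications in this demo
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
path = os.path.join(workdir, "demo_runner.py")

with open(path, "w") as f:
    f.write("VERSION = 1\n")
old = importlib.import_module("demo_runner")

# Rewrite the module on disk, as the injection pipeline does.
with open(path, "w") as f:
    f.write("VERSION = 2\n")

# A plain re-import hits sys.modules and silently returns the old code.
cached = importlib.import_module("demo_runner")
print(cached.VERSION)  # still 1 -- the "silent fallback"

# Evicting the cache entry forces a genuine reload of the new source.
del sys.modules["demo_runner"]
fresh = importlib.import_module("demo_runner")
print(fresh.VERSION)  # 2
```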


13 Memory Management

Search Trace as Working Memory

The search trace serves as the system's primary working memory. It accumulates all proposals and outcomes across inner iterations:

Trace structure:
[
  {
    iteration: int,
    params_changed: dict[str, Any],
    hypothesis: str,
    val_bpb_before: float,
    val_bpb_after: float,
    accepted: bool,
    frozen_params: list[str],
    guidance: str
  },
  ...
]
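In code, each entry could be a small dataclass. Field names follow the schema above, but the class itself is a sketch, not the repo's actual `search_trace.py`:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TraceEntry:
    """One inner-loop iteration record (fields mirror the trace schema)."""
    iteration: int
    params_changed: dict[str, Any]
    hypothesis: str
    val_bpb_before: float
    val_bpb_after: float
    accepted: bool
    frozen_params: list[str] = field(default_factory=list)
    guidance: str = ""

    @property
    def improvement(self) -> float:
        # Negative = better (val_bpb decreased).
        return self.val_bpb_after - self.val_bpb_before

trace: list[TraceEntry] = []
trace.append(TraceEntry(
    iteration=1,
    params_changed={"lr": 6e-4},
    hypothesis="Higher LR converges faster within the 300s budget",
    val_bpb_before=1.234,
    val_bpb_after=1.221,
    accepted=True,
))
```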

Memory growth: Linear in the number of iterations. With 30 iterations per repeat, the trace remains manageable within the LLM's context window.

Level-Specific Memory

| Level | Memory Type | Content | Lifetime |
|---|---|---|---|
| Level 1 | Current best config | train.py snapshot | Persist across iterations |
| Level 1 | Proposal history | Implicit in LLM context | Per-call (no explicit memory) |
| Level 1.5 | Search trace | Full proposal/outcome history | Persist across cycles |
| Level 1.5 | SearchConfig | Frozen/unfrozen params + guidance | Updated every 5 iterations |
| Level 2 | Runner source code | Current runner.py | Persist until next injection |
| Level 2 | Research session | 4-round dialogue context | Per-session (not persistent) |

No Cross-Run Memory

Unlike EvoScientist (which maintains persistent experience memory across runs), Bilevel Autoresearch has no cross-run learning. Each repeat starts fresh. This is both a limitation and a feature:

Limitation: The system cannot learn from prior experiment outcomes. Each repeat must rediscover effective mechanisms from scratch.

Feature: Results are independent, enabling clean statistical comparison across groups.

Context Window Pressure

The main memory bottleneck is the LLM context window at Level 2. The 4-round research dialogue must fit:

  1. Full runner.py source code
  2. Complete search trace (up to 30 iterations)
  3. Round 1–3 outputs (survey, critique, specification)
  4. Generated code (Round 4)

With DeepSeek's large context window, this is feasible for 30-iteration traces. Scaling to hundreds of iterations would require trace summarization or selective inclusion.
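One plausible summarization scheme, sketched here (not taken from the paper), is selective inclusion: always keep the earliest entries, every accepted proposal, and a recent window, and report only a count for the rest:

```python
def summarize_trace(trace, keep_recent=5, keep_first=2):
    """Selective inclusion for long traces.

    Keeps early context, all accepted proposals, and the most recent
    entries; everything else is compressed to a drop count.
    `trace` is a list of dicts with at least 'iteration' and 'accepted'.
    """
    n = len(trace)
    keep = set(range(min(keep_first, n)))            # early context
    keep |= set(range(max(0, n - keep_recent), n))   # recent window
    keep |= {i for i, e in enumerate(trace) if e["accepted"]}
    kept = [trace[i] for i in sorted(keep)]
    return {"kept": kept, "dropped": n - len(kept)}
```

For a 300-iteration trace this keeps the informative minority of entries while bounding prompt size roughly by the number of accepted proposals plus the window sizes.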

GPU Memory

| Component | GPU Memory |
|---|---|
| GPT model (50M params) | ~200MB (BF16) |
| Training batch | Variable (batch size is editable) |
| RTX 5090 total | 32GB |
| Available headroom | >30GB |

GPU memory is not a bottleneck. The 50M-parameter model is tiny relative to the RTX 5090's capacity.


14 Continued Learning

Current State: No Continued Learning

Bilevel Autoresearch does not implement any form of continued learning. Each repeat is independent. The mechanisms generated by Level 2 are not accumulated across repeats or sessions.

Comparison with EvoScientist

| Feature | EvoScientist | Bilevel Autoresearch |
|---|---|---|
| Cross-run memory | ✅ Persistent experience memory | ❌ No cross-run learning |
| Mechanism accumulation | ❌ Fixed mechanisms | ❌ Per-run mechanisms |
| Learning what works | ✅ Summarized lessons | ❌ Fresh start each run |

Potential Continued Learning Extensions

1. Mechanism Library

The most natural extension: maintain a library of successfully generated mechanisms across runs.

mechanism_library = {
    "tabu_search": {"code": "...", "success_count": 3, "avg_improvement": -0.055},
    "bandit": {"code": "...", "success_count": 1, "avg_improvement": -0.011},
    "orthogonal": {"code": "...", "success_count": 2, "avg_improvement": -0.047},
}

# Level 2 could:
# 1. Select from library (exploitation)
# 2. Generate new mechanism (exploration)
# 3. Combine/modify library mechanisms (crossover)

This would convert Level 2 from pure generation to a bandit over mechanism portfolios—a form of meta-meta-optimization.
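A minimal sketch of that bandit layer (the library structure and `select_mechanism` are hypothetical; reward is taken as −avg_improvement, so larger is better since lower val_bpb is the goal):

```python
from math import log, sqrt

def select_mechanism(library, c=2.0):
    """UCB1 over a mechanism library.

    Exploits mechanisms with strongly negative avg_improvement (big
    val_bpb reductions) while forcing trials of under-explored ones.
    `library` maps name -> {"success_count": int, "avg_improvement": float}.
    """
    total = sum(m["success_count"] for m in library.values()) or 1

    def ucb(item):
        _, m = item
        n = m["success_count"]
        if n == 0:
            return float("inf")  # untried mechanisms are selected first
        return -m["avg_improvement"] + sqrt(c * log(total) / n)

    return max(library.items(), key=ucb)[0]

library = {
    "tabu_search": {"success_count": 3, "avg_improvement": -0.055},
    "bandit": {"success_count": 1, "avg_improvement": -0.011},
    "orthogonal": {"success_count": 2, "avg_improvement": -0.047},
}
print(select_mechanism(library))  # "bandit": least tried, largest exploration bonus
```

Note the exploration bonus dominates here: the weakest mechanism is picked because it has only one trial, which is exactly the behavior that prevents premature convergence on tabu_search.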

2. Cross-Run Search Trace Accumulation

Aggregate traces across runs to give Level 2 richer context about which search patterns succeed:

accumulated_traces = [
    {"run_id": 1, "mechanism": "tabu_search", "improvement": -0.065, ...},
    {"run_id": 2, "mechanism": "bandit", "improvement": -0.011, ...},
    ...
]

Level 2 could then condition its mechanism generation on historical successes.

3. Mechanism Evolution

Apply evolutionary strategies to the mechanism library:

- Mutation: Level 2 generates variants of successful mechanisms
- Crossover: Combine features from multiple mechanisms (e.g., Tabu + DOE)
- Selection: Retain mechanisms that consistently improve across diverse tasks
- Population: Maintain a diverse population of active mechanisms

This would make the outer loop itself evolutionary—a natural extension of the bilevel framework.

4. Multi-Task Transfer

The current system is validated on one task. Continued learning across multiple tasks would test whether mechanisms generalize:

- Train a GPT model at different scales (50M, 100M, 500M)
- Optimize different architectures (Transformer, Mamba, RWKV)
- Apply to non-ML tasks with measurable objectives

Mechanisms that work across tasks would be "meta-transferable"—a strong form of learned optimization.

5. Bootstrapping from Prior Sessions

A practical extension for iterative research:

Session 1: Generate mechanisms, select best → persist
Session 2: Start from best mechanism of Session 1 → refine
Session 3: Start from best mechanism of Session 2 → refine
...

This sequential refinement could compound improvements across sessions, though diminishing returns are likely.

The Self-Improvement Horizon

The paper's central thesis—"autoresearch can meta-autoresearch itself"—implies a natural question: can the outer loop meta-optimize the outer loop? A Level 3 that generates improvements to Level 2's research methodology would be a third level of recursion.

The authors do not explore this direction, but the principle extends: if Level 2 can generate search mechanisms, a Level 3 could generate Level 2 research session structures (e.g., changing the 4-round dialogue to a different format, adding constraint-checking rounds, etc.).

The practical limit is likely context window exhaustion—each additional level adds overhead that compresses the effective budget for the inner level.


15 Applications

Direct Applications

1. Neural Architecture Search (NAS)

The bilevel framework naturally extends to NAS:

- Inner loop: Evaluate candidate architectures
- Outer loop: Generate search strategies for the architecture space
- The batch size discovery in the paper (where Level 2 mechanisms unlocked batch size as a critical parameter) demonstrates this potential

2. Hyperparameter Optimization at Scale

For organizations running many training jobs:

- Deploy Level 2 to discover task-specific optimization strategies
- Accumulate a mechanism library across projects
- Select or generate mechanisms based on task similarity

3. AutoML Pipeline Design

The bilevel principle extends beyond hyperparameters:

- Inner loop: Run an AutoML pipeline (feature selection, model selection, ensembling)
- Outer loop: Generate improvements to the pipeline logic itself

4. Reinforcement Learning Algorithm Discovery

RL algorithms are essentially search mechanisms over policy space:

- Inner loop: Standard RL training (PPO, SAC, etc.)
- Outer loop: Generate modifications to the RL algorithm (reward shaping, exploration strategies, curriculum design)

5. Scientific Experiment Design

For automated laboratories:

- Inner loop: Run experiments according to a design strategy
- Outer loop: Generate better experimental design strategies based on outcomes

Methodological Applications

6. Understanding LLM Search Biases

The paper provides empirical evidence of specific LLM search biases:

- Recency bias: Proposals cluster near recent successful configs
- Parameter fixation: Some parameters are over-explored while others are ignored
- Correlation assumptions: Parameters are changed in "conventional" bundles

These findings inform the design of any LLM-guided optimization system: the LLM's priors are both its strength (domain knowledge) and its weakness (systematic blind spots).

7. Mechanism Design for Evolutionary Systems

The generated mechanisms (Tabu, Bandit, DOE) provide a vocabulary of diversity-injection strategies that apply broadly to evolutionary and LLM-guided optimization:

| Mechanism | Diversity Injection Pattern | Applicable To |
|---|---|---|
| Tabu Search | Memory-based visit avoidance | Any iterative search |
| UCB Bandit | Exploration-exploitation tradeoff | Multi-objective optimization |
| Orthogonal Exploration | Decorrelation of parameter changes | High-dimensional search |
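The memory-based visit avoidance pattern is simple enough to state generically. This `TabuList` is an illustrative sketch of the idea, not the system's generated Tabu Search Manager:

```python
from collections import deque

class TabuList:
    """Memory-based visit avoidance: reject configurations explored in
    the last `tenure` iterations, breaking parameter fixation."""

    def __init__(self, tenure=10):
        self.recent = deque(maxlen=tenure)  # old entries age out automatically

    def is_tabu(self, config):
        return self._key(config) in self.recent

    def visit(self, config):
        self.recent.append(self._key(config))

    @staticmethod
    def _key(config):
        # Order-insensitive, hashable key for a config dict.
        return tuple(sorted(config.items()))

tabu = TabuList(tenure=2)
tabu.visit({"lr": 3e-4, "batch": 32})
tabu.visit({"lr": 6e-4, "batch": 32})
print(tabu.is_tabu({"batch": 32, "lr": 3e-4}))  # True (key order-insensitive)
tabu.visit({"lr": 1e-3, "batch": 64})           # evicts the oldest entry
print(tabu.is_tabu({"lr": 3e-4, "batch": 32}))  # False (aged out)
```

The bounded tenure matters: a permanent blacklist would eventually forbid the whole search space, while a sliding window only discourages short-term revisiting.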

Limitations for Applications

  1. Single benchmark: Generalization beyond GPT-50M pretraining is unproven
  2. Small sample size: n=3 is insufficient for production deployment decisions
  3. Code injection fragility: Silent fallback failures are dangerous in production
  4. No dependency management: Generated code can import arbitrary libraries
  5. Prompt-induced domain bias: Level 2 prompt suggests specific domains, limiting discovery scope
  6. No safety guarantees: Generated mechanisms could cause training instability, resource exhaustion, or other unintended effects
  7. LLM cost at scale: Each Level 2 session requires 4 LLM calls; at high outer-loop frequency, API costs accumulate

Broader Significance for the Autoresearch Field

Bilevel Autoresearch makes a conceptual contribution that extends beyond its specific results:

The fundamental bottleneck in autoresearch is not the inner loop's search parameters but its search mechanism. Improving how the system searches is categorically more valuable than improving what it searches. This insight—that mechanism design trumps parameter tuning—aligns with decades of optimization theory but had not been empirically demonstrated in the context of LLM-guided autoresearch.

The paper establishes a new axis of variation for autoresearch systems:

| Axis | Examples | Level of Innovation |
|---|---|---|
| Task parameters | Hyperparameters, architecture choices | Low (standard optimization) |
| Search parameters | Temperature, exploration rate, batch size | Medium (meta-parameter tuning) |
| Search mechanisms | Proposal logic, acceptance criteria, memory | High (algorithm design) |

Prior systems operate on the first two axes. Bilevel Autoresearch opens the third axis to autonomous optimization.


| System | Inner Loop | Outer Loop | Meta-Optimization Target | LLM at Meta Level | Code Injection |
|---|---|---|---|---|---|
| Karpathy autoresearch | Propose-evaluate-accept | ❌ None | N/A | N/A | N/A |
| AutoResearchClaw | Multi-batch parallel | ❌ None (human-designed) | N/A | N/A | N/A |
| EvoScientist | With experience memory | ❌ None (human-designed) | N/A | N/A | N/A |
| FunSearch | LLM program generation | Evolutionary selection | Task programs | Same LLM | ❌ (evolves task code) |
| Bilevel Autoresearch | Standard autoresearch | ✅ LLM-driven | Search mechanism code | Same LLM | ✅ Runtime injection |
| AlphaEvolve | LLM mutation + eval | MAP-Elites database | Task programs | Ensemble | ❌ (evolves task code) |
| OpenEvolve | LLM mutation + eval | Program database | Task programs | Ensemble | ❌ (evolves task code) |

The critical distinction: FunSearch and AlphaEvolve evolve task-level programs. Bilevel Autoresearch evolves the search mechanism itself. The target of evolution is one level of abstraction higher—it optimizes the optimizer rather than the solution.


Appendix: Detailed Experimental Configuration

Training Configuration

| Parameter | Value | Editable? |
|---|---|---|
| DEPTH | 8 | ❌ Frozen |
| ASPECT_RATIO | 64 | ❌ Frozen |
| LR | Variable | ✅ |
| WEIGHT_DECAY | Variable | ✅ |
| WINDOW_PATTERN | Variable | ✅ |
| HEAD_DIM | Variable | ✅ |
| TOTAL_BATCH_SIZE | Variable | ✅ |
| Training budget | 300 seconds | Fixed |
| Model size | 50M parameters | Fixed |
| Hardware | RTX 5090 | Fixed |

Ablation Group Configuration

| Group | Level 1 | Level 1.5 | Level 2 | Inner Iters | Outer Period | L2 Period |
|---|---|---|---|---|---|---|
| A | ✅ | ❌ | ❌ | 30 | N/A | N/A |
| B | ✅ | ✅ | ❌ | 30 | K=5 | N/A |
| C | ✅ | ✅ | ✅ | 30 | K=5 | M=2 |
| D | ✅ | ❌ | ✅ | 30 | N/A | M=2 (triggered by iteration count) |

Complete Results Table

| Group | Repeat | Δval_bpb | Mechanism Injected | Notes |
|---|---|---|---|---|
| A | R1 | −0.011 | N/A | Consistent small improvement |
| A | R2 | −0.009 | N/A | Consistent small improvement |
| A | R3 | −0.007 | N/A | Consistent small improvement |
| B | R1 | −0.000 | N/A | L1.5 froze stalled params; LLM found nothing better |
| B | R2 | −0.012 | N/A | Best in Group B |
| B | R3 | −0.005 | N/A | Moderate |
| C | R1 | −0.065 | Tabu Search Manager | Best overall; mechanism broke fixation |
| C | R2 | −0.011 | Multi-Scale Bandit | Modest; bandit less effective for this task |
| C | R3 | −0.058 | Orthogonal Exploration | Strong; decorrelated parameter search |
| D | R1 | −0.065 | (Mechanism not named) | Matched C-R1's best |
| D | R2 | −0.001 | GP Regressor (reverted) | sklearn missing → fallback to L1 |
| D | R3 | −0.036 | (Mechanism not named) | Moderate improvement |