NeoSigma: Self-Improving Agentic Systems
Part: Self-Improving Agent Systems
15.1 Overview & Motivation
Large-language-model-powered agents — code assistants, research copilots, autonomous software engineers — are now deployed across thousands of organizations. A persistent operational challenge is that these agents degrade silently: prompts that worked with one model version fail after a provider update; scaffolding code that handled common cases breaks on novel edge cases; evaluation harnesses that once caught regressions go stale as the task distribution shifts. The cost of manually maintaining an agentic system can rival or exceed the cost of building it in the first place.
Self-improving agentic systems address this problem by closing the loop between production failure observation and automated repair. The core insight is that a deployed agent generates a continuous stream of rich behavioral data — traces, tool calls, user feedback, evaluation outcomes — and that this data can be mined, prioritized, and fed back into the agent's own configuration to drive autonomous improvement. This pattern has been explored in various forms: Reflexion (Shinn et al., 2023) introduced verbal self-reflection; ExpeL (Zhao et al., 2023) demonstrated experience-driven learning; DSPy (Khattab et al., 2023) pioneered programmatic prompt optimization; and production-grade systems like SWE-agent and Devin incorporate feedback loops of varying sophistication.
What has been missing is a unified treatment of how these patterns — failure mining, prompt optimization, harness optimization, regression testing, and production feedback — compose into a coherent autonomous improvement cycle. This chapter proposes NeoSigma as a reference architecture for such composition. NeoSigma is not a single codebase; it is a design pattern language for self-improving agents, synthesized from the systems surveyed across Parts II–IV of this book. Its value lies in making explicit the design decisions, mathematical properties, and failure modes that arise when these components interact.
15.2 Architecture
The NeoSigma architecture is organized around a central improvement controller that orchestrates five subsystems, each responsible for one phase of the autonomous improvement cycle. The controller consumes signals from production, dispatches improvement tasks to the appropriate subsystem, validates proposed changes against regression tests, and gates deployment of approved updates.
Figure 15.1: NeoSigma reference architecture. Solid arrows represent data flow; dashed arrows represent gating signals. The improvement controller orchestrates the five subsystems and gates all changes through regression testing before deployment.
15.2.1 Component Responsibilities
Trace Store. Captures structured execution traces from the production agent: tool invocations, LLM prompts and completions, intermediate reasoning steps, evaluation outcomes, error stack traces, and user feedback signals. Traces are indexed by task type, outcome status, and timestamp. This component is analogous to structured logging in traditional software systems, but enriched with LLM-specific metadata such as token counts, model identifiers, and prompt templates used.
Failure Miner. Analyzes the trace store to identify, classify, and prioritize failure patterns. Rather than treating each failure as an isolated event, the miner clusters failures by root cause, estimates the frequency and severity of each cluster, and produces a ranked list of improvement targets. This is the system's diagnostic engine.
Prompt Optimizer. Given a failure pattern and the current prompt configuration, searches for prompt modifications that resolve the failure while maintaining performance on other tasks. Uses evolutionary or gradient-free optimization over the prompt space, guided by automated evaluation.
Harness Optimizer. Improves the evaluation and scaffolding infrastructure itself. When the failure miner identifies gaps in test coverage or when the prompt optimizer cannot make progress because the evaluation signal is too coarse, the harness optimizer generates new test cases, refines scoring functions, or adjusts the agent's tool-use scaffolding.
Regression Tester. Validates that any proposed change — to prompts, harness, or scaffolding — does not degrade performance on the existing task distribution. Uses statistical hypothesis testing to detect regressions before changes reach production.
Deployment Gate. A policy layer that enforces safety constraints: maximum change rate, human-approval thresholds for high-risk modifications, rollback triggers, and budget limits on improvement-cycle compute.
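The gate's policy checks can be sketched as a small decision function. This is a minimal illustration rather than a published NeoSigma implementation: the `GatePolicy` fields, the decision labels, and the order of the checks are all assumptions chosen for readability, and rollback triggers are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Hypothetical policy knobs; names and units are illustrative."""
    max_changes_per_day: int           # maximum change rate
    auto_deploy_risk_threshold: float  # above this, a human must approve
    max_cycle_budget: float            # compute budget per improvement cycle


def gate_decision(policy: GatePolicy, risk_level: float,
                  changes_today: int, cycle_spend: float) -> str:
    """Return 'reject_budget', 'defer', 'human_review', or 'auto_deploy'."""
    if cycle_spend > policy.max_cycle_budget:
        return "reject_budget"      # the cycle overspent its compute budget
    if changes_today >= policy.max_changes_per_day:
        return "defer"              # change-rate limit reached for today
    if risk_level > policy.auto_deploy_risk_threshold:
        return "human_review"       # high-risk change needs approval
    return "auto_deploy"


policy = GatePolicy(max_changes_per_day=2,
                    auto_deploy_risk_threshold=0.3,
                    max_cycle_budget=100.0)
decision = gate_decision(policy, risk_level=0.5, changes_today=0, cycle_spend=10.0)
```

Checking the budget first means an over-budget cycle is never queued for human review; a real deployment might reasonably order these checks differently.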
15.3 Failure Mining
Failure mining is the foundation of any self-improving system. Without accurate diagnosis of what is failing and why, optimization is blind search. The NeoSigma failure-mining pipeline operates in three stages: trace collection, failure clustering, and root-cause extraction.
15.3.1 Trace Collection and Failure Identification
Every agent execution produces a trace $\tau = (s_1, a_1, o_1, \ldots, s_T, a_T, o_T)$ where $s_t$ is the agent state at step $t$, $a_t$ is the action taken (tool call, LLM query, or internal reasoning step), and $o_t$ is the observed outcome. A trace is labeled as a failure if the terminal evaluation $\text{eval}(\tau) < \theta_{\text{pass}}$, where $\theta_{\text{pass}}$ is a task-specific acceptance threshold.
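As a concrete reading of this definition, the sketch below models a trace as a sequence of (state, action, outcome) steps plus a terminal score. The class and field names are hypothetical, and states, actions, and outcomes are reduced to strings for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    state: str    # s_t: agent state at step t
    action: str   # a_t: tool call, LLM query, or internal reasoning step
    outcome: str  # o_t: observed outcome


@dataclass(frozen=True)
class Trace:
    steps: tuple[Step, ...]
    eval_score: float      # eval(tau), the terminal evaluation
    pass_threshold: float  # theta_pass, task-specific

    @property
    def is_failure(self) -> bool:
        # Strict inequality, matching eval(tau) < theta_pass.
        return self.eval_score < self.pass_threshold


trace = Trace(steps=(Step("start", "run_tests", "2 failed"),),
              eval_score=0.4, pass_threshold=0.5)
```

Note that a trace scoring exactly at the threshold counts as a pass under the strict inequality.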
Not all failures are equally informative. A failure priority score combines frequency, severity, and recency:
$$\text{priority}(c) = f(c) \cdot \bigl(1 - \bar{s}(c)\bigr) \cdot e^{-\lambda \, \Delta t(c)}$$
where $c$ is a failure cluster, $f(c)$ is the number of traces assigned to cluster $c$ in the observation window, $\bar{s}(c)$ is the mean evaluation score of traces in $c$ (normalized to $[0,1]$ so that lower scores indicate higher severity), $\Delta t(c)$ is the time since the most recent failure in $c$, and $\lambda > 0$ is a decay rate that down-weights stale failure patterns. This formalization is author-derived; specific systems may weight these factors differently.
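A small worked example may help build intuition for how the three factors trade off. It assumes the multiplicative form used in the pipeline pseudocode of Section 15.3.2 (frequency times severity times exponential recency decay); the cluster numbers and the decay rate of 0.01 per hour are invented for illustration.

```python
import math


def failure_priority(frequency: int, mean_score: float,
                     hours_since_last: float, decay_rate: float) -> float:
    """Frequency x severity x recency, with severity = 1 - mean score."""
    severity = 1.0 - mean_score
    recency = math.exp(-decay_rate * hours_since_last)
    return frequency * severity * recency


# Cluster A: many mild failures, last seen three days ago.
priority_a = failure_priority(frequency=40, mean_score=0.6,
                              hours_since_last=72, decay_rate=0.01)
# Cluster B: few severe failures, last seen one hour ago.
priority_b = failure_priority(frequency=10, mean_score=0.1,
                              hours_since_last=1, decay_rate=0.01)
```

With these numbers the smaller but more severe and more recent cluster B outranks cluster A (roughly 8.9 versus 7.8), which shows that frequency alone does not determine the ranking.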
15.3.2 Failure Clustering
Raw failure traces are high-dimensional and heterogeneous. Clustering reduces them to actionable categories. The approach embeds each failure trace into a vector representation — using either the LLM's own embedding of the error context or a dedicated encoder — and applies density-based clustering (e.g., HDBSCAN) to group similar failures. Each cluster is then summarized by an LLM into a natural-language root-cause hypothesis.
This pattern appears in several systems surveyed in this book. ExpeL (Zhao et al., 2023) extracts "insights" from failed episodes; Reflexion (Shinn et al., 2023) generates verbal self-critiques; and production observability tools like Langfuse and LangSmith provide trace-level failure analysis. NeoSigma's contribution is to formalize the pipeline and connect it to downstream optimization.
# Pseudocode: no public implementation available
# Failure mining pipeline: cluster and prioritize failures
# Helpers such as current_time, embed_failure_context, hdbscan_cluster,
# group_by, llm_summarize_root_cause, and FailureCluster are hypothetical.
from datetime import timedelta
from math import exp
from statistics import mean

LAMBDA = 0.01  # recency decay rate per hour (illustrative value)

def mine_failures(trace_store, window_hours=168, min_cluster_size=3):
    """Extract and prioritize failure clusters from recent traces."""
    # Stage 1: Collect recent failures
    cutoff = current_time() - timedelta(hours=window_hours)
    failed_traces = [
        t for t in trace_store.query(since=cutoff)
        if t.eval_score < t.task.pass_threshold
    ]
    # Stage 2: Embed and cluster
    embeddings = [embed_failure_context(t) for t in failed_traces]
    cluster_labels = hdbscan_cluster(
        embeddings, min_cluster_size=min_cluster_size
    )
    # Stage 3: Summarize each cluster with an LLM
    clusters = group_by(failed_traces, cluster_labels)
    ranked = []
    for cluster_id, traces in clusters.items():
        if cluster_id == -1:
            # Assumes the wrapper keeps HDBSCAN's -1 label for noise points,
            # which do not form a cluster.
            continue
        summary = llm_summarize_root_cause(traces[:10])  # sample for cost
        frequency = len(traces)
        mean_score = mean(t.eval_score for t in traces)
        most_recent = max(t.timestamp for t in traces)
        # timedelta has no .hours attribute; convert from seconds instead
        hours_since = (current_time() - most_recent).total_seconds() / 3600
        recency = exp(-LAMBDA * hours_since)
        priority = frequency * (1.0 - mean_score) * recency
        ranked.append(FailureCluster(
            id=cluster_id,
            summary=summary,
            priority=priority,
            example_traces=traces[:5],
            count=frequency,
        ))
    return sorted(ranked, key=lambda c: c.priority, reverse=True)
15.3.3 Root-Cause Extraction
Clustering identifies what failed; root-cause extraction identifies why. The NeoSigma pattern uses the LLM itself as a diagnostic tool: given a cluster of failure traces, the LLM is prompted to identify the specific prompt instruction, tool-use pattern, or reasoning failure that caused the cluster. This produces a structured diagnosis that the prompt optimizer can act on directly.
A critical design decision is whether to extract root causes at the trace level (one diagnosis per failure) or the cluster level (one diagnosis per group). Trace-level extraction is more accurate but more expensive. Cluster-level extraction amortizes LLM cost across many failures but may miss heterogeneity within a cluster. The NeoSigma reference design defaults to cluster-level extraction with trace-level refinement when the cluster is large or high-priority.
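The default described above amounts to a simple routing rule. The sketch below is one possible reading of it; the function name, the return labels, and both cutoff values are assumptions, since the text does not say what counts as "large" or "high-priority".

```python
def choose_extraction_level(cluster_size: int, priority: float,
                            size_cutoff: int = 50,
                            priority_cutoff: float = 10.0) -> str:
    """Pick cluster-level diagnosis, adding trace-level refinement when
    the cluster is large or high-priority (cutoffs are illustrative)."""
    if cluster_size >= size_cutoff or priority >= priority_cutoff:
        return "cluster+trace"
    return "cluster"
```

In practice the cutoffs would likely be tuned against the LLM budget available for diagnosis.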
15.4 Prompt Optimization
Given a diagnosed failure pattern, the prompt optimizer searches for modifications to the agent's prompt configuration that resolve the failure. This is a black-box optimization problem: the prompt is a structured text object, the objective is evaluation performance, and gradients are unavailable.
15.4.1 Problem Formulation
Let $p \in \mathcal{P}$ be a prompt configuration (which may include a system prompt, few-shot examples, tool descriptions, and formatting instructions). Let $\mathcal{T}_{\text{fix}}$ be the set of tasks associated with the target failure cluster and $\mathcal{T}_{\text{hold}}$ be a held-out set of tasks representing the broader distribution. The optimization objective is:
$$p^* = \arg\max_{p \in \mathcal{P}} \left[ \frac{1}{|\mathcal{T}_{\text{fix}}|} \sum_{t \in \mathcal{T}_{\text{fix}}} \text{eval}(p, t) \;-\; \alpha \cdot \max\!\left(0,\; \bar{E}_{\text{hold}}^{(0)} - \frac{1}{|\mathcal{T}_{\text{hold}}|} \sum_{t \in \mathcal{T}_{\text{hold}}} \text{eval}(p, t)\right) \right]$$
where $\text{eval}(p, t)$ is the evaluation score of running the agent with prompt $p$ on task $t$, $\bar{E}_{\text{hold}}^{(0)}$ is the baseline mean performance on the holdout set, and $\alpha > 0$ is a penalty weight that controls the trade-off between fixing the target failure and avoiding regressions. When $\alpha$ is large, the optimizer is conservative; when small, it aggressively pursues the fix at the risk of side effects.
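To make the role of $\alpha$ concrete, the following sketch scores two invented candidates under a small and a large penalty weight. It assumes the objective is the mean fix-set score minus $\alpha$ times any drop of the holdout mean below its baseline, which is the form used by the optimization pseudocode in Section 15.4.2; all scores are made up.

```python
from statistics import mean


def penalized_objective(fix_scores, hold_scores, baseline_hold, alpha):
    """Mean fix score minus alpha times the holdout regression (if any)."""
    regression = max(0.0, baseline_hold - mean(hold_scores))
    return mean(fix_scores) - alpha * regression


baseline_hold = 0.8
# Candidate A fixes the failure well but regresses on the holdout set.
a_fix, a_hold = [0.9, 0.8], [0.7, 0.7]
# Candidate B fixes less but leaves holdout performance untouched.
b_fix, b_hold = [0.6, 0.6], [0.8, 0.8]

a_small = penalized_objective(a_fix, a_hold, baseline_hold, alpha=1.0)
b_small = penalized_objective(b_fix, b_hold, baseline_hold, alpha=1.0)
a_large = penalized_objective(a_fix, a_hold, baseline_hold, alpha=5.0)
b_large = penalized_objective(b_fix, b_hold, baseline_hold, alpha=5.0)
```

With $\alpha = 1$ the aggressive candidate A scores higher; with $\alpha = 5$ the conservative candidate B does. Improvements on the holdout set earn no bonus, because the penalty term is clipped at zero.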
15.4.2 Search Strategy
The search over prompt space uses an evolutionary approach, consistent with the methods surveyed in Part II of this book. A population of candidate prompts is maintained, with mutations generated by an LLM that is given the failure diagnosis and the current prompt. This is similar to the approach taken by EvoPrompt (Guo et al., 2024) and OPRO (Yang et al., 2023), but the mutation operator is failure-directed: instead of generic prompt perturbation, the LLM is asked specifically to modify the prompt in a way that addresses the diagnosed root cause.
# Pseudocode: no public implementation available
# Failure-directed prompt optimization loop
def optimize_prompt(current_prompt, failure_cluster, task_sets, config):
    """Evolutionary prompt optimization targeting a specific failure."""
    population = [current_prompt]
    # Generate initial mutations directed at the failure
    for _ in range(config.initial_population_size - 1):
        mutant = llm_mutate_prompt(
            prompt=current_prompt,
            failure_summary=failure_cluster.summary,
            instruction="Modify this prompt to address the described failure "
                        "pattern while preserving general capability.",
        )
        population.append(mutant)
    best_prompt = current_prompt
    best_score = evaluate_prompt(current_prompt, task_sets, config.alpha)
    for generation in range(config.max_generations):
        # Evaluate all candidates
        scores = []
        for candidate in population:
            fix_score = mean(
                run_eval(candidate, t) for t in task_sets.fix_tasks
            )
            hold_score = mean(
                run_eval(candidate, t) for t in task_sets.holdout_tasks
            )
            regression = max(0, task_sets.baseline_hold - hold_score)
            combined = fix_score - config.alpha * regression
            scores.append(combined)
            if combined > best_score:
                best_score = combined
                best_prompt = candidate
        # Selection and reproduction
        parents = tournament_select(population, scores, k=config.tourney_k)
        offspring = []
        for parent in parents:
            child = llm_mutate_prompt(
                prompt=parent,
                failure_summary=failure_cluster.summary,
                instruction="Further refine this prompt to improve on the "
                            "failure pattern. Be specific and minimal.",
            )
            offspring.append(child)
        # Optional: crossover between high-scoring candidates
        if len(parents) >= 2 and random() < config.crossover_rate:
            crossed = llm_crossover_prompts(parents[0], parents[1])
            offspring.append(crossed)
        population = elitist_replacement(population, offspring, scores)
        # Early stopping if target is met
        if best_score >= config.target_score:
            break
    return best_prompt, best_score
15.4.3 Cost Control
Prompt optimization is expensive: each candidate evaluation requires running the agent on multiple tasks. The NeoSigma design mitigates this through three mechanisms. First, a cascade evaluation strategy (Section 15.5) evaluates candidates on a small, cheap subset before committing to the full evaluation suite. Second, early stopping terminates evaluation of a candidate as soon as it fails on a critical task. Third, budget caps limit the total number of LLM calls per optimization cycle, with unused budget rolling over to the next cycle.
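The first mechanism, cascade evaluation, can be sketched in a few lines. This is an assumed two-stage design (a cheap screening subset followed by the full suite); the function signature, the screening threshold, and the stub evaluator are hypothetical and stand in for real agent runs.

```python
from statistics import mean
from typing import Callable, Sequence


def cascade_evaluate(candidate: str,
                     cheap_tasks: Sequence[str],
                     full_tasks: Sequence[str],
                     run_eval: Callable[[str, str], float],
                     screen_threshold: float) -> tuple[float, int, bool]:
    """Return (score, evaluations_spent, passed_screen)."""
    cheap_score = mean(run_eval(candidate, t) for t in cheap_tasks)
    if cheap_score < screen_threshold:
        # Rejected at screening: the full suite is never run.
        return cheap_score, len(cheap_tasks), False
    full_score = mean(run_eval(candidate, t) for t in full_tasks)
    return full_score, len(cheap_tasks) + len(full_tasks), True


def stub_eval(candidate: str, task: str) -> float:
    """Stand-in for an agent run: 'good' passes everything, others fail."""
    return 1.0 if candidate == "good" else 0.0


rejected = cascade_evaluate("bad", ["a", "b"], ["c", "d", "e"], stub_eval, 0.5)
accepted = cascade_evaluate("good", ["a", "b"], ["c", "d", "e"], stub_eval, 0.5)
```

The weak candidate costs two evaluations instead of five. The saving grows with the size of the full suite and with the fraction of candidates that fail the screen.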
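The total cost of one optimization cycle can be bounded as:
$$C_{\text{cycle}} \leq N_{\text{pop}} \cdot G_{\text{max}} \cdot \bigl( C_{\text{mut}} + (|\mathcal{T}_{\text{fix}}| + |\mathcal{T}_{\text{hold}}|) \cdot C_{\text{eval}} \bigr)$$
where $N_{\text{pop}}$ is the population size, $G_{\text{max}}$ is the maximum number of generations, $C_{\text{mut}}$ is the cost of one LLM mutation call, $C_{\text{eval}}$ is the cost of one agent evaluation, and $|\mathcal{T}_{\text{fix}}| + |\mathcal{T}_{\text{hold}}|$ is the number of tasks on which each candidate is evaluated. In practice, cascade evaluation and early stopping reduce the effective cost well below this upper bound. The exact savings depend on the task distribution and failure rate.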
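For budgeting purposes the bound is easy to evaluate directly. The sketch below assumes that every candidate in every generation costs one mutation call plus one agent evaluation per task; the `n_tasks` parameter and all dollar figures are illustrative assumptions, not measurements.

```python
def cycle_cost_upper_bound(n_pop: int, g_max: int, c_mut: float,
                           c_eval: float, n_tasks: int) -> float:
    """Worst case: each candidate in each generation is mutated once
    and evaluated on all n_tasks tasks."""
    return n_pop * g_max * (c_mut + n_tasks * c_eval)


# Example: 8 candidates, 5 generations, $0.02 per mutation call,
# $0.10 per agent evaluation, 30 tasks per candidate.
bound = cycle_cost_upper_bound(n_pop=8, g_max=5, c_mut=0.02,
                               c_eval=0.10, n_tasks=30)
```

In this example the evaluation term dominates the mutation term by more than two orders of magnitude, which is why the cost-control mechanisms above target evaluations rather than mutation calls.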
15.5 Harness Optimization
A subtler form of self-improvement targets the evaluation infrastructure itself. If the agent's test harness does not cover a failure mode, prompt optimization cannot detect or reward fixes for it. If the scoring function is too coarse, the optimizer cannot distinguish a slightly better candidate from a slightly worse one, so the search has no signal to follow. Harness optimization addresses these gaps.
15.5.1 Test Case Generation from Failures
When the failure miner identifies a novel failure pattern that is not covered by existing test cases, the harness optimizer generates new test cases that exercise the failure mode. The LLM is prompted with the failure cluster summary and asked to produce minimal, reproducible test inputs that would trigger the same failure.
This is related to the fuzz-testing literature (Zeller et al., 2019) but operates at the semantic level rather than the syntactic level: the generated test cases are meaningful tasks that a human would recognize as valid, not random byte sequences. It also draws on the concept of adversarial curriculum generation (Dennis et al., 2020), where the environment is modified to expose agent weaknesses.
15.5.2 Scoring Function Refinement
When the prompt optimizer makes progress on the fix tasks but the improvement does not transfer to production, the scoring function may be misaligned with real-world success criteria. The harness optimizer can propose refinements to the scoring function based on analysis of traces where the automated score disagrees with user feedback. This is a form of reward model refinement applied to agentic evaluation rather than RLHF.
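A starting point for this analysis is simply to collect the traces where the automated verdict and the user disagree. The sketch below assumes each record carries a trace identifier, an automated score, and a boolean user-acceptance signal; this record layout is an assumption for illustration.

```python
def score_feedback_disagreements(records, pass_threshold):
    """Split disagreements into false passes and false fails.

    records: iterable of (trace_id, auto_score, user_accepted) tuples.
    """
    false_pass, false_fail = [], []
    for trace_id, auto_score, user_accepted in records:
        auto_pass = auto_score >= pass_threshold
        if auto_pass and not user_accepted:
            false_pass.append(trace_id)   # scorer too lenient here
        elif not auto_pass and user_accepted:
            false_fail.append(trace_id)   # scorer too strict here
    return false_pass, false_fail


records = [("t1", 0.9, True), ("t2", 0.8, False),
           ("t3", 0.2, True), ("t4", 0.1, False)]
false_pass, false_fail = score_feedback_disagreements(records, pass_threshold=0.5)
```

The two lists point to different refinements: false passes suggest missing checks in the scoring function, while false fails suggest criteria that users do not actually care about.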
15.5.3 Scaffold Modification
Beyond prompts, the agent's behavior is shaped by its scaffolding: the tool descriptions, output parsers, retry logic, and context-management code that frame the LLM's interaction with the environment. The harness optimizer can propose modifications to these components when failure analysis indicates that the root cause is structural (e.g., a tool description that misleads the model, or a context window that truncates critical information).
# Pseudocode: no public implementation available
# Harness optimization: generate test cases from failure clusters
def generate_regression_tests(failure_cluster, existing_tests, config):
    """Create new test cases that exercise a discovered failure mode."""
    # Analyze the failure pattern
    failure_description = failure_cluster.summary
    example_inputs = [t.task_input for t in failure_cluster.example_traces]
    # Generate candidate test cases via LLM
    candidates = llm_generate_test_cases(
        failure_description=failure_description,
        examples=example_inputs,
        existing_tests=existing_tests[:20],  # avoid duplicates
        num_candidates=config.num_candidates,
        instruction="Generate minimal test inputs that would trigger this "
                    "failure pattern. Each test should be self-contained "
                    "and have a clear expected outcome.",
    )
    # Validate: run current agent on each candidate
    validated = []
    for test_case in candidates:
        result = run_agent(test_case.input)
        # Confirm the test actually triggers the failure
        if result.eval_score < test_case.expected_threshold:
            test_case.verified = True
            validated.append(test_case)
    # Deduplicate against existing test suite
    new_tests = deduplicate(validated, existing_tests, similarity_threshold=0.85)
    return new_tests
15.6 Regression Testing
Every improvement carries risk. A prompt change that fixes one failure mode may degrade performance on another. The regression tester provides a statistical guarantee that proposed changes do not introduce regressions beyond an acceptable tolerance.
15.6.1 Statistical Regression Detection
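Given a baseline prompt $p_0$ and a candidate prompt $p_1$, the regression tester evaluates both on a held-out task suite $\mathcal{T}_{\text{reg}}$ and applies a one-sided hypothesis test. The null hypothesis is that $p_1$ is no worse than $p_0$:
$$H_0: \mu(p_1) \geq \mu(p_0) - \delta \quad \text{vs.} \quad H_1: \mu(p_1) < \mu(p_0) - \delta$$
where $\mu(p)$ is the expected evaluation score under prompt $p$ and $\delta \geq 0$ is a non-inferiority margin that allows small, tolerable degradations. The test statistic is:
$$Z = \frac{\bar{s}_1 - \bar{s}_0 + \delta}{\sqrt{\sigma_0^2 / n_0 + \sigma_1^2 / n_1}}$$
where $\bar{s}_0, \bar{s}_1$ are sample means, $\sigma_0^2, \sigma_1^2$ are sample variances, and $n_0, n_1$ are the number of evaluation tasks. If $Z < -z_\alpha$, where $z_\alpha$ is the critical value at significance level $\alpha$ (unrelated to the penalty weight $\alpha$ of Section 15.4.1), the null hypothesis is rejected and the change is flagged as a regression. This adapts the margin-based setup of non-inferiority testing to the agentic evaluation setting. As formulated here, a change passes unless there is statistically significant evidence of a regression larger than $\delta$.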
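The test can be sketched with the Python standard library alone. The code assumes the statistic has the form $Z = (\bar{s}_1 - \bar{s}_0 + \delta) / \sqrt{\sigma_0^2/n_0 + \sigma_1^2/n_1}$ and uses a normal approximation; the function name and the sample scores are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def regression_z_test(baseline_scores, candidate_scores, margin, significance):
    """Return (z, is_regression) for the one-sided margin test."""
    n0, n1 = len(baseline_scores), len(candidate_scores)
    standard_error = sqrt(variance(baseline_scores) / n0
                          + variance(candidate_scores) / n1)
    if standard_error == 0.0:
        raise ValueError("scores have zero variance; Z is undefined")
    z = (mean(candidate_scores) - mean(baseline_scores) + margin) / standard_error
    z_critical = NormalDist().inv_cdf(1.0 - significance)  # about 1.645 at 0.05
    return z, z < -z_critical


baseline = [0.80, 0.82, 0.78, 0.81, 0.79] * 4
clearly_worse = [0.60, 0.62, 0.58, 0.61, 0.59] * 4
about_equal = [0.79, 0.83, 0.77, 0.80, 0.80] * 4

_, worse_flagged = regression_z_test(baseline, clearly_worse, 0.02, 0.05)
_, equal_flagged = regression_z_test(baseline, about_equal, 0.02, 0.05)
```

The candidate that drops by about 0.2 is flagged, while the one within the 0.02 margin passes. For small suites, a t-distribution critical value would be a more careful choice than the normal approximation used here.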
A practical concern is the cost of evaluation. Running both prompts on a large regression suite is expensive. The NeoSigma design uses paired evaluation on the same tasks to reduce variance (enabling smaller sample sizes) and sequential testing (Wald's SPRT) to terminate early when the evidence is clear in either direction.
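The variance reduction from pairing can be seen in a small example. The sketch below tests the per-task score differences instead of the two samples separately; the function name and data are invented, a normal approximation is assumed, and sequential testing (SPRT) is not shown.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev, variance


def paired_regression_test(baseline_scores, candidate_scores, margin, significance):
    """Return (z, standard_error, is_regression) using per-task differences."""
    if len(baseline_scores) != len(candidate_scores):
        raise ValueError("paired evaluation needs one score per task per prompt")
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    standard_error = stdev(diffs) / sqrt(len(diffs))
    if standard_error == 0.0:
        raise ValueError("all differences are identical; Z is undefined")
    z = (mean(diffs) + margin) / standard_error
    z_critical = NormalDist().inv_cdf(1.0 - significance)
    return z, standard_error, z < -z_critical


# Task difficulty varies widely, but the candidate is about 0.05 worse on each task.
baseline = [0.2, 0.5, 0.9, 0.4, 0.7, 0.3, 0.8, 0.6]
candidate = [0.15, 0.44, 0.86, 0.35, 0.64, 0.26, 0.75, 0.55]

z, paired_se, is_regression = paired_regression_test(baseline, candidate, 0.01, 0.05)
# Standard error of the unpaired comparison on the same data, for contrast.
unpaired_se = sqrt(variance(baseline) / len(baseline)
                   + variance(candidate) / len(candidate))
```

Because between-task variation cancels in the differences, the paired standard error here is far smaller than the unpaired one, and the consistent 0.05 drop is detected with only eight tasks. With so few tasks, a t-distribution critical value would be more appropriate than the normal approximation.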
15.6.2 Multi-Dimensional Regression
Agent performance is rarely captured by a single metric. A prompt change might improve code correctness but degrade code style, or improve task completion rate but increase latency. The regression tester checks multiple metrics simultaneously, applying a Bonferroni correction or false-discovery-rate control to maintain the overall Type I error rate.
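As an illustration of the Bonferroni option, the sketch below divides the family-wise error rate evenly across metrics and flags any metric whose regression p-value falls below the per-metric level. The metric names and p-values are invented; each p-value is assumed to come from a one-sided regression test on that metric.

```python
def bonferroni_gate(p_values: dict, family_alpha: float):
    """Return (per_test_alpha, metrics flagged as regressions)."""
    per_test_alpha = family_alpha / len(p_values)
    flagged = [name for name, p in p_values.items() if p < per_test_alpha]
    return per_test_alpha, flagged


p_values = {"correctness": 0.20, "style": 0.012, "latency": 0.04}
per_test_alpha, flagged = bonferroni_gate(p_values, family_alpha=0.05)
```

With three metrics the per-metric level is about 0.0167, so only `style` is flagged. `latency` would have been flagged at an uncorrected 0.05 level, which is the kind of spurious rejection the correction is meant to limit.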
15.7 The Autonomous Improvement Cycle
The five components described above compose into a closed-loop cycle. The following diagram and pseudocode describe the complete autonomous improvement cycle as orchestrated by the improvement controller.
Figure 15.2: The autonomous improvement cycle. Steps 1–6 repeat continuously. If regression testing (step 4) rejects a candidate, the cycle returns to step 3 for further optimization. Dashed arrow indicates the retry path.
15.7.1 Cycle Orchestration
# Pseudocode: no public implementation available
# The complete autonomous improvement cycle
def run_improvement_cycle(agent, trace_store, test_suite, config):
    """One iteration of the NeoSigma autonomous improvement cycle."""
    # Step 1: Mine failures from recent production traces
    failure_clusters = mine_failures(
        trace_store, window_hours=config.mining_window
    )
    if not failure_clusters:
        log("No actionable failures found. Cycle idle.")
        return NoChange()
    # Step 2: Prioritize and select the top target
    target = failure_clusters[0]  # highest-priority cluster
    log(f"Targeting failure cluster: {target.summary} "
        f"(priority={target.priority:.2f}, count={target.count})")
    # Step 3: Determine optimization type and execute
    if target.root_cause_type == "prompt":
        candidate, score = optimize_prompt(
            agent.current_prompt, target, test_suite, config.prompt_opt
        )
        change = PromptChange(old=agent.current_prompt, new=candidate)
    elif target.root_cause_type == "harness":
        new_tests = generate_regression_tests(
            target, test_suite.existing_tests, config.harness_opt
        )
        change = HarnessChange(new_tests=new_tests)
    elif target.root_cause_type == "scaffold":
        candidate = optimize_scaffold(
            agent.scaffold, target, test_suite, config.scaffold_opt
        )
        change = ScaffoldChange(old=agent.scaffold, new=candidate)
    else:
        log(f"Unknown root cause type: {target.root_cause_type}. Skipping.")
        return NoChange()
    # Step 4: Regression testing
    for attempt in range(config.max_regression_retries):
        regression_result = regression_test(
            baseline=agent, candidate=change,
            test_suite=test_suite.regression_tasks,
            significance=config.significance_level,
            margin=config.non_inferiority_margin,
        )
        if regression_result.passed:
            break
        else:
            log(f"Regression detected: {regression_result.details}. "
                f"Retry {attempt + 1}/{config.max_regression_retries}")
            # Refine the candidate, constrained to avoid the regression
            change = refine_candidate(change, regression_result, target)
    else:
        # for/else: runs only if the loop finished without a break
        log("Max retries exceeded. Abandoning this improvement target.")
        return Rejected(target=target, reason="persistent_regression")
    # Step 5: Deployment gate
    if change.risk_level > config.auto_deploy_threshold:
        log("High-risk change. Queuing for human review.")
        return PendingReview(change=change, target=target)
    # Step 6: Deploy
    apply_change(agent, change)
    log(f"Deployed change for cluster {target.id}. "
        f"Monitoring for {config.monitoring_hours}h.")
    return Deployed(change=change, target=target)
15.7.2 Convergence Properties
A natural question is whether the improvement cycle converges, i.e., whether repeated application eventually reaches a stable configuration. Define the agent's performance at cycle $k$ as $V_k = \mathbb{E}_{t \sim \mathcal{D}} [\text{eval}(p_k, t)]$, where $\mathcal{D}$ is the production task distribution and $p_k$ is the prompt configuration after $k$ improvement cycles. Under the NeoSigma design, the regression gate is intended to ensure, up to the error rates of the statistical test:
$$V_{k+1} \geq V_k - \delta$$
where $\delta$ is the non-inferiority margin from the regression test. This is a monotonic improvement guarantee up to tolerance $\delta$. In the idealized case where $\delta = 0$, the regression suite is representative of $\mathcal{D}$, and the regression test has perfect power, the sequence $\{V_k\}$ is non-decreasing and bounded (by the maximum achievable score), so it converges.
In practice, several factors complicate convergence. First, the regression test has finite power: it may fail to detect small regressions, allowing gradual drift. Second, the task distribution $\mathcal{D}$ is non-stationary in production, so the target is moving. Third, the optimization landscape is non-convex and the evolutionary search may get stuck in local optima. The practical implication is that the cycle should be monitored for effective convergence — diminishing improvement per cycle — and paused when marginal gains fall below the cost of the cycle itself.
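One way to operationalize "effective convergence" is a stopping rule that compares the recent average gain, converted to value, against the cost of a cycle. The sketch below is an assumption about how such a rule might look; the window length and the `value_per_unit_gain` conversion factor are hypothetical parameters that an operator would have to supply.

```python
from statistics import mean


def should_pause_cycle(recent_gains, cycle_cost, value_per_unit_gain, window=3):
    """Pause when the average gain over the last `window` cycles is worth
    less than one cycle costs. Gains are changes in V_k per cycle."""
    if len(recent_gains) < window:
        return False  # not enough history to judge
    expected_gain = mean(recent_gains[-window:])
    return expected_gain * value_per_unit_gain < cycle_cost


diminishing = [0.05, 0.03, 0.002, 0.001, 0.001]
still_improving = [0.05, 0.04, 0.03]
```

Averaging over a window rather than using the last cycle alone avoids pausing after a single unlucky cycle, at the price of reacting more slowly to a genuine plateau.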
15.8 Production Feedback Loops
The distinction between offline self-improvement and production-coupled self-improvement is crucial. Offline systems optimize against a fixed benchmark; production-coupled systems incorporate real-world signals that evolve with the deployment context.
15.8.1 Signal Types
Production feedback comes in several forms that differ in latency and quality:
| Signal Type | Latency | Quality | Example |
|---|---|---|---|
| Explicit feedback | Minutes | High | User thumbs-up/down, correction edits |
| Implicit behavioral | Seconds | Medium | Acceptance rate, edit distance, time-to-accept |
| Downstream outcome | Hours–Days | High | CI pass rate, deployment success, test results |
| Automated evaluation | Seconds | Variable | LLM-as-judge, static analysis, test execution |
The key challenge is feedback delay. Explicit human feedback is high-quality but arrives slowly and sparsely. Automated evaluation is fast but may be misaligned with real-world utility. The NeoSigma design uses a multi-signal aggregation strategy: fast, low-quality signals drive the failure miner's real-time prioritization, while slow, high-quality signals calibrate the evaluation harness over time.
15.8.2 Safety Constraints on Production Feedback
A self-improving system that learns from production feedback faces a feedback loop risk: if the agent's behavior influences the feedback signal, the system can enter a positive feedback loop that amplifies errors. For example, if the agent generates outputs that users accept out of convenience rather than quality, the feedback signal will appear positive even as quality degrades.
Mitigation strategies include: (1) maintaining a fixed, human-curated evaluation set that is never modified by the improvement cycle, serving as an anchor; (2) monitoring distributional shift between the agent's outputs and a reference distribution; and (3) rate-limiting the improvement cycle so that changes propagate slowly enough for human oversight to catch systematic drift.
15.9 Comparative Analysis
The following table positions the NeoSigma reference architecture against related self-improving agent approaches discussed elsewhere in this survey. All characterizations are based on published descriptions of each system.
| System | Failure Mining | Prompt Opt. | Harness Opt. | Regression Gate | Prod. Feedback |
|---|---|---|---|---|---|
| Reflexion | Per-episode | Verbal only | — | — | — |
| ExpeL | Insight extraction | Few-shot selection | — | — | — |
| DSPy | — | Systematic | — | Dev set | — |
| OPRO | — | Evolutionary | — | — | — |
| Voyager | Per-episode | — | Skill library | — | Env. feedback |
| NeoSigma (reference) | Clustered + ranked | Failure-directed evo. | Test gen + scoring | Statistical gate | Multi-signal |
The key observation is that most existing systems implement one or two of these capabilities in isolation. Reflexion and ExpeL provide failure-driven learning but do not systematically optimize prompts or guard against regressions. DSPy provides systematic prompt optimization but does not mine failures from production traces. Voyager builds a skill library (a form of harness improvement) but operates in a simulation environment without production feedback. The NeoSigma architecture's contribution is to formalize how these capabilities compose and to identify the interaction effects — particularly the critical role of regression testing as a safety mechanism that enables aggressive optimization.
It should be noted that this comparison is based on published system descriptions, and some systems may implement additional capabilities not documented in their papers. The comparison is intended to illustrate architectural patterns, not to rank systems.
15.10 Limitations & Open Questions
15.10.1 Evaluation Fidelity
The entire improvement cycle is only as good as its evaluation signal. If the automated evaluator is misaligned with true task success, a pervasive problem in LLM evaluation (Zheng et al., 2024), the optimizer may "improve" on the metric while degrading real-world performance. This is the self-improving analog of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Harness optimization partially addresses this, but cannot fully solve it without ground-truth supervision.
15.10.2 Compounding Errors
Even with a non-inferiority margin of $\delta = 0.01$, after 100 improvement cycles the cumulative allowed degradation is $100\delta = 1.0$, which spans the entire $[0,1]$ score range: in the worst case, the gate alone does not prevent the agent from losing all of its performance through a sequence of individually tolerable steps. This accumulation of per-step tolerances is a well-known problem in sequential decision-making. Practical mitigation requires periodic re-baselining against a gold-standard evaluation set and setting $\delta$ in the context of the expected number of cycles.
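The arithmetic suggests a simple budgeting rule: fix a total tolerance for the period between re-baselinings and divide it by the expected number of cycles. The sketch below assumes worst-case additive accumulation; the function names and the 0.05 total tolerance are illustrative.

```python
def worst_case_drift(delta: float, cycles: int) -> float:
    """Cumulative degradation if every cycle uses its full margin."""
    return delta * cycles


def per_cycle_margin(total_tolerance: float, expected_cycles: int) -> float:
    """Margin per cycle that keeps worst-case drift within total_tolerance."""
    if expected_cycles <= 0:
        raise ValueError("expected_cycles must be positive")
    return total_tolerance / expected_cycles


drift = worst_case_drift(delta=0.01, cycles=100)
margin = per_cycle_margin(total_tolerance=0.05, expected_cycles=100)
```

Keeping worst-case drift under 0.05 over 100 cycles requires a per-cycle margin of 0.0005. A margin that tight needs a large or well-paired regression suite to test with useful power, which ties this limitation back to the evaluation cost discussed in Section 15.6.1.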
15.10.3 Distribution Shift
Production task distributions change over time — new programming languages emerge, coding styles shift, user expectations evolve. An improvement cycle optimized for last month's failures may not address next month's challenges. The system must continuously refresh its failure mining and re-evaluate the relevance of existing test cases.
15.10.4 Cost
Autonomous improvement cycles are expensive. Each cycle involves failure mining (LLM calls for clustering and summarization), prompt optimization (many agent evaluations), and regression testing (full suite evaluation). For a production agent handling thousands of tasks per day, the improvement cycle's compute budget may rival the agent's own operational cost. The economics are favorable only when the agent's error rate is high enough that improvements yield measurable value.
15.10.5 Lack of Empirical Validation
As noted in the evidential disclaimer at the start of this chapter, NeoSigma is a reference architecture, not a deployed system with published benchmarks. The convergence properties discussed in Section 15.7.2 are theoretical, and the practical effectiveness of the full cycle — with all five components operating jointly — has not been empirically validated in a controlled study. Individual components (prompt optimization, regression testing) have been validated in isolation by the systems cited above, but the composition remains an open research question.
15.10.6 Safety and Alignment
A self-improving agent system raises alignment concerns. If the improvement cycle modifies the agent's behavior autonomously, it may drift in directions that are operationally effective but undesirable from a safety or policy perspective. The deployment gate (Section 15.2.1) provides a coarse control mechanism, but more sophisticated oversight — such as constitutional AI constraints on the improvement process itself, or formal verification of prompt invariants — may be necessary for high-stakes deployments.
15.11 Summary
Key takeaway. Self-improving agentic systems require the coordinated operation of five capabilities: failure mining, prompt optimization, harness optimization, regression testing, and production feedback integration. The NeoSigma reference architecture formalizes how these capabilities compose into a closed-loop autonomous improvement cycle with monotonic performance guarantees (up to a non-inferiority margin) enforced by statistical regression gating.
Main contribution. By synthesizing patterns from Reflexion, ExpeL, DSPy, OPRO, and production-grade agent systems, the NeoSigma framework identifies that the regression testing gate is the critical enabler of autonomous improvement: without it, optimization may improve targeted failures while degrading overall performance, but with it, aggressive failure-directed prompt search becomes safe. The framework also highlights the under-explored problem of harness optimization — improving the evaluation infrastructure itself — as a bottleneck for sustained self-improvement.
For researchers. The most important open question is empirical: does the full five-component cycle, operating jointly in production, converge to meaningfully better performance than any single component in isolation? The individual primitives are well-studied, but their interaction effects — including compounding regression tolerance, evaluation-metric gaming, and distribution shift — remain largely unexplored. Controlled experiments comparing full-cycle operation against ablations removing each component would provide the strongest evidence for or against the integrated design.