IronCarapace Efficiency Experiment Suite #1
IronCarapace Efficiency Experiment — Pre-Registration
Version: 1.0.0
Pre-registered: 2026-04-04
Git SHA at registration: (recorded by harness at run time)
Authors: IronCarapace project
This document is committed to the repository before any experiment data is collected.
It may not be amended after the first make run-experiment invocation. Any deviation from
this plan in the final analysis must be disclosed explicitly alongside the results.
1. Research Question
Does IronCarapace's Verification-Driven Development (VDD) pipeline — where deterministic external tools (mutation testing, property-based testing, type checking, contract validation) replace LLM self-evaluation loops — produce verified code at lower total LLM inference cost than alternative approaches, without sacrificing output quality?
A null result is acceptable. The goal is the truth, not confirmation of a prior belief.
2. Experimental Conditions
Three conditions run on every benchmark task:
| Condition | Label | Description |
|-----------|-------|-------------|
| A | vdd | Full IronCarapace VDD pipeline. All 7 deterministic gates active. Chainlink phases SPECIFY → BUILD → ATTACK → HARDEN → CONFESS. |
| B | directed | Directed workflow only. Same structured prompting and phase gates as Condition A, but deterministic verification tools are disabled. When a gate would fire, Claude is instead asked to self-evaluate the code against the specification and identify any issues. |
| C | vanilla | Vanilla Claude. Raw API call with task description as user message. Standard system prompt ("You are an expert software engineer. Write correct, idiomatic code."). Generate, run oracle tests externally, retry if failing, stop at budget or pass. No phase structure. |
Every task runs in all three conditions. Conditions for a given task run sequentially
(A then B then C) with a 60-second pause between conditions to avoid cache cross-contamination.
Task ordering across the suite is randomized per the seed in config.yaml.
3. Benchmark Task Suite
3.1 Task structure
Each task is a YAML file specifying:
id, tier (1/2/3), type (generation/repair), language (python/rust)
description: the exact natural-language specification passed to all three conditions
acceptance_criteria: human-written, for documentation only — not passed to agents
oracle_tests: path to post-hoc quality measurement tests (never shown to any agent)
reference_implementation: human-written ground truth
max_retries: budget cap applied equally to all conditions
timeout_seconds: wall-clock limit per condition
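As an illustration only — the task id, description, paths, and budget values below are hypothetical, not drawn from the actual suite — a Tier 1 generation task file might look like:

```yaml
id: t1-py-example          # hypothetical id, not a real suite task
tier: 1
type: generation
language: python
description: >
  Write a function rle_encode(s) that run-length encodes an ASCII
  string, and rle_decode(s) that inverts it.
acceptance_criteria: >
  Round-trips arbitrary ASCII input, including the empty string.
  (Documentation only; never passed to agents.)
oracle_tests: experiments/oracle/t1-py-example/
reference_implementation: experiments/tasks/ref/t1_py_example.py
max_retries: 3
timeout_seconds: 600
```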
3.2 Composition target
| Tier | Description | Python | Rust | Total |
|------|-------------|--------|------|-------|
| 1 | Pure function, single file, <100 lines | 10 | 5 | 15 |
| 2 | Multi-function module, error handling, state | 10 | 5 | 15 |
| 3 | System-level correctness, concurrency, protocols | 8 | 4 | 12 |
| repair | Given broken code, identify and fix the defects | 6 | 2 | 8 |
| Total | | 34 | 16 | 50 |
Minimum for adequate statistical power: 50 paired tasks (see §7).
The v1 suite targets 50 tasks. Additional tasks may be added to v2 after analysis.
3.3 Oracle test isolation
Oracle tests live exclusively in experiments/oracle/. The harness never passes this directory path to any agent. Condition C's vanilla runner receives only the description string. Oracle tests are run post-hoc as a measurement tool after all three conditions have completed a task.
3.4 Repair task construction
Each repair task provides a broken_implementation file as input context (in addition to the description). The broken implementation contains 2–4 seeded defects of known types (off-by-one, race condition, incorrect invariant). The agent's task is to correct it.
The defect types and locations are documented in the task YAML but not passed to agents.
4. Hypotheses
All hypotheses use two-tailed tests unless a direction is explicitly pre-registered.
Directional (one-tailed) hypotheses are marked *.
Primary
H₁ — Token efficiency *
Condition A (VDD) uses fewer total LLM tokens per successfully-completed task than
Condition C (vanilla).
Metric: total_input_tokens + total_output_tokens where oracle_pass_rate = 1.0
Test: Wilcoxon signed-rank (paired)
Direction: A < C
H₂ — Quality parity or improvement *
Condition A's oracle test pass rate is ≥ Condition C's on the same tasks.
Metric: oracle_pass_rate (fraction of oracle tests passing on final output)
Test: Wilcoxon signed-rank (paired)
Direction: A ≥ C
H₃ — VDD gate isolation
Condition B (directed, no gates) uses fewer tokens than Condition C (vanilla), independent of the VDD gates. Tests whether the structured workflow alone explains savings.
Metric: total_tokens per task
Test: Wilcoxon signed-rank (paired)
Note: If H₃ is true with effect size comparable to A vs C, gates are not the mechanism.
Secondary
H₄ — Retry reduction *
Condition A produces fewer LLM retry calls per successfully-completed task than Condition C.
Metric: retry_call_count (calls where retry_number > 0)
Direction: A < C
H₅ — Tier-stratified efficiency
The token efficiency advantage of A vs C is larger for Tier 3 tasks than Tier 1 tasks.
Metric: Per-tier median token delta (A − C)
Test: Mann-Whitney U on tier subgroups
Pre-registered direction: Tier 3 delta > Tier 1 delta (i.e., VDD helps more on complexity)
H₆ — Repair task amplification *
On repair tasks, the efficiency advantage of A vs C is larger than on generation tasks.
Metric: Token delta (A − C) by task type
Direction: repair delta > generation delta
H₇ — Prompt cache differential
The fraction of total input tokens served from prompt cache (`cache_read_tokens / total_input_tokens`) differs between Condition A and Condition C.
Test: Wilcoxon signed-rank (paired)
Note: If A has significantly higher cache fraction, some token savings are cache artifacts.
H₈ — Gate compute amortization
Total dollar cost per passing task (LLM tokens at API pricing + gate CPU at $0.048/cpu-hour)
is lower for Condition A than Condition C.
Metric: total_cost_usd including gate compute
Test: Wilcoxon signed-rank (paired)
Note: The CPU rate of $0.048/cpu-hour is the AWS c5.4xlarge on-demand rate, used as a reference price.
H₉ — Quality monotonicity
The oracle pass rate gap (A − C) increases from Tier 1 to Tier 3.
Metric: oracle_pass_rate_A − oracle_pass_rate_C by tier
H₁₀ — Cross-language replication
The direction of H₁ (A < C in tokens) holds for both Python and Rust subsets.
Test: Wilcoxon signed-rank within each language subset
Note: Magnitude need not be equal; direction must replicate for the claim to generalize.
H₁₁ — First-attempt success *
Condition A's rate of tasks where the first LLM generation call produces oracle-passing code is higher than Condition C's.
Metric: Fraction of tasks where retry_number = 0 at task completion
Direction: A > C
H₁₂ — Cost-per-passing-task
Cost per passing task (total_cost_usd / oracle_passing_tasks) is lower for A than C.
Complements H₁ by weighting for quality (cheap failed outputs don't count).
Test: Mann-Whitney U (not paired, because failed tasks don't have a pair)
5. Primary Metrics
Ordered by priority. The first two are primary; the rest are secondary.
cost_per_passing_task_usd — total API cost divided by the number of tasks where oracle_pass_rate = 1.0
oracle_pass_rate — fraction of oracle tests passing on final agent output, 0–1
total_tokens_per_task — input + output tokens per task
retry_call_count_per_task — LLM calls where retry_number > 0
mutation_survival_rate_on_output — measured post-hoc by running gate pipeline as oracle
cache_read_fraction — cache_read_tokens / total_input_tokens
gate_cpu_ms_per_task — sum of all gate CPU time (Condition A only)
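A minimal sketch of how these per-task metrics compose, assuming hypothetical record field names (the harness's actual schema may differ); the gate CPU rate is the $0.048/cpu-hour reference price from H₈:

```python
# Sketch of the §5 metric computations. Record keys (input_tokens,
# cache_read_tokens, gate_cpu_ms, ...) are assumed names for illustration.
GATE_CPU_USD_PER_MS = 0.048 / (3600 * 1000)  # H8 reference price per cpu-ms

def task_metrics(rec: dict) -> dict:
    total_input = rec["input_tokens"]
    gate_cost = rec.get("gate_cpu_ms", 0) * GATE_CPU_USD_PER_MS
    return {
        "total_tokens_per_task": total_input + rec["output_tokens"],
        "cache_read_fraction": rec["cache_read_tokens"] / total_input,
        "total_cost_usd": rec["api_cost_usd"] + gate_cost,
    }

def cost_per_passing_task(records: list[dict]) -> float:
    # Total cost over ALL tasks, divided by tasks with a perfect oracle
    # pass rate -- cheap failed outputs do not count (cf. H12).
    passing = [r for r in records if r["oracle_pass_rate"] == 1.0]
    total = sum(task_metrics(r)["total_cost_usd"] for r in records)
    return total / len(passing)
```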
6. Statistical Analysis Plan
6.1 Primary test
Wilcoxon signed-rank test (paired): one-tailed for the starred directional hypotheses H₁/H₂/H₄/H₆; two-tailed for H₃/H₇/H₈.
α = 0.05 for all primary hypotheses.
Multiple comparison correction: Bonferroni adjustment across all 12 hypotheses (α_adjusted = 0.05 / 12 = 0.0042).
For hypotheses with secondary analyses on subgroups (H₅, H₉, H₁₀), the subgroup tests are exploratory and not Bonferroni-corrected; they are reported descriptively.
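The mechanics of the paired one-tailed test can be sketched as follows. This is illustrative only — the actual analysis would use scipy.stats.wilcoxon(a, b, alternative="less") — and it uses the normal approximation without zero/tie variance corrections:

```python
import math

def wilcoxon_one_tailed_less(a: list[float], b: list[float]) -> float:
    """Paired Wilcoxon signed-rank p-value for H1: a < b.

    Sketch only: normal approximation, no continuity or tie-variance
    correction. Real runs should use scipy.stats.wilcoxon.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    # Rank absolute differences, averaging ranks across ties.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # lower-tail P(W+ <= w)
```

Significance is then assessed against the Bonferroni-adjusted threshold of 0.05 / 12 ≈ 0.0042.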
6.2 Effect sizes
Report Cohen's d (for approximately normal distributions) or rank-biserial correlation (for Wilcoxon tests) for every hypothesis test. P-values alone are not reported without effect sizes.
6.3 Confidence intervals
Bootstrap 95% CIs (10,000 resamples) on all primary metrics. Percentile method.
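A minimal sketch of the percentile-method bootstrap described above, here for the median of a metric (function and parameter names are illustrative):

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=20260404):
    """Percentile-method bootstrap CI for the median (cf. §6.3).

    Sketch only: resample with replacement, compute the statistic on
    each resample, and read off the empirical alpha/2 and 1-alpha/2
    percentiles of the resampled statistics.
    """
    rng = random.Random(seed)  # seeded for reproducibility (§8.2)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = medians[int(alpha / 2 * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```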
6.4 Distribution checks
Shapiro-Wilk normality test on each metric before choosing parametric vs. non-parametric.
All token-count metrics are expected to be right-skewed (non-normal); use Wilcoxon.
Oracle pass rates are bounded [0,1]; use Wilcoxon.
6.5 Missing data and failures
Tasks where any condition results in AGENT_CRASHED, API_BUDGET_EXCEEDED, or REFUSED are excluded from paired tests for that metric but included in a failure-rate analysis.
Do not silently drop failures. Report failure counts per condition.
Rate-limit events (HTTP 429) are recorded. If rate-limit frequency differs significantly between conditions (>2× rate), flag as a potential confound and report separately.
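The >2× flag rule above might be implemented as follows. This is a sketch: the function name is hypothetical, and treating "some 429s vs none at all" as a flag is an added assumption, not part of the pre-registered rule:

```python
def rate_limit_confound(count_a: int, count_c: int) -> bool:
    """Flag a potential confound when one condition's HTTP 429 count
    exceeds twice the other's (the >2x rule of section 6.5).

    Sketch only. Assumption beyond the plan text: any 429s against
    zero in the other condition is also flagged, since the ratio is
    undefined there.
    """
    lo, hi = sorted((count_a, count_c))
    if lo == 0:
        return hi > 0
    return hi / lo > 2
```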
6.6 Pre-specified stopping rule
Data collection runs until all 50 tasks complete in all 3 conditions or the wall-clock budget of 72 hours is exhausted. Do not add tasks after examining partial results.
Do not re-run failed tasks in a way that selectively includes successes.
7. Sample Size Justification
Target: N = 50 paired tasks.
Power analysis for Wilcoxon signed-rank test:
α = 0.05 (two-tailed), power = 0.80
For medium effect (r = 0.30): N ≈ 33 pairs
For small-medium effect (r = 0.20): N ≈ 72 pairs
50 tasks provides >80% power to detect a medium effect. The hypothesized 35–40% token reduction corresponds to approximately r = 0.40–0.50 (large effect) if the variance across tasks is moderate, in which case N = 50 is well-powered.
If the true effect is small (r ≤ 0.20), N = 50 is underpowered. In that case, the analysis reports the observed effect size and CIs honestly and notes the power limitation.
A post-hoc sensitivity analysis will be reported: "given the observed effect size, what N would have been required for 80% power?"
8. Reproducibility Requirements
8.1 Code and environment
All containers run from pinned image digests (not floating tags).
Python environment defined by experiments/requirements.txt (pinned, hash-verified).
Rust toolchain pinned in experiments/tasks/rust-toolchain.toml.
cargo-mutants version pinned, since the generated mutant set changes between versions.
The harness checks git diff --quiet at startup and refuses to run on a dirty tree.
8.2 Randomness
All random seeds in experiments/config.yaml:
task_order_seed: randomizes task execution order within each condition
hypothesis_seed: base seed for Hypothesis strategies (per-task seed = base + task_index)
mutation_seed: base seed for mutation tool (cargo-mutants --jobs order)
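The per-task seed derivation is stated above (base + task_index); a sketch of it, plus an illustrative deterministic task-order shuffle (the shuffle algorithm itself is an assumption, not necessarily what the harness does):

```python
import random

def per_task_seed(base_seed: int, task_index: int) -> int:
    # Section 8.2: per-task Hypothesis seed = base seed + task index.
    return base_seed + task_index

def task_order(task_ids: list[str], task_order_seed: int) -> list[str]:
    # Illustrative deterministic shuffle driven by task_order_seed.
    rng = random.Random(task_order_seed)
    order = list(task_ids)
    rng.shuffle(order)
    return order
```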
8.3 Model pinning
The exact model ID string returned by the first API response is recorded in the manifest.
If it differs from the configured model ID (e.g., Anthropic silently updated the checkpoint), the run is flagged with a warning in the manifest.
8.4 API access requirement
**This experiment requires direct Anthropic API access (SDK) for accurate per-call token counting.** Claude CLI does not expose per-call token usage in a machine-readable form.
Condition A uses the IronCarapace Chainlink pipeline with TokenLedger instrumentation added to invoke_claude(). Conditions B and C use the anthropic Python SDK directly.
If running without API credits, set config.token_counting: estimated to fall back to character-count estimation (chars / 4.2). Estimated counts are clearly marked in all output and must not be used for the primary analysis — only for directional exploration.
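The fallback estimator is simple enough to state exactly (a sketch of the chars / 4.2 rule above; the function name is illustrative):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.2) -> int:
    """Character-count token estimate (section 8.4 fallback).

    Used only when config.token_counting is "estimated"; such counts
    are marked in output and excluded from the primary analysis.
    """
    return round(len(text) / chars_per_token)
```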
8.5 Single-command reproduction
cd experiments
make run-experiment SEED=20260404 CONFIG=config.yaml
This command:
Verifies the git tree is clean and records the SHA
Writes the manifest
Runs all 50 tasks × 3 conditions
Runs oracle measurement on all outputs
Runs the statistical analysis
Writes results/report-<run_id>.md
No manual steps. No environment variables beyond those in config.yaml.
9. What Null Results Look Like
These outcomes are pre-specified as valid scientific findings, not failures:
| Outcome | Interpretation |
|---------|---------------|
| A ≈ C in tokens, A > C in quality | IronCarapace is a quality tool, not a cost tool. Pitch changes. |
| A > C in tokens (IronCarapace more expensive) | Gate overhead exceeds retry savings at these task sizes. Scope the claim to Tier 3+ tasks. |
| H₃ true, A ≈ B in tokens | Savings are from structured workflow, not deterministic gates. Reframe the mechanism claim. |
| H₁₀ fails in Rust | Gate efficiency is language-tooling-dependent. Python-specific claim only. |
| Ablation shows 1–2 gates drive 80% of savings | Report per-gate contribution. Feature prioritization follows from data. |
| Rate-limit confound detected | Inconclusive for affected metrics. Re-run with rate-limiting equalized. |
10. Amendments
Any deviation from this plan (additional analyses, excluded tasks, changed metrics) must be documented in experiments/AMENDMENTS.md with:
The SHA of the commit introducing the amendment
The reason for the amendment
Whether the amendment was made before or after seeing any task results
Amendments made after seeing results are permissible as exploratory analysis but must be clearly labeled as post-hoc and not included in the primary Bonferroni-adjusted hypothesis tests.