IronCarapace Efficiency Experiment Suite #1
IronCarapace Efficiency Experiment — Pre-Registration
Version: 1.0.0
Pre-registered: 2026-04-04
Git SHA at registration: (recorded by harness at run time)
Authors: IronCarapace project
This document is committed to the repository before any experiment data is collected.
It may not be amended after the first make run-experiment invocation. Any deviation from
this plan in the final analysis must be disclosed explicitly alongside the results.
1. Research Question
Does IronCarapace's Verification-Driven Development (VDD) pipeline — where deterministic external tools (mutation testing, property-based testing, type checking, contract validation) replace LLM self-evaluation loops — produce verified code at lower total LLM inference cost than alternative approaches, without sacrificing output quality?
A null result is acceptable. The goal is the truth, not confirmation of a prior belief.
2. Experimental Conditions
Three conditions run on every benchmark task:
| Condition | Label | Description |
|-----------|-------|-------------|
| A | vdd | Full IronCarapace VDD pipeline. All 7 deterministic gates active. Chainlink phases SPECIFY → BUILD → ATTACK → HARDEN → CONFESS. |
| B | directed | Directed workflow only. Same structured prompting and phase gates as Condition A, but deterministic verification tools are disabled. When a gate would fire, Claude is instead asked to self-evaluate the code against the specification and identify any issues. |
| C | vanilla | Vanilla Claude. Raw API call with task description as user message. Standard system prompt ("You are an expert software engineer. Write correct, idiomatic code."). Generate, run oracle tests externally, retry if failing, stop at budget or pass. No phase structure. |
Every task runs in all three conditions. Conditions for a given task run sequentially
(A then B then C) with a 60-second pause between conditions to avoid cache cross-contamination.
Task ordering across the suite is randomized per the seed in config.yaml.
3. Benchmark Task Suite
3.1 Task structure
Each task is a YAML file specifying:
id, tier (1/2/3), type (generation/repair), language (python/rust)
description: the exact natural-language specification passed to all three conditions
acceptance_criteria: human-written, for documentation only — not passed to agents
oracle_tests: path to post-hoc quality measurement tests (never shown to any agent)
reference_implementation: human-written ground truth
max_retries: budget cap applied equally to all conditions
timeout_seconds: wall-clock limit per condition
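As an illustration only — the task id, description, paths, and budget values below are hypothetical, not drawn from the actual suite — a Tier 1 generation task file might look like:

```yaml
id: t1-py-example          # hypothetical id, not a real suite task
tier: 1
type: generation
language: python
description: >
  Write a function rle_encode(s) that run-length encodes an ASCII
  string, and rle_decode(s) that inverts it.
acceptance_criteria: >
  Round-trips arbitrary ASCII input, including the empty string.
  (Documentation only; never passed to agents.)
oracle_tests: experiments/oracle/t1-py-example/
reference_implementation: experiments/tasks/ref/t1_py_example.py
max_retries: 3
timeout_seconds: 600
```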
3.2 Composition target
| Tier | Description | Python | Rust | Total |
|------|-------------|--------|------|-------|
| 1 | Pure function, single file, <100 lines | 10 | 5 | 15 |
| 2 | Multi-function module, error handling, state | 10 | 5 | 15 |
| 3 | System-level correctness, concurrency, protocols | 8 | 4 | 12 |
| repair | Given broken code, identify and fix the defects | 6 | 2 | 8 |
| Total | | 34 | 16 | 50 |
Minimum for adequate statistical power: 50 paired tasks (see §7).
The v1 suite targets 50 tasks. Additional tasks may be added to v2 after analysis.
3.3 Oracle test isolation
Oracle tests live exclusively in experiments/oracle/. The harness never passes this directory path to any agent. Condition C's vanilla runner receives only the description string. Oracle tests are run post-hoc as a measurement tool after all three conditions have completed a task.
3.4 Repair task construction
Each repair task provides a broken_implementation file as input context (in addition to the description). The broken implementation contains 2–4 seeded defects of known types (off-by-one, race condition, incorrect invariant). The agent's task is to correct it.
The defect types and locations are documented in the task YAML but not passed to agents.
4. Hypotheses
All hypotheses use two-tailed tests unless a direction is explicitly pre-registered.
Directional (one-tailed) hypotheses are marked *.
Primary
H₁ — Token efficiency *
Condition A (VDD) uses fewer total LLM tokens per successfully-completed task than
Condition C (vanilla).
Metric: total_input_tokens + total_output_tokens where oracle_pass_rate = 1.0
Test: Wilcoxon signed-rank (paired)
Direction: A < C
H₂ — Quality parity or improvement *
Condition A's oracle test pass rate is ≥ Condition C's on the same tasks.
Metric: oracle_pass_rate (fraction of oracle tests passing on final output)
Test: Wilcoxon signed-rank (paired)
Direction: A ≥ C
H₃ — VDD gate isolation
Condition B (directed, no gates) uses fewer tokens than Condition C (vanilla), independent of the VDD gates. Tests whether the structured workflow alone explains savings.
Metric: total_tokens per task
Test: Wilcoxon signed-rank (paired)
Note: If H₃ is true with effect size comparable to A vs C, gates are not the mechanism.
Secondary
H₄ — Retry reduction *
Condition A produces fewer LLM retry calls per successfully-completed task than Condition C.
Metric: retry_call_count (calls where retry_number > 0)
Direction: A < C
H₅ — Tier-stratified efficiency
The token efficiency advantage of A vs C is larger for Tier 3 tasks than Tier 1 tasks.
Metric: Per-tier median token delta (A − C)
Test: Mann-Whitney U on tier subgroups
Pre-registered direction: Tier 3 delta > Tier 1 delta (i.e., VDD helps more on complexity)
H₆ — Repair task amplification *
On repair tasks, the efficiency advantage of A vs C is larger than on generation tasks.
Metric: Token delta (A − C) by task type
Direction: repair delta > generation delta
H₇ — Prompt cache differential
The fraction of total input tokens served from prompt cache (`cache_read_tokens / total_input_tokens`) differs between Condition A and Condition C.
Test: Wilcoxon signed-rank (paired)
Note: If A has significantly higher cache fraction, some token savings are cache artifacts.
H₈ — Gate compute amortization
Total dollar cost per passing task (LLM tokens at API pricing + gate CPU at $0.048/cpu-hour)
is lower for Condition A than Condition C.
Metric: total_cost_usd including gate compute
Test: Wilcoxon signed-rank (paired)
Note: The CPU rate of $0.048/cpu-hour is the AWS c5.4xlarge on-demand rate, used as a reference price.
H₉ — Quality monotonicity
The oracle pass rate gap (A − C) increases from Tier 1 to Tier 3.
Metric: oracle_pass_rate_A − oracle_pass_rate_C by tier
H₁₀ — Cross-language replication
The direction of H₁ (A < C in tokens) holds for both Python and Rust subsets.
Test: Wilcoxon signed-rank within each language subset
Note: Magnitude need not be equal; direction must replicate for the claim to generalize.
H₁₁ — First-attempt success *
Condition A's rate of tasks where the first LLM generation call produces oracle-passing code is higher than Condition C's.
Metric: Fraction of tasks where retry_number = 0 at task completion
Direction: A > C
H₁₂ — Cost-per-passing-task
Cost per passing task (total_cost_usd / oracle_passing_tasks) is lower for A than C.
Complements H₁ by weighting for quality (cheap failed outputs don't count).
Test: Mann-Whitney U (not paired, because failed tasks don't have a pair)
5. Primary Metrics
Ordered by priority. The first two are primary; the rest are secondary.
cost_per_passing_task_usd — total API cost divided by the number of tasks where oracle_pass_rate = 1.0
oracle_pass_rate — fraction of oracle tests passing on final agent output, 0–1
total_tokens_per_task — input + output tokens per task
retry_call_count_per_task — LLM calls where retry_number > 0
mutation_survival_rate_on_output — measured post-hoc by running gate pipeline as oracle
cache_read_fraction — cache_read_tokens / total_input_tokens
gate_cpu_ms_per_task — sum of all gate CPU time (Condition A only)
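A minimal sketch of how these per-task metrics compose, assuming hypothetical record field names (the harness's actual schema may differ); the gate CPU rate is the $0.048/cpu-hour reference price from H₈:

```python
# Sketch of the §5 metric computations. Record keys (input_tokens,
# cache_read_tokens, gate_cpu_ms, ...) are assumed names for illustration.
GATE_CPU_USD_PER_MS = 0.048 / (3600 * 1000)  # H8 reference price per cpu-ms

def task_metrics(rec: dict) -> dict:
    total_input = rec["input_tokens"]
    gate_cost = rec.get("gate_cpu_ms", 0) * GATE_CPU_USD_PER_MS
    return {
        "total_tokens_per_task": total_input + rec["output_tokens"],
        "cache_read_fraction": rec["cache_read_tokens"] / total_input,
        "total_cost_usd": rec["api_cost_usd"] + gate_cost,
    }

def cost_per_passing_task(records: list[dict]) -> float:
    # Total cost over ALL tasks, divided by tasks with a perfect oracle
    # pass rate -- cheap failed outputs do not count (cf. H12).
    passing = [r for r in records if r["oracle_pass_rate"] == 1.0]
    total = sum(task_metrics(r)["total_cost_usd"] for r in records)
    return total / len(passing)
```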
6. Statistical Analysis Plan
6.1 Primary test
Wilcoxon signed-rank test (paired): one-tailed for the starred directional hypotheses H₁/H₂/H₄/H₆; two-tailed for H₃/H₇/H₈.
α = 0.05 for all primary hypotheses.
Multiple comparison correction: Bonferroni adjustment across all 12 hypotheses (α_adjusted = 0.05 / 12 = 0.0042).
For hypotheses with secondary analyses on subgroups (H₅, H₉, H₁₀), the subgroup tests are exploratory and not Bonferroni-corrected; they are reported descriptively.
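The mechanics of the paired one-tailed test can be sketched as follows. This is illustrative only — the actual analysis would use scipy.stats.wilcoxon(a, b, alternative="less") — and it uses the normal approximation without zero/tie variance corrections:

```python
import math

def wilcoxon_one_tailed_less(a: list[float], b: list[float]) -> float:
    """Paired Wilcoxon signed-rank p-value for H1: a < b.

    Sketch only: normal approximation, no continuity or tie-variance
    correction. Real runs should use scipy.stats.wilcoxon.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    # Rank absolute differences, averaging ranks across ties.
    ordered = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # lower-tail P(W+ <= w)
```

Significance is then assessed against the Bonferroni-adjusted threshold of 0.05 / 12 ≈ 0.0042.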
6.2 Effect sizes
Report Cohen's d (for approximately normal distributions) or rank-biserial correlation (for Wilcoxon tests) for every hypothesis test. P-values alone are not reported without effect sizes.
6.3 Confidence intervals
Bootstrap 95% CIs (10,000 resamples) on all primary metrics. Percentile method.
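A minimal sketch of the percentile-method bootstrap described above, here for the median of a metric (function and parameter names are illustrative):

```python
import random
import statistics

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=20260404):
    """Percentile-method bootstrap CI for the median (cf. §6.3).

    Sketch only: resample with replacement, compute the statistic on
    each resample, and read off the empirical alpha/2 and 1-alpha/2
    percentiles of the resampled statistics.
    """
    rng = random.Random(seed)  # seeded for reproducibility (§8.2)
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = medians[int(alpha / 2 * n_resamples)]
    hi = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```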
6.4 Distribution checks
Shapiro-Wilk normality test on each metric before choosing parametric vs. non-parametric.
All token-count metrics are expected to be right-skewed (non-normal); use Wilcoxon.
Oracle pass rates are bounded [0,1]; use Wilcoxon.
6.5 Missing data and failures
Tasks where any condition results in AGENT_CRASHED, API_BUDGET_EXCEEDED, or REFUSED are excluded from paired tests for that metric but included in a failure-rate analysis.
Do not silently drop failures. Report failure counts per condition.
Rate-limit events (HTTP 429) are recorded. If rate-limit frequency differs significantly between conditions (>2× rate), flag as a potential confound and report separately.
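The >2× flag rule above might be implemented as follows. This is a sketch: the function name is hypothetical, and treating "some 429s vs none at all" as a flag is an added assumption, not part of the pre-registered rule:

```python
def rate_limit_confound(count_a: int, count_c: int) -> bool:
    """Flag a potential confound when one condition's HTTP 429 count
    exceeds twice the other's (the >2x rule of section 6.5).

    Sketch only. Assumption beyond the plan text: any 429s against
    zero in the other condition is also flagged, since the ratio is
    undefined there.
    """
    lo, hi = sorted((count_a, count_c))
    if lo == 0:
        return hi > 0
    return hi / lo > 2
```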
6.6 Pre-specified stopping rule
Data collection runs until all 50 tasks complete in all 3 conditions or the wall-clock budget of 72 hours is exhausted. Do not add tasks after examining partial results.
Do not re-run failed tasks in a way that selectively includes successes.
7. Sample Size Justification
Target: N = 50 paired tasks.
Power analysis for Wilcoxon signed-rank test:
α = 0.05 (two-tailed), power = 0.80
For medium effect (r = 0.30): N ≈ 33 pairs
For small-medium effect (r = 0.20): N ≈ 72 pairs
50 tasks provides >80% power to detect a medium effect. The hypothesized 35–40% token reduction corresponds to approximately r = 0.40–0.50 (large effect) if the variance across tasks is moderate, in which case N = 50 is well-powered.
If the true effect is small (r ≤ 0.20), N = 50 is underpowered. In that case, the analysis reports the observed effect size and CIs honestly and notes the power limitation.
A post-hoc sensitivity analysis will be reported: "given the observed effect size, what N would have been required for 80% power?"
8. Reproducibility Requirements
8.1 Code and environment
All containers run from pinned image digests (not floating tags).
Python environment defined by experiments/requirements.txt (pinned, hash-verified).
Rust toolchain pinned in experiments/tasks/rust-toolchain.toml.
cargo-mutants version pinned, since the generated mutant set changes between versions.
The harness checks git diff --quiet at startup and refuses to run on a dirty tree.
8.2 Randomness
All random seeds in experiments/config.yaml:
task_order_seed: randomizes task execution order within each condition
hypothesis_seed: base seed for Hypothesis strategies (per-task seed = base + task_index)
mutation_seed: base seed for mutation tool (cargo-mutants --jobs order)
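The per-task seed derivation is stated above (base + task_index); a sketch of it, plus an illustrative deterministic task-order shuffle (the shuffle algorithm itself is an assumption, not necessarily what the harness does):

```python
import random

def per_task_seed(base_seed: int, task_index: int) -> int:
    # Section 8.2: per-task Hypothesis seed = base seed + task index.
    return base_seed + task_index

def task_order(task_ids: list[str], task_order_seed: int) -> list[str]:
    # Illustrative deterministic shuffle driven by task_order_seed.
    rng = random.Random(task_order_seed)
    order = list(task_ids)
    rng.shuffle(order)
    return order
```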
8.3 Model pinning
The exact model ID string returned by the first API response is recorded in the manifest.
If it differs from the configured model ID (e.g., Anthropic silently updated the checkpoint), the run is flagged with a warning in the manifest.
8.4 API access requirement
**This experiment requires direct Anthropic API access (SDK) for accurate per-call token counting.** Claude CLI does not expose per-call token usage in a machine-readable form.
Condition A uses the IronCarapace Chainlink pipeline with TokenLedger instrumentation added to invoke_claude(). Conditions B and C use the anthropic Python SDK directly.
If running without API credits, set config.token_counting: estimated to fall back to character-count estimation (chars / 4.2). Estimated counts are clearly marked in all output and must not be used for the primary analysis — only for directional exploration.
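The fallback estimator is simple enough to state exactly (a sketch of the chars / 4.2 rule above; the function name is illustrative):

```python
def estimate_tokens(text: str, chars_per_token: float = 4.2) -> int:
    """Character-count token estimate (section 8.4 fallback).

    Used only when config.token_counting is "estimated"; such counts
    are marked in output and excluded from the primary analysis.
    """
    return round(len(text) / chars_per_token)
```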
8.5 Single-command reproduction
cd experiments
make run-experiment SEED=20260404 CONFIG=config.yaml
This command:
Verifies the git tree is clean and records the SHA
Writes the manifest
Runs all 50 tasks × 3 conditions
Runs oracle measurement on all outputs
Runs the statistical analysis
Writes results/report-<run_id>.md
No manual steps. No environment variables beyond those in config.yaml.
9. What Null Results Look Like
These outcomes are pre-specified as valid scientific findings, not failures:
| Outcome | Interpretation |
|---------|---------------|
| A ≈ C in tokens, A > C in quality | IronCarapace is a quality tool, not a cost tool. Pitch changes. |
| A > C in tokens (IronCarapace more expensive) | Gate overhead exceeds retry savings at these task sizes. Scope the claim to Tier 3+ tasks. |
| H₃ true, A ≈ B in tokens | Savings are from structured workflow, not deterministic gates. Reframe the mechanism claim. |
| H₁₀ fails in Rust | Gate efficiency is language-tooling-dependent. Python-specific claim only. |
| Ablation shows 1–2 gates drive 80% of savings | Report per-gate contribution. Feature prioritization follows from data. |
| Rate-limit confound detected | Inconclusive for affected metrics. Re-run with rate-limiting equalized. |
10. Amendments
Any deviation from this plan (additional analyses, excluded tasks, changed metrics) must be documented in experiments/AMENDMENTS.md with:
The SHA of the commit introducing the amendment
The reason for the amendment
Whether the amendment was made before or after seeing any task results
Amendments made after seeing results are permissible as exploratory analysis but must be clearly labeled as post-hoc and not included in the primary Bonferroni-adjusted hypothesis tests.