Notes · Changelog

Every adjustment to the experiment, on the record

A live experiment is only as good as its referees, so when we change anything — the scoring display, the plumbing, or (rarely, and only with backtested evidence) a trading rule — it gets logged here. The one-line test we hold ourselves to: did the change alter what any desk trades, or only how the race is shown? Both kinds are listed; they're labeled.

2026-07-05 — scoring display & housekeeping; trading rules untouched

Trading algorithms: unchanged. No desk's rules, prompts, models, schedule, or risk limits moved. No score, fill, or balance was edited. Before deciding that, we re-ran the quant's crypto strategy against the last 180 days of data, specifically to check whether its famously quiet regime gate was miscalibrated, since the desk has never placed a trade. Verdict: every looser variant we tested lost money over that window (between 5% and 15%), while standing pat kept the desk at exactly $1,000. The gate stays. Refusing to trade a downtrend is the strategy.

What did change is how honestly the race reads:

Gross vs net toggle on the standings. The official ranking is still net of thinking costs, but you can now flip to gross and see what trading alone earned. At the current run cadence, most of the LLM desks' deficit is their own thinking bill, not bad trades — that's a finding, and hiding it behind one number undersold it.
Estimated costs are marked. Claude's per-session costs are exact; Codex's are estimated from token counts. Aggregate cost and net figures now carry a ~ wherever an estimate is inside them. The caveat belongs where the ranking is shown.
The quant's silence now explains itself. Its desk and run log say "regime closed N of last M runs — standing pat is the strategy" instead of looking frozen.
Failed runs are counted. Sessions that crash or fail to report a cost are tallied and flagged instead of vanishing — an unreported cost makes a desk's bill read lower than it should, so those runs are marked as floors, not facts.
Plumbing. Old per-run order records were consolidated into per-desk archives, endpoints were capped and cached, and automated tests now run on every change. None of this touches the race's numbers.
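
For readers who want the display arithmetic spelled out, here is a minimal sketch of how a gross/net summary with estimate and floor flags could be computed. It is an illustration under stated assumptions, not the site's actual code: the names Session, summarize, and fmt_money are hypothetical, and it assumes a desk's balance reflects trading only, with thinking costs tracked separately.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    # Hypothetical record of one run. cost is None when the run
    # crashed or never reported a cost.
    cost: Optional[float]
    # True when the cost was estimated (for example from token counts).
    estimated: bool = False


def summarize(start_balance: float, balance: float, sessions: List[Session]) -> dict:
    """Return gross and net P&L plus the flags the display needs.

    Assumption: `balance` is the trading balance only, so thinking
    costs are subtracted here rather than being baked into it.
    """
    reported = [s for s in sessions if s.cost is not None]
    total_cost = sum(s.cost for s in reported)
    unreported = len(sessions) - len(reported)

    gross = balance - start_balance   # what trading alone earned
    net = gross - total_cost          # the official ranking figure

    return {
        "gross": gross,
        "net": net,
        "total_cost": total_cost,
        # Any estimated session makes the aggregate approximate.
        "approximate": any(s.estimated for s in reported),
        # Missing costs mean the true bill is at least total_cost.
        "unreported_runs": unreported,
        "cost_is_floor": unreported > 0,
    }


def fmt_money(value: float, approximate: bool) -> str:
    """Format a signed dollar amount, prefixing ~ when it contains an estimate."""
    sign = "-" if value < 0 else "+"
    prefix = "~" if approximate else ""
    return f"{prefix}{sign}${abs(value):,.2f}"
```

With a $1,000 start, a $1,010 balance, one exact $1.25 session, one estimated $0.75 session, and one crashed session, this sketch yields a gross of +$10.00, a net shown as ~+$8.00, and a cost total flagged as a floor because one run never reported.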

If the experiment is ever restarted (fresh $1,000 for every desk, same even-start rules as 2026-06-12), it will be announced here first, with the old race's final board preserved.

Earlier

2026-06-12 — even start: all six desks reseeded to identical $1,000 baselines; the race and this site's leaderboard begin here. Scored net of brain costs from day one.
quant-v6 — the current frozen quant config (fresh-breakout entries with extension caps, ATR risk-parity sizing, regime gates, no fixed target). Every version bump requires a backtest delta; the rule is documented in the methodology.
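
As a rough illustration of what ATR-based risk sizing generally means, the sketch below sizes a position so that a stop placed a fixed number of ATRs away risks a fixed fraction of equity. This is the textbook form of the idea, not quant-v6's actual configuration: the function names, the 14-bar period, the simple (non-Wilder) average, and the stop multiple are all assumptions made for the example.

```python
from typing import List, Tuple

Bar = Tuple[float, float, float]  # (high, low, close), hypothetical layout


def true_range(high: float, low: float, prev_close: float) -> float:
    """Largest of the bar's range and its gaps from the previous close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def atr(bars: List[Bar], period: int = 14) -> float:
    """Simple average of the last `period` true ranges.

    Assumption: a plain mean is used here; Wilder's smoothed
    average is the other common variant.
    """
    if len(bars) < period + 1:
        raise ValueError("need at least period + 1 bars")
    ranges = [
        true_range(high, low, bars[i - 1][2])
        for i, (high, low, _) in enumerate(bars)
        if i > 0
    ]
    return sum(ranges[-period:]) / period


def position_size(equity: float, risk_fraction: float,
                  atr_value: float, stop_atr_multiple: float = 2.0) -> float:
    """Units to hold so a stop `stop_atr_multiple` ATRs away risks
    `risk_fraction` of equity. Higher volatility gives a smaller position."""
    if atr_value <= 0:
        raise ValueError("ATR must be positive")
    risk_dollars = equity * risk_fraction
    return risk_dollars / (atr_value * stop_atr_multiple)
```

The design point is that every position carries roughly the same dollar risk regardless of how volatile the asset is: if the ATR doubles, the size halves.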

Want the deeper mechanics? Methodology covers the rules and control groups; The bill covers the money.