How a coupling earns its place.
The engine does not trust itself. It does not trust any single model. It does not trust a clever algebraic match. Every claim must survive four independent levels of evidence and a four-phase peer review modelled directly on academic practice.
The thesis
The difference between a coincidence and a genuine cross-domain coupling is executable realization. A real coupling admits a physical system in which both equations hold simultaneously, on the same substrate, and produce the same measurable answer. Newton plus Hooke is real because mass–spring oscillators exist in laboratories and the engine can simulate one and compare against a measured period. Newton plus Shannon entropy is not real because no laboratory object obeys both equations at once. Every prior attempt at cross-domain discovery — SINDy, AI Feynman, PySR, SemGen, DARPA SKEMA — argued about symbols and ignored execution. This engine runs experiments.
Evidence hierarchy
Every coupling hypothesis accumulates evidence at four levels, each stricter than the last. Confidence is multiplicative — any level failing drags the whole down.
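The multiplicative rule can be sketched in a few lines. This is an illustrative reading of the rule, not the engine's actual scoring code; the level names and scores below are hypothetical.

```python
def combined_confidence(level_scores):
    """Multiply per-level scores in [0, 1]; a single failed level (0.0) zeroes the total."""
    conf = 1.0
    for score in level_scores.values():
        conf *= score
    return conf

# Hypothetical per-level scores: L4 has failed, so the product collapses to zero
# no matter how strong the other three levels are.
scores = {"L1_canonical": 0.95, "L2_notebook": 0.90, "L3_citation": 0.80, "L4_sensor": 0.0}
```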
Canonical problem
Each equation in the corpus has a real problem with a known reference value. The equation passes L1 only if the problem's pytest unit test executes and the computed answer matches the reference within declared tolerance.
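An L1 test of this shape can be sketched with the mass–spring example from the thesis. The reference value, parameter values, and tolerance below are illustrative, not taken from the engine's corpus.

```python
import math

def mass_spring_period(m, k):
    """Small-oscillation period of a mass-spring oscillator: T = 2*pi*sqrt(m/k)."""
    return 2 * math.pi * math.sqrt(m / k)

def test_mass_spring_canonical():
    # Hypothetical board-sourced reference: m = 1.0 kg, k = 4.0 N/m gives T = pi seconds.
    reference, tolerance = math.pi, 1e-9
    assert abs(mass_spring_period(1.0, 4.0) - reference) < tolerance
```

Run under pytest, the equation passes L1 only when the assertion holds within the declared tolerance.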
Composite notebook
When two equations are coupled, a real Jupyter notebook solves the composite system numerically and asserts against a board-sourced reference value. The notebook runs in a sandboxed container with no internet, a pinned image SHA, and a content-hashed ledger entry. No hidden human-in-the-loop.
Literature citation
At least two of the three AI reviewers must produce independent primary-source citations to the coupling from established textbooks or peer-reviewed papers. Shared-source or paraphrased citations do not count — DOI-level independence is required.
Live sensor agreement
The strongest level. A real physical sensor (phone IMU, weather station, smart meter, market feed, or quantum computer) produces measurements that match the composite prediction within a declared noise envelope, with Bayesian posterior updates over a rolling window.
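One simple way to realize a posterior over a rolling window is a Beta-Binomial model on in-envelope vs. out-of-envelope readings. The class below is a minimal sketch under that assumption; the engine's actual sensor model, window size, and prior are not specified in this document.

```python
from collections import deque

class RollingPosterior:
    """Beta-Binomial posterior over P(sensor agrees with the composite prediction),
    updated from a rolling window of pass/fail noise-envelope checks."""
    def __init__(self, window=100, alpha0=1.0, beta0=1.0):
        self.window = deque(maxlen=window)       # recent agreement outcomes
        self.alpha0, self.beta0 = alpha0, beta0  # uniform Beta(1, 1) prior

    def update(self, residual, noise_envelope):
        # A reading "agrees" when |prediction - measurement| sits inside the envelope.
        self.window.append(abs(residual) <= noise_envelope)

    def posterior_mean(self):
        hits, n = sum(self.window), len(self.window)
        return (self.alpha0 + hits) / (self.alpha0 + self.beta0 + n)

p = RollingPosterior(window=50)
for residual in [0.01, -0.02, 0.03, 0.50]:  # illustrative residuals; envelope = 0.05
    p.update(residual, 0.05)
# 3 of 4 readings fall inside the envelope, so the posterior mean is (1 + 3) / (2 + 4).
```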
Promotion thresholds
Labels are strictly monotonic in rigor. A coupling only reaches the higher tiers by passing more evidence levels and a stricter consensus gate.
| Label | Confidence | Required levels | Board gate |
|---|---|---|---|
| CONJECTURAL | C < 0.30 | any | — |
| EMPIRICAL | C ≥ 0.30 | L1 + L2 | ≥ 2 of 3 approve |
| PROVED | C ≥ 0.60 | L1 + L2 + L3 | unanimous approve |
| GROUNDED | C ≥ 0.85 | L1 + L2 + L3 + L4 | unanimous + live sensor |
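The thresholds table can be read as a single promotion function. The sketch below is one plausible encoding of the table; argument names and the seat count default are illustrative.

```python
def promotion_label(confidence, levels_passed, approvals, live_sensor=False, seats=3):
    """Map evidence to the strictest label whose gates all pass, per the thresholds table."""
    levels = set(levels_passed)
    unanimous = approvals == seats
    if confidence >= 0.85 and levels >= {"L1", "L2", "L3", "L4"} and unanimous and live_sensor:
        return "GROUNDED"
    if confidence >= 0.60 and levels >= {"L1", "L2", "L3"} and unanimous:
        return "PROVED"
    if confidence >= 0.30 and levels >= {"L1", "L2"} and approvals >= 2:
        return "EMPIRICAL"
    return "CONJECTURAL"
```

Because each gate checks confidence, evidence levels, and board approvals together, a coupling can never skip a tier by excelling on one axis alone.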
A sidebar class DUALITY applies to mathematical isomorphisms such as Wick rotation or log-price substitution. These cannot be promoted past EMPIRICAL on symbolic evidence alone — they need independent physical sensor agreement from both domains. Mathematical equivalence is not a physical coupling.
The four-phase review protocol
Every coupling, and every architectural decision, is reviewed by a board that mirrors real academic peer review. The protocol runs in four explicit phases and produces a binding editorial decision. The three core reviewers are Claude Opus 4.6, OpenAI GPT-5, and Gemini 2.5 Pro, each assigned a specialist persona matched to the topic under review. Grok-4 (xAI) is enrolled as an optional fourth seat that is included on high-stakes governance rounds — architecture design reviews, scope-frozen discharge rounds, brand board reviews — and omitted on operational lookups such as canonical-problem reference sourcing. Whether a given round ran as a 3-seat or 4-seat board is recorded in its ledger entry.
Independent draft
Each reviewer receives the submission in complete isolation. No cross-contamination. Every reply begins with a conflict-of-interest self-declaration (did this reviewer encounter the specific formulation during training?), then a five-dimension rubric score, then a draft verdict with strengths and concerns.
Informed vote
The Phase A drafts from every seated reviewer are consolidated into a board briefing document showing each reviewer what the others said. Each reviewer then produces a final updated verdict, explicitly responding to peer concerns — agreeing, disagreeing, or updating their position. This is the phase that breaks correlated blind spots.
Editor synthesis
A separate model call, distinct from every seated reviewer, reads all the Phase B verdicts and writes the binding editorial decision. The editor weights reasoning quality, not vote count, and can override a naive tally when one reviewer's argument is visibly stronger. Reviewer disagreement on fundamentals is explicitly flagged in the ledger.
Bounded rebuttal
If the editor returns MODIFY with specific required changes, the submitter is allowed one revision cycle to address them, and the board runs again on the revision. A maximum of two rebuttal cycles are allowed before the decision becomes final. This mirrors the author-response phase standard at CS conferences.
The rubric
Every reviewer produces a structured five-dimension score on a 1–5 integer scale. For coupling hypotheses the dimensions are:
- Physical plausibility — does this coupling describe real physics?
- Dimensional consistency — are the transducer dimensions honest?
- Literature grounding — is there independent primary-source support?
- Falsifiability — can the composite be executed and compared against a reference?
- Novelty / emergent insight — does the coupling produce a new dimensionless group or conserved quantity that neither equation exposes alone?
Rubric totals are computed by the editor as weighted sums. Two reviewers scoring a coupling 5 across the board carry more weight in the editor's decision than a single reviewer scoring 1s without reasoning.
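One way to make "reasoning quality, not vote count" concrete is a weight attached to each review by the editor. The sketch below assumes such a weight exists; both the field names and the numbers are illustrative, since the document does not specify the editor's weighting scheme.

```python
def editor_weighted_total(reviews):
    """Combine rubric totals weighted by an editor-assigned reasoning-quality
    weight in [0, 1], so a well-argued review outweighs an unreasoned one."""
    weight_sum = sum(r["reasoning_weight"] for r in reviews)
    if weight_sum == 0:
        return 0.0
    return sum(r["rubric_total"] * r["reasoning_weight"] for r in reviews) / weight_sum

reviews = [
    {"reviewer": "A", "rubric_total": 25, "reasoning_weight": 0.9},  # 5s with a strong argument
    {"reviewer": "B", "rubric_total": 25, "reasoning_weight": 0.8},
    {"reviewer": "C", "rubric_total": 5,  "reasoning_weight": 0.1},  # 1s without reasoning
]
```

Under this scheme the lone unreasoned dissent barely moves the total, matching the intent described above.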
Conflict of interest
At the start of every Phase A reply, each reviewer declares whether it encountered the specific formulation during training. A declared COI does not disqualify the reviewer — all four models share some training on foundational physics literature — but the declaration is logged. If every reviewer seated in the round declares training exposure to an exact formulation, the coupling is flagged with HIGH_COI_CORRELATION_RISK and cannot be promoted past EMPIRICAL without independent experimental evidence at L4.
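The flag-and-cap rule lends itself to a small policy function. This is a sketch of one possible encoding; the label ordering follows the thresholds table, and the function signature is an assumption.

```python
def apply_coi_policy(declarations, proposed_label, l4_passed):
    """If every seated reviewer declares training exposure to the exact
    formulation, flag the round and cap promotion at EMPIRICAL unless
    independent L4 experimental evidence exists."""
    order = ["CONJECTURAL", "EMPIRICAL", "PROVED", "GROUNDED"]
    if declarations and all(declarations.values()) and not l4_passed:
        capped = order[min(order.index(proposed_label), order.index("EMPIRICAL"))]
        return capped, ["HIGH_COI_CORRELATION_RISK"]
    return proposed_label, []
```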
Three-way verification of board-sourced references
Reference values for canonical problems come from the board, not from the architect. The board is asked for {formula_symbolic, substitutions, value, citation} and the architect then runs three independent checks locally before accepting the value:
- Symbolic equivalence — sp.simplify(board_formula - local_formula) == 0. A stronger guarantee than any single-point numeric check.
- Property-based randomized sampling — 50 unit-consistent parameter draws, all of which must satisfy the board's claimed relationship within tolerance.
- Cross-CAS high precision — mpmath at 50 decimal places vs sympy at default precision. Disagreement beyond 10 decimal places flags the formula as numerically fragile.
All three must pass. A single-point agreement alone is insufficient. This breaks the circularity risk — the architect's local code is an independent check that does not depend on any model's training data.
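The property-based sampling check is the easiest of the three to illustrate without a CAS. The sketch below uses the mass–spring period as a toy formula pair and plain floating point in place of the engine's sympy/mpmath stack; the unit-consistency filtering is omitted.

```python
import math
import random

def property_check(board_formula, local_formula, n_draws=50, rtol=1e-9, seed=0):
    """Randomized parameter draws: the board's formula and the locally derived
    one must agree within tolerance on every draw, not just at one point."""
    rng = random.Random(seed)
    for _ in range(n_draws):
        m, k = rng.uniform(0.1, 10.0), rng.uniform(0.1, 10.0)
        if not math.isclose(board_formula(m, k), local_formula(m, k), rel_tol=rtol):
            return False
    return True

# Toy example: the board claims T = 2*pi*sqrt(m/k); the local derivation uses
# the algebraically equivalent 2*pi*(m/k)**0.5, so all 50 draws should agree.
board = lambda m, k: 2 * math.pi * math.sqrt(m / k)
local = lambda m, k: 2 * math.pi * (m / k) ** 0.5
```

A subtly wrong board formula (say, dropping the square root) fails on the very first draw, which is exactly why single-point agreement is insufficient.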
The append-only ledger
Every hypothesis, every rejection, every board vote, every sensor reading, every composite execution result is recorded in an append-only JSONL file with a SHA-256 hash chain. Nothing is deleted. The ledger is the single source of truth for what the engine has ever tried and what happened. A future audit can re-derive every promotion decision from its raw evidence.
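A hash-chained JSONL ledger of this kind can be sketched with the standard library alone. The record layout below is an assumption for illustration; only the chaining idea (each record hashes its predecessor's hash) is taken from the text.

```python
import hashlib
import json
import os
import tempfile

def append_entry(ledger_path, entry, prev_hash):
    """Append one JSONL record whose SHA-256 chains to the previous record."""
    record = {"prev": prev_hash, "entry": entry}
    payload = json.dumps(record, sort_keys=True)
    record_hash = hashlib.sha256(payload.encode()).hexdigest()
    with open(ledger_path, "a") as f:
        f.write(json.dumps({**record, "hash": record_hash}) + "\n")
    return record_hash

def verify_chain(ledger_path):
    """Recompute every hash in order; any edit, insertion, or deletion breaks the chain."""
    prev = "GENESIS"
    for line in open(ledger_path):
        rec = json.loads(line)
        payload = json.dumps({"prev": rec["prev"], "entry": rec["entry"]}, sort_keys=True)
        if rec["prev"] != prev or rec["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = rec["hash"]
    return True

# Example: two chained entries, then a full verification pass.
fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
h = append_entry(path, {"event": "hypothesis_submitted"}, "GENESIS")
h = append_entry(path, {"event": "board_vote", "verdict": "approve"}, h)
```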
Sandboxed notebook execution
Every auto-generated composite notebook runs inside a Docker container with pinned image SHA, frozen pip requirements with hashes, no network access, read-only root filesystem, an ephemeral 64 MB tmpfs workspace, non-root UID, memory and CPU cgroup limits, and a per-cell execution timeout. The container's SHA and the notebook's content hash are both logged in the ledger, so the execution environment is reproducible and tamper-evident.
Academic analogs
The four-phase protocol is not invented — it mirrors standard peer review. For the record, the closest academic correspondences:
- Phase A independent drafts — standard practice at Nature, Science, NeurIPS, ICML, ICLR, PRL.
- Phase B informed response — CS conference rebuttal rounds (NeurIPS, ICML, ICLR, CVPR).
- Phase C editor synthesis — handling editors at journals, area chairs at conferences.
- Phase D bounded rebuttal — author response phase, limited to one or two cycles.
- COI self-declaration — mandatory at every legitimate venue.
- Specialist reviewer assignment — universal editorial practice.
- Scoring rubrics — standard at CS conferences; emerging at journals.
- Pre-registration of hypotheses — Nosek et al. 2018, the preregistration revolution.
v5 rules added after Round 2 discharge
The board's discharge round produced four additional binding rules beyond the v4 FCS fixes, each closing a specific structural loophole the engine hit in practice:
Board is not a CAS
The first Phase 12 canonical run caught a language-model arithmetic error: correct formula, correct citations, imprecise decimal. Never ask the board for a precision-critical numerical value. Query for formula and citations; compute the number locally with sympy + mpmath + python-flint. The board's value_attempt is logged as capability data but never load-bearing.
All must respond
Non-response from any board member (HTTP error, timeout, parse failure) counts as dissent, not neutral. Rounds cannot approve with partial quorum. Every model call has 5-retry exponential backoff (2→4→8→16→32 s). Persistent non-response escalates to the user with a Decision Matrix of retry / proceed-with-acknowledged-dissent / wait-for-vendor.
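The retry schedule above can be sketched as a small wrapper. The function name and the injectable sleep parameter are illustrative; only the 5-retry 2→4→8→16→32 s schedule and the dissent-on-exhaustion behavior come from the rule.

```python
import time

def call_with_backoff(call, retries=5, base_delay=2.0, sleep=time.sleep):
    """Initial attempt plus up to 5 retries with exponential backoff
    (2, 4, 8, 16, 32 s). Exhausted retries return None, which the
    round must count as dissent, never as a neutral abstention."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception:
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    return None
```

Injecting `sleep` keeps the schedule testable without real waits.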
Scope-frozen discharge
After Round 1 produces the Frozen Concern Set, Round 2+ reviewers can only score each FCS item as addressed / partially / not-addressed — they cannot raise new concerns. Process violations are logged and escalated to the user as the only authority that can reopen the FCS. This breaks the infinite-review-drift that would otherwise keep the architecture perpetually in flight.
No corner cuts on failure
On any verification failure, the Failure Investigation Protocol (§11h) fires: state audit, assumption audit, literature RAG, board failure review, ledger entry. Tolerance loosening is forbidden. Equation modifications require explicit user approval on a board equation_inadequate verdict. The default hypothesis when theory and experiment disagree is that the experiment is incomplete — not that the theory is wrong.
The Nobel-prize bar
The engine's output standard is external academic peer review, not internal board agreement. Every PROVED or GROUNDED coupling must pass an academic readiness checklist before promotion:
- Quantitative prediction with declared uncertainty
- Systematic error analysis
- Independent replication pathway with listed equipment and parameter ranges
- Priority / novelty search with ≥10 primary-source citations (not just the 3 the RAG finds)
- Honest limitations section — longer than the strengths section by default
- Multiple-testing correction across the adversarial pre-registration set
- Prior-work engagement including adversarial citations
- Hall-of-mirrors check — the claim must be restatable outside our framework (portable to Mathematica or Modelica)
A coupling missing any checklist item stays at EMPIRICAL regardless of board consensus. Internal board agreement is necessary but not sufficient.