white paper · v1.0 · 2026 may 26
Determinism under load.
A working definition of audit-grade inference, and what it costs.
An audit is only as deep as its replays.
engineering folklore, attribution unclear
Determinism, as a property of a system, is easy to define and hard to keep. A system is deterministic when the same input set, under the same configuration, produces the same output. The challenge is the second half — the same configuration — because configuration is the part that drifts.
Drift can be enumerated. Library upgrades, kernel patches, hardware substitutions, schema additions, prompt edits, the occasional cosmic ray. Each is a transition that, on a long enough timeline, will change an output you swore last week was settled. The question is not whether your system drifts (it does) but whether you can prove what it did, and when.
We exclude here genuinely non-deterministic operations: hardware RNG taps, wall-clock reads, network jitter. Those are out-of-scope by construction: an audit-grade system either avoids them or vendors them through a seeded interface.
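One way to read "vendors them through a seeded interface" is sketched below. This is a minimal illustration, not the design described in this paper: the class `SeededEnv`, its fields, and the helper `draw_samples` are hypothetical names, and the assumption is that inference code takes randomness and time only from this object, with the seed and timestamp recorded alongside the trace.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SeededEnv:
    """Hypothetical wrapper: the only source of randomness and time the
    inference code is allowed to touch."""
    seed: int
    frozen_time: float  # epoch seconds, assumed to be recorded with the trace
    _rng: random.Random = field(init=False, repr=False)

    def __post_init__(self) -> None:
        # A private generator, so the global random state is never involved.
        self._rng = random.Random(self.seed)

    def random(self) -> float:
        return self._rng.random()

    def now(self) -> float:
        # Replays see the original timestamp, never the wall clock.
        return self.frozen_time


def draw_samples(env: SeededEnv, n: int) -> list[float]:
    """Stand-in for any step that would otherwise call random or time directly."""
    return [env.random() for _ in range(n)]
```

Two environments built from the same seed and timestamp yield the same sequence, which is what lets a replay reproduce the sampling decisions of the original run.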
What we mean by audit-grade
Audit-grade is not the same as bit-stable. A bit-stable system produces identical output for identical input across deploys. An audit-grade system produces output whose provenance can be reconstructed from a signed trace, whether or not the bits match.
This distinction matters because bit-stability is impossible past certain hardware boundaries. FP determinism on a single-vendor GPU is one thing; across mixed accelerators it is fiction. Audit-grade is achievable; bit-stability is a strict subset.
The difference matters because real workloads make bit-stability prohibitively expensive — and frequently impossible — past a certain hardware boundary. Audit-grade asks a weaker question and answers it well: can you, at any later time, produce the trace that explains this output?
The signed-trace primitive
The atom is a signed execution trace. Every inference produces one:
```json
{
  "trace_id": "01HX8K3RYV5N7K2QZ9F8X4M6B0",
  "input_hash": "sha256:7a3f…",
  "config_hash": "sha256:c2e1…",
  "model": "claude-sonnet-4.5@2025-09-15",
  "tool_calls": [ … ],
  "output": "…",
  "signed_by": "tee:nv-h100-prod-04",
  "signature": "ed25519:…"
}
```
The signature binds output to input + config at the time of execution. Replay produces a new trace; the audit question is whether the new trace agrees with the original on the fields the spec marks as meaningful.
Whether a field is "meaningful" is a policy decision, not a technical one. The infrastructure produces all fields; the spec says which fields must match on replay and which are permitted to drift.
Trace IDs use ULIDs. Sortable by time, collision-resistant under load.
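The binding of output to input and config can be sketched with the standard library alone. Two assumptions are made loudly here. First, the hashes are taken over a canonical JSON encoding (sorted keys, compact separators); the paper does not say how `input_hash` and `config_hash` are computed. Second, HMAC-SHA256 stands in for the ed25519 signature shown in the trace, because Python's standard library has no ed25519 implementation; a real deployment would use an asymmetric signature so that verifiers never hold the signing key. The function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_hash(obj) -> str:
    """sha256 over a canonical JSON encoding, so key order cannot change the hash."""
    encoded = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return "sha256:" + hashlib.sha256(encoded).hexdigest()


def sign_trace(trace: dict, key: bytes) -> dict:
    """Return a copy of the trace with a signature over every other field.

    HMAC-SHA256 is a stand-in for ed25519 (an assumption of this sketch).
    """
    body = {k: v for k, v in trace.items() if k != "signature"}
    digest = hmac.new(
        key, canonical_hash(body).encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return {**body, "signature": "hmac-sha256:" + digest}


def verify_trace(trace: dict, key: bytes) -> bool:
    """Recompute the signature from the trace body and compare in constant time."""
    expected = sign_trace(trace, key)["signature"]
    return hmac.compare_digest(expected, trace.get("signature", ""))


def build_trace(trace_id: str, inputs: dict, config: dict, output: str, key: bytes) -> dict:
    """Assemble and sign a trace with a subset of the fields shown above."""
    return sign_trace(
        {
            "trace_id": trace_id,
            "input_hash": canonical_hash(inputs),
            "config_hash": canonical_hash(config),
            "output": output,
        },
        key,
    )
```

Because the signature covers the hashes and the output together, editing any one of them after the fact makes `verify_trace` return `False`.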
This is a quietly powerful inversion. We stop asking the system to be deterministic and instead ask it to be legible. Legibility is something you can engineer; determinism past a certain scale is something you can only approximate.
Replay under drift
Replay is the operation that converts a stored trace into a fresh trace and compares them. The comparison is field-wise against a policy.
In practice, the comparator is the entire product. A strict comparator that marks everything meaningful gives you the bit-stability problem in a hat. A permissive comparator that marks nothing meaningful gives you the comfort of a green checkmark with none of the substance. Building the right comparator is the design work.
We have settled, empirically, on three tiers: identity (output must match exactly), equivalence (output must satisfy a structural predicate: same JSON schema, same tool sequence, etc.), and witnessed (output must be derivable from the same intermediate states under the spec's permitted transformations).
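A field-wise comparison against a policy might look like the sketch below. It is illustrative only: the policy is modeled as a mapping from field name to predicate, fields absent from the policy are treated as permitted to drift, and only the identity and equivalence tiers are shown. The witnessed tier needs intermediate states that this sketch does not model. All function names and the shape of a tool call (`{"name": ..., "args": ...}`) are assumptions.

```python
from typing import Callable

# A predicate decides whether an original and a replayed field value agree.
Predicate = Callable[[object, object], bool]


def identity(original, replayed) -> bool:
    """Identity tier: the values must match exactly."""
    return original == replayed


def same_tool_sequence(original, replayed) -> bool:
    """Equivalence-tier example: same tool names in the same order;
    arguments are allowed to differ."""
    return [c["name"] for c in original] == [c["name"] for c in replayed]


def compare(original: dict, replayed: dict, policy: dict[str, Predicate]) -> dict:
    """Compare two traces field by field and report which drift is meaningful."""
    meaningful, incidental = [], []
    for name in sorted(set(original) | set(replayed)):
        present_in_both = name in original and name in replayed
        if name in policy:
            # A policy field that is missing or fails its predicate is meaningful drift.
            if not present_in_both or not policy[name](original[name], replayed[name]):
                meaningful.append(name)
        elif original.get(name) != replayed.get(name):
            # Unlisted fields may drift; record them without failing the replay.
            incidental.append(name)
    return {
        "pass": not meaningful,
        "meaningful_drift": meaningful,
        "incidental_drift": incidental,
    }
```

An empty policy reproduces the permissive failure described above: every replay passes. A policy that applies `identity` to every field reproduces the strict one.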
What this costs
Audit-grade inference is not free. The honest numbers, from our own deployment:
| measure | overhead | notes |
|---|---|---|
| inference latency | +6–9% | trace assembly + signing |
| storage | ~4 KB / trace | compressed, before tool I/O |
| build time | +30s | config-hash computation |
| eng time | the largest cost | comparator + policy design |
The first three are bounded. The fourth is unbounded and is the real reason most systems don't have audit-grade traces: not the latency, not the storage, but the work of writing down which fields matter and defending that policy under pressure.
Storage scales linearly with throughput, of course. At 10 RPS sustained, you accumulate roughly 100 GB of traces per month. Cold-storage them; the access pattern is "investigate one trace from six months ago," not "scan everything weekly."
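The storage figure follows directly from the table. A small check of the arithmetic, assuming a 30-day month, decimal units (1 GB = 1,000,000 KB), and the ~4 KB per trace quoted above; the function name is illustrative:

```python
SECONDS_PER_DAY = 86_400


def monthly_trace_storage_gb(rps: float, kb_per_trace: float = 4.0, days: int = 30) -> float:
    """Estimate trace storage per month in GB, excluding tool I/O."""
    traces = rps * SECONDS_PER_DAY * days
    return traces * kb_per_trace / 1_000_000  # KB to GB, decimal units


# 10 requests/s * 86,400 s/day * 30 days = 25,920,000 traces,
# and 25,920,000 traces * 4 KB is about 103.7 GB.
estimate = monthly_trace_storage_gb(10)
```

Since the estimate is linear in throughput, 100 RPS sustained would land near 1 TB per month under the same assumptions.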
A working definition
Pulling the thread back:
A system is audit-grade if every execution produces a signed trace, and if a replay-policy comparator can derive, from any stored trace plus the current build, a verdict that distinguishes meaningful drift from incidental drift.
Three load-bearing words. Signed — the trace is binding under the signing authority. Replay-policy — the spec is written down and versioned. Comparator — the verdict is mechanical, not editorial.
The system that ships under this definition is slower, more expensive, and considerably more deployable into regulated environments than its bit-stable cousin. We think this is the right tradeoff. We think it generalizes.
It is also, in our experience, the system that produces the better engineering culture around it. The discipline of writing down what counts as meaningful drift is the discipline of writing down what your system actually claims to do.
The remainder of this work expands each of the three pieces — the signing authority, the policy schema, the comparator algebra — and reports failure modes from the deployments that have been wearing the design for a year.