donto substrate
A claim substrate for the age of generative abundance

Let models say everything.
Let reality decide what survives.

donto is a bitemporal, paraconsistent, evidence-first claim substrate, written in Rust on Postgres. Modern models emit an unbounded firehose of typed claims about anything — inventing the predicates as they go, for a fraction of a cent each. donto holds that firehose without collapsing it: contradictions are legal state, every claim is anchored to its source, and typing, alignment and identity are deferred to query time. A knowledge base that grows in all directions and prunes by reality.

The living substrate
live
  • 41.6M+ statements
  • 1.0M+ freely-minted predicates
  • 66,777 contexts
  • 2.5M evidence links
  • 6.4M contested claim-pairs (held, not deleted)
  • 341 consumer namespaces
  • 2,433 argument edges
Predicates the models invented — highest volume right now
  • ex:knownAs: 1.1M
  • mem:episodic/chunk: 485K
  • ex:knownAtLocation: 331K
  • ex:normalized_claims/text_span: 278K
  • ex:datePrecision: 277K
  • ex:whenText: 245K
  • ex:meta/description: 164K
  • ex:locatedIn: 156K
The thesis

Generative abundance changes the shape of the problem

For fifty years, knowledge graphs were scarce: every triple was expensive to author, so the schema came first and the facts came slowly. Large models invert that economy — GPTKB pulled 105M typed triples out of one mid-tier model at roughly $0.00009 per claim; AutoSchemaKG built a 900M-node graph with no predefined schema at all; and the cost keeps falling roughly an order of magnitude a year. The bottleneck is no longer extraction. It's trust.

01

Emit freely

No fixed schema, no pre-typed predicates. Models invent the predicate they need (~1M minted so far) and assert in every direction — that's the signature of abundance, not a bug to suppress.

02

Defer to query time

Typing, alignment, identity resolution and joining are not write-time gates. They're query-time judgments — composed on demand, reversible, never destroying the raw claim.

03

Prune by reality

The substrate stores contradiction paraconsistently and lets a claim's evidence, corroboration and lifecycle decide its standing. Reality is the verifier — not a curator.

The moat

The claim lifecycle — not the schema — is the product

Anyone can extract facts. The durable advantage is what happens after: how a claim earns or loses standing over time, with its evidence intact and its contradictions preserved. donto is built around an eight-step lifecycle.

  1. Ingest

    Pull any source: a paper, a deed, a resume, a conversation. Register it as an immutable, content-addressed document the substrate can always retrieve.

  2. Emit free

    The model emits unbounded claims, inventing predicates and axes as it goes. The only write-time invariant: an evidence anchor, or an explicit hypothesis flag.

  3. Hold incompatible

    Contradictory claims are stored side-by-side as legal bitemporal state. A conflict is data, not a failed write.

  4. Hypothesize

    Typed relationship hypotheses are proposed where lenses intersect: connections no schema author would have pre-typed.

  5. Attach evidence

    Supports, rebuts, undercuts, qualifies: argument edges carry evidence and counter-evidence with full provenance.

  6. Rank

    Claims are scored by value (information gain, novelty, downstream task-lift), not by accuracy alone.

  7. Re-rank in time

    New evidence re-scores old hypotheses. The bitemporal record means standing compounds; nothing is frozen at write time.

  8. Explain

    Answers are explained only from evidence already attached: faithful by construction, never narrative first.
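
To make the write path of steps 2 and 3 concrete, here is a minimal in-memory sketch, not donto's actual API. The names `Basis`, `Claim`, `Store`, `assert_claim` and `contested_pairs` are hypothetical, and it assumes the simplest possible notion of conflict: two claims with the same subject and predicate but different objects.

```rust
use std::collections::HashMap;

/// Why a claim is allowed into the store (hypothetical type, not donto's API).
#[derive(Debug, Clone, PartialEq)]
enum Basis {
    /// Anchored to a byte range in a content-addressed document.
    Evidence { doc_hash: String, start: usize, end: usize },
    /// Explicitly flagged as an unanchored hypothesis.
    Hypothesis,
}

#[derive(Debug, Clone, PartialEq)]
struct Claim {
    subject: String,
    predicate: String,
    object: String,
    basis: Option<Basis>,
}

#[derive(Default)]
struct Store {
    /// Append-only log; nothing is overwritten or removed.
    log: Vec<Claim>,
}

impl Store {
    /// The single write-time gate: an evidence anchor or a hypothesis flag.
    /// A claim that contradicts an earlier one is still accepted.
    fn assert_claim(&mut self, claim: Claim) -> Result<usize, String> {
        if claim.basis.is_none() {
            return Err("claim has neither evidence nor a hypothesis flag".to_string());
        }
        self.log.push(claim);
        Ok(self.log.len() - 1)
    }

    /// Index pairs of claims that share subject and predicate but disagree
    /// on the object (assumption: this is what "contested" means here).
    fn contested_pairs(&self) -> Vec<(usize, usize)> {
        let mut by_key: HashMap<(&str, &str), Vec<usize>> = HashMap::new();
        for (i, c) in self.log.iter().enumerate() {
            by_key
                .entry((c.subject.as_str(), c.predicate.as_str()))
                .or_default()
                .push(i);
        }
        let mut pairs = Vec::new();
        for ids in by_key.values() {
            for (n, &a) in ids.iter().enumerate() {
                for &b in &ids[n + 1..] {
                    if self.log[a].object != self.log[b].object {
                        pairs.push((a, b));
                    }
                }
            }
        }
        pairs.sort();
        pairs
    }
}
```

The point of the sketch is the asymmetry: a missing basis is rejected at write time, while a disagreement is only surfaced at read time.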

Non-negotiable

Ten invariants, enforced in code

These aren't aspirations in a slide deck. Each invariant is enforced at the schema and API layer and guarded by 80 dedicated invariant test suites that run in CI — the substrate refuses to be a normal database.

  • I1. No claim without evidence, or an explicit hypothesis flag: the single write-time gate.
  • I2. No restricted source without policy: unknown policy defaults to restricted-pending-review, never public.
  • I3. No destructive overwrite: every correction, retraction and merge is append-only; any past state is reconstructable.
  • I4. Contradictions are preserved: incompatible claims produce argument edges and review obligations, not failed writes.
  • I5. Machine confidence is not maturity: a model's self-confidence can never promote a claim; standing is earned by evidence and review.
  • I6. Governance propagates to derivatives: derived claims inherit the most restrictive policy of their sources.
  • I7. Schema mappings are typed and scoped: no default 'sameness'; every alignment carries a relation type, scope and safety flags.
  • I8. Identity is a hypothesis, not a foreign key: same-person, same-place, same-concept are contested claims you query under a lens.
  • I9. Adapters must report information loss: any import or export that can't carry contradiction, time or governance says so, structurally.
  • I10. A release is a reproducible view: a named query plus policy, source and checksum manifests, never an ad-hoc export.
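
Invariants I2 and I6 are small enough to sketch directly. The following is an illustration under assumptions, not donto's Trust Kernel: the `Policy` enum, the `Internal` and `Restricted` levels, and the ordering between levels are hypothetical. Only "public" and "restricted-pending-review" come from the invariants themselves.

```rust
/// Governance levels ordered from least to most restrictive.
/// The level names and their order are assumptions for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Policy {
    Public,
    Internal,
    RestrictedPendingReview,
    Restricted,
}

/// I2: a source with no known policy is treated as restricted-pending-review.
fn effective(policy: Option<Policy>) -> Policy {
    policy.unwrap_or(Policy::RestrictedPendingReview)
}

/// I6: a derived claim inherits the most restrictive policy among its sources.
/// An empty source list is treated like an unknown policy.
fn derived_policy(sources: &[Option<Policy>]) -> Policy {
    sources
        .iter()
        .map(|p| effective(*p))
        .max()
        .unwrap_or(Policy::RestrictedPendingReview)
}
```

Deriving `Ord` on the enum makes "most restrictive" a plain `max`, so one unknown source is enough to pull a derived claim out of the public tier.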
Running today

The machinery is live

Every piece of the thesis has a running counterpart on the substrate — watchable in real time at scanner.donto.org.

The alignment engine

“Defer to query time” is shipped, not a slide. An embedding fabric — 919K predicate vectors and 338K entity fingerprints (bge-small, HNSW-indexed) — plus a continuous alignment daemon proposes, adjudicates and materializes a 1M-row predicate closure. killedBy meets murderedBy at cosine 0.95 without anyone maintaining a synonym table.

pgvector · 1.26M vectors · 1M closure rows · query-time folding
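
The core comparison behind predicate alignment is cosine similarity between embedding vectors. The sketch below is a brute-force stand-in that only shows the idea: the `expand` helper, the three-dimensional toy vectors and the threshold are assumptions, whereas the live system is described as using bge-small embeddings behind an HNSW index.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Names of vocabulary predicates whose embedding is at least `threshold`
/// similar to the query predicate's embedding (linear scan, no index).
fn expand<'a>(
    query: &[f32],
    vocab: &'a [(&'a str, Vec<f32>)],
    threshold: f32,
) -> Vec<&'a str> {
    vocab
        .iter()
        .filter(|(_, v)| cosine(query, v) >= threshold)
        .map(|(name, _)| *name)
        .collect()
}
```

Because the match is computed from vectors rather than looked up in a table, a newly minted predicate becomes findable as soon as it has an embedding.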

The always-on citer

Extraction (what was claimed) and anchoring (where in the source) are separate stages. Every extracted fact is post-processed by a semantic citer that attaches the exact evidence span — or honestly flags the claim as interpretation, never a bogus span. It separates what a source stated from what a model inferred, and doubles as a hallucination filter.

stated vs interpreted · zero bogus spans
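
The stated-versus-interpreted split can be illustrated with a deliberately simple citer. This sketch assumes exact substring matching, which is a stand-in for the semantic matching described above, and the `Anchor` type and `cite` function are hypothetical names.

```rust
/// Where a claim stands relative to its source (hypothetical type).
#[derive(Debug, PartialEq)]
enum Anchor {
    /// Byte offsets of the supporting text inside the source.
    Stated { start: usize, end: usize },
    /// No supporting span was found; the claim is the model's interpretation.
    Interpreted,
}

/// Anchor a claim to its quoted support, or flag it as interpretation.
/// A span is returned only when the source really contains the quote.
fn cite(source: &str, quote: &str) -> Anchor {
    let quote = quote.trim();
    if quote.is_empty() {
        return Anchor::Interpreted;
    }
    match source.find(quote) {
        Some(start) => Anchor::Stated { start, end: start + quote.len() },
        None => Anchor::Interpreted,
    }
}
```

The design choice worth noting is the fallback: when no span is found, the function returns `Interpreted` instead of guessing at offsets.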

The gleaning loop

Models stop early by choice, not capacity. The extraction harness re-prompts the same source until saturation — one article went from 511 facts to 3,227 — and stops only after consecutive dry passes, because a count floor makes models pad with garbage. Saturation decides done; meaningful coverage is the goal.

511 → 3,227 facts · saturation-stopped
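
A saturation-stopped loop of this kind might look like the sketch below. It is an assumption-laden outline rather than the real harness: `glean`, `dry_limit` and `max_passes` are hypothetical names, and the extractor is passed in as a closure so the control flow can be shown without a model call.

```rust
use std::collections::HashSet;

/// Re-run `extract_pass` on the same source until `dry_limit` consecutive
/// passes add no new fact. `max_passes` is a safety cap, not a quality target.
fn glean<F>(mut extract_pass: F, dry_limit: usize, max_passes: usize) -> Vec<String>
where
    F: FnMut(usize, &HashSet<String>) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut facts: Vec<String> = Vec::new();
    let mut dry = 0;
    for pass in 0..max_passes {
        if dry >= dry_limit {
            break;
        }
        // The extractor sees what is already known so it can look elsewhere.
        let candidates = extract_pass(pass, &seen);
        let mut added = 0;
        for fact in candidates {
            if seen.insert(fact.clone()) {
                facts.push(fact);
                added += 1;
            }
        }
        // A productive pass resets the counter; only consecutive dry passes stop.
        dry = if added == 0 { dry + 1 } else { 0 };
    }
    facts
}
```

There is no minimum fact count anywhere in the loop, which matches the observation above that a count floor invites padding.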

One engine, many lanes

donto-extract is a single extraction engine with eight swappable model lanes — a declarative registry where each lane's caps and failure signatures are data, not if/else. Capped lanes rotate out automatically, pool-aware. The substrate doesn't care which model emitted a claim; the citer and the lifecycle hold every lane to the same evidence standard.

donto-extract · 8 lanes · auto-failover · injection-hardened
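
"Caps and failure signatures are data" can be pictured as a registry of plain structs. The sketch below is hypothetical throughout: `Lane`, `Registry`, the field names and the example signatures are invented for illustration, and pool awareness is not modeled.

```rust
/// One model lane, described as data (all names and numbers are hypothetical).
#[derive(Debug, Clone)]
struct Lane {
    name: &'static str,
    /// Maximum requests allowed in the current window.
    cap: u32,
    /// Requests already spent in the current window.
    used: u32,
    /// Substrings in an error message that mean "this lane is exhausted".
    failure_signatures: &'static [&'static str],
}

struct Registry {
    lanes: Vec<Lane>,
}

impl Registry {
    /// First lane with remaining capacity, in registry order.
    fn pick(&mut self) -> Option<&'static str> {
        let lane = self.lanes.iter_mut().find(|l| l.used < l.cap)?;
        lane.used += 1;
        Some(lane.name)
    }

    /// If `error` matches one of the lane's failure signatures, mark the lane
    /// as capped so `pick` rotates past it. Returns whether it was capped.
    fn report_error(&mut self, name: &str, error: &str) -> bool {
        for lane in self.lanes.iter_mut().filter(|l| l.name == name) {
            if lane.failure_signatures.iter().any(|sig| error.contains(*sig)) {
                lane.used = lane.cap;
                return true;
            }
        }
        false
    }
}
```

Adding a lane in this shape means adding a row, not a branch: the selection and failover logic never mentions a specific model.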
DontoQL

A query language with dimensions SQL doesn't have

Querying contested knowledge needs more than triples. DontoQL — implemented, with a SPARQL 1.1 subset compiling to the same engine — makes the substrate's dimensions first-class:

  • Predicate expansion: PREDICATES EXPAND folds the learned alignment closure, so a question asked in your vocabulary finds claims minted in any other.
  • Identity lenses: query under strict, cluster or transitive same-as identity; the merge is a per-query choice, never a destructive write.
  • Bitemporal travel: AS_OF and TRANSACTION_TIME AS_OF reconstruct what was true, and what the system believed, at any moment.
  • Policy-aware: POLICY ALLOWS filters by governance before content is ever touched.
  • Contradiction-ordered: sort by contradiction pressure to surface exactly where sources disagree.
a real query, run against the live substrate
MATCH ?person ex:diedAt ?place
SCOPE include ctx:genealogy
PREDICATES EXPAND
IDENTITY_LENS clusters
POLICY ALLOWS read_content
ORDER_BY contradiction_pressure DESC
LIMIT 25

One query: scoped to a context forest, predicate-expanded through the alignment closure, identity resolved under a chosen lens, policy-filtered, and ordered by where the evidence fights itself.
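
The bitemporal dimension is the least familiar of these, so a small model may help. The sketch below assumes each statement carries two half-open ranges, one for when the fact held in the world and one for when the system held the statement; the `Statement` struct, the `as_of` function and the integer day numbers are illustrative assumptions, not DontoQL's implementation.

```rust
/// A statement with two half-open time ranges, in days since an arbitrary epoch.
/// `None` as an upper bound means the range is still open.
#[derive(Debug, Clone, PartialEq)]
struct Statement {
    object: &'static str,
    /// When the fact held in the world (valid time).
    valid: (i64, Option<i64>),
    /// When the system held this statement (transaction time).
    recorded: (i64, Option<i64>),
}

/// True if `t` falls inside the half-open range.
fn contains(range: (i64, Option<i64>), t: i64) -> bool {
    range.0 <= t && range.1.map_or(true, |end| t < end)
}

/// What the system believed at `tx_time` about the world at `valid_time`.
fn as_of(log: &[Statement], valid_time: i64, tx_time: i64) -> Vec<&'static str> {
    log.iter()
        .filter(|s| contains(s.valid, valid_time) && contains(s.recorded, tx_time))
        .map(|s| s.object)
        .collect()
}
```

Fixing `valid_time` and moving `tx_time` replays how belief about one moment changed; if two incompatible statements are both open, both come back, which is the paraconsistent behavior described earlier.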

Measured, not claimed

Benchmarked on LongMemEval — reported honestly

donto-memory, the agent-memory consumer built on the substrate, was run through LongMemEval (ICLR 2025), the standard long-term-memory benchmark, under audited no-leakage conditions. The honest headline: where a whole history fits in a frontier model's context, raw accuracy ties a full-context reader, and the substrate earns its keep on retrieval quality, token cost, knowledge-update and abstention, the things that survive when histories outgrow any context window.

  • 0.98 retrieval hit@10 on LongMemEval_s, up from 0.85 lexical-only; the hybrid vector arm is load-bearing
  • 0.933 answer accuracy on a stratified LongMemEval_s sample, 1.3 points below the 0.946 oracle ceiling
  • ~2× lower token cost than handing the reader the full history
  • 1.0 abstention on unanswerable questions; evidence-first means knowing when not to answer

Full methodology, baselines and the uncomfortable parts in the LongMemEval study.
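
For readers unfamiliar with the retrieval metric, hit@k can be computed as below. This sketch assumes the common definition (a question counts as a hit when at least one gold item appears in the top k retrieved results); the study's exact definition may differ, and the `hit_at_k` function and integer ids are illustrative.

```rust
/// Fraction of questions whose top-`k` retrieved ids include at least one gold id.
/// `retrieved[i]` is the ranked result list for question `i`, best first.
fn hit_at_k(retrieved: &[Vec<u32>], gold: &[Vec<u32>], k: usize) -> f64 {
    assert_eq!(retrieved.len(), gold.len(), "one ranking per question");
    if retrieved.is_empty() {
        return 0.0;
    }
    let hits = retrieved
        .iter()
        .zip(gold)
        .filter(|(ranked, relevant)| {
            ranked.iter().take(k).any(|id| relevant.contains(id))
        })
        .count();
    hits as f64 / retrieved.len() as f64
}
```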

What it makes possible

Relationships no one ever thought to type

Point a model at the same entity through ten different lenses — philosophical, linguistic, temporal, causal, social, material — and it will emit properties and edges a hand-built schema would never have anticipated. You don't pre-type them. You let them accumulate, and resolve the joins when a question needs them.

  • Philosophical: essence, identity-over-time, mereology
  • Linguistic: sense, register, etymology, translation drift
  • Temporal: validity intervals, succession, anachronism
  • Causal: enables, prevents, is-evidence-for
  • Social: witness, sponsor, neighbor, FAN networks
  • Material: composition, provenance, location-over-time
For agents

Give any agent a memory that cites its sources

donto-memory ships an MCP server — three tools that turn any MCP-capable agent into one whose memory is anchored, recallable and substrate-wide. Install instructions, agent docs and the manifest live at mcp.donto.org.

donto_recall · donto_search · donto_memorize
  • Recall — holder-scoped memory with hybrid lexical + vector retrieval
  • Search — full-text over the entire substrate, all consumers
  • Memorize — text in, anchored claims out, evidence spans attached
or speak HTTP directly
# remember something — bitemporal from day one
curl -X POST https://memories.apexpots.com/memorize \
  -H 'content-type: application/json' \
  -d '{"holder": "agent:you",
       "text": "Ada moved the API to Rust in March.",
       "valid_from": "2026-03-01"}'

# recall it — hybrid lexical + vector, holder-scoped
curl -X POST https://memories.apexpots.com/recall \
  -H 'content-type: application/json' \
  -d '{"holder": "agent:you", "query": "what runs the API?"}'
Under the hood

Built like infrastructure, because it is

A Rust workspace over Postgres: bitemporal ranges and content-hash idempotency at the schema layer, content-addressed blobs (SHA-256, GCS-backed) behind every document, a Trust Kernel for policy and attestations, Lean-backed shape validation, and importers for five linguistic corpus formats that report exactly what they couldn't carry.

  • 22 Rust crates in the workspace
  • 67 substrate API routes
  • 127 SQL migrations
  • 80 invariant test suites
  • 23 Lean 4 modules
  • 14 native object families

Bind your domain to the substrate.

One read-only discovery surface is all it takes to bind a new consumer. No SQL, no schema migration — just claims, evidence, and the lifecycle.