Let models say everything.
Let reality decide what survives.
donto is a bitemporal, paraconsistent, evidence-first claim substrate, written in Rust on Postgres. Modern models emit an unbounded firehose of typed claims about anything — inventing the predicates as they go, for a fraction of a cent each. donto holds that firehose without collapsing it: contradictions are legal state, every claim is anchored to its source, and typing, alignment and identity are deferred to query time. A knowledge base that grows in all directions and prunes by reality.
Generative abundance changes the shape of the problem
For fifty years, knowledge graphs were scarce: every triple was expensive to author, so the schema came first and the facts came slowly. Large models invert that economy — GPTKB pulled 105M typed triples out of one mid-tier model at roughly $0.00009 per claim; AutoSchemaKG built a 900M-node graph with no predefined schema at all; and the cost keeps falling roughly an order of magnitude a year. The bottleneck is no longer extraction. It's trust.
Emit freely
No fixed schema, no pre-typed predicates. Models invent the predicate they need (~1M minted so far) and assert in every direction — that's the signature of abundance, not a bug to suppress.
Defer to query time
Typing, alignment, identity resolution and joining are not write-time gates. They're query-time judgments — composed on demand, reversible, never destroying the raw claim.
Prune by reality
The substrate stores contradiction paraconsistently and lets a claim's evidence, corroboration and lifecycle decide its standing. Reality is the verifier — not a curator.
The claim lifecycle — not the schema — is the product
Anyone can extract facts. The durable advantage is what happens after: how a claim earns or loses standing over time, with its evidence intact and its contradictions preserved. donto is built around an eight-step lifecycle.
- Step 1: Ingest
  Pull any source: a paper, a deed, a resume, a conversation. Register it as an immutable, content-addressed document the substrate can always retrieve.
- Step 2: Emit free
  The model emits unbounded claims, inventing predicates and axes as it goes. The only write-time invariant: an evidence anchor, or an explicit hypothesis flag.
- Step 3: Hold incompatible
  Contradictory claims are stored side-by-side as legal bitemporal state. A conflict is data, not a failed write.
- Step 4: Hypothesize
  Typed relationship hypotheses are proposed where lenses intersect: connections no schema author would have pre-typed.
- Step 5: Attach evidence
  Argument edges (supports, rebuts, undercuts, qualifies) carry evidence and counter-evidence with full provenance.
- Step 6: Rank
  Claims are scored by value (information gain, novelty, downstream task-lift), not by accuracy alone.
- Step 7: Re-rank in time
  New evidence re-scores old hypotheses. The bitemporal record means standing compounds; nothing is frozen at write time.
- Step 8: Explain
  Answers are explained only from evidence already attached: faithful by construction, never narrative first.
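Steps 2 and 3 are the easiest to picture in code. What follows is a minimal in-memory sketch, not donto's actual schema or API: the names `Grounding`, `Claim`, `ClaimStore`, `assert_claim` and `competing_objects` are hypothetical, and a `HashMap` stands in for Postgres. It shows the one write-time check (an anchor or a hypothesis flag) and a store that keeps disagreeing claims side by side instead of rejecting the second one.

```rust
use std::collections::HashMap;

/// How a claim is grounded: a span in a registered document, or an explicit hypothesis flag.
/// All names in this sketch are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum Grounding {
    /// Byte range inside a content-addressed document.
    Anchor { doc_hash: String, start: usize, end: usize },
    /// The emitter marks the claim as unanchored interpretation.
    Hypothesis,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub subject: String,
    pub predicate: String, // free-form: no predicate registry is consulted at write time
    pub object: String,
    pub grounding: Grounding,
}

#[derive(Debug, PartialEq)]
pub enum WriteError {
    EmptyAnchor,
}

#[derive(Default)]
pub struct ClaimStore {
    // Every claim for a (subject, predicate) pair is kept; nothing is overwritten.
    by_key: HashMap<(String, String), Vec<Claim>>,
}

impl ClaimStore {
    /// The only write-time check: an anchor must point at a non-empty span.
    /// Disagreement with existing claims is never a reason to reject a write.
    pub fn assert_claim(&mut self, claim: Claim) -> Result<(), WriteError> {
        if let Grounding::Anchor { start, end, .. } = &claim.grounding {
            if start >= end {
                return Err(WriteError::EmptyAnchor);
            }
        }
        let key = (claim.subject.clone(), claim.predicate.clone());
        self.by_key.entry(key).or_default().push(claim);
        Ok(())
    }

    /// Distinct objects asserted for the same subject and predicate, sorted.
    /// More than one entry marks a candidate contradiction.
    pub fn competing_objects(&self, subject: &str, predicate: &str) -> Vec<&str> {
        let key = (subject.to_string(), predicate.to_string());
        let mut objects: Vec<&str> = self
            .by_key
            .get(&key)
            .map(|claims| claims.iter().map(|c| c.object.as_str()).collect::<Vec<&str>>())
            .unwrap_or_default();
        objects.sort_unstable();
        objects.dedup();
        objects
    }
}
```

Asserting "died at London" from an anchored span and "died at Paris" as a hypothesis both succeed; `competing_objects` then returns both values, which is the raw material for the later evidence and ranking steps.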
Ten invariants, enforced in code
These aren't aspirations in a slide deck. Each invariant is enforced at the schema and API layer and guarded by 80 dedicated invariant test suites that run in CI — the substrate refuses to be a normal database.
The machinery is live
Every piece of the thesis has a running counterpart on the substrate — watchable in real time at scanner.donto.org.
The alignment engine
“Defer to query time” is shipped, not a slide. An embedding fabric — 919K predicate vectors and 338K entity fingerprints (bge-small, HNSW-indexed) — plus a continuous alignment daemon proposes, adjudicates and materializes a 1M-row predicate closure. killedBy meets murderedBy at cosine 0.95 without anyone maintaining a synonym table.
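To make the cosine figure concrete, here is a small sketch of how alignment candidates could be proposed from predicate vectors. It is an illustration under assumptions, not the daemon's code: `cosine` and `propose_alignments` are hypothetical names, the vectors are toy three-dimensional values rather than bge-small embeddings, and the scan is brute force where the real system is described as HNSW-indexed.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None when the lengths differ, the input is empty, or either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Propose every pair of predicates whose vectors meet the threshold.
/// Brute force O(n^2); an approximate nearest-neighbour index avoids this full scan.
pub fn propose_alignments<'a>(
    predicates: &'a [(&'a str, Vec<f32>)],
    threshold: f32,
) -> Vec<(&'a str, &'a str, f32)> {
    let mut proposals = Vec::new();
    for (i, (name_a, vec_a)) in predicates.iter().enumerate() {
        for (name_b, vec_b) in &predicates[i + 1..] {
            if let Some(score) = cosine(vec_a, vec_b) {
                if score >= threshold {
                    proposals.push((*name_a, *name_b, score));
                }
            }
        }
    }
    proposals
}
```

A proposal is only a candidate. In the flow described above it would still pass through adjudication before being materialized into the closure.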
The always-on citer
Extraction (what was claimed) and anchoring (where in the source) are separate stages. Every extracted fact is post-processed by a semantic citer that attaches the exact evidence span — or honestly flags the claim as interpretation, never a bogus span. It separates what a source stated from what a model inferred, and doubles as a hallucination filter.
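The anchor-or-flag contract can be sketched in a few lines. This is a deliberately simplified stand-in: the citer described above is semantic, while this version uses case-insensitive exact matching, and the names `Citation` and `cite` are hypothetical. What it preserves is the important property that a missing span yields an explicit interpretation flag, never a made-up offset.

```rust
#[derive(Debug, PartialEq)]
pub enum Citation {
    /// Byte offsets of the supporting span in the source text.
    Span { start: usize, end: usize },
    /// No span found: the claim is flagged as interpretation instead of getting a fake anchor.
    Interpretation,
}

/// Look for `quote` in `source`, ignoring ASCII case.
/// Exact matching is a simplification; a semantic citer would also accept paraphrases.
pub fn cite(source: &str, quote: &str) -> Citation {
    if quote.is_empty() {
        return Citation::Interpretation;
    }
    // ASCII lowercasing never changes byte lengths, so offsets stay valid for `source`.
    let haystack = source.to_ascii_lowercase();
    let needle = quote.to_ascii_lowercase();
    match haystack.find(&needle) {
        Some(start) => Citation::Span { start, end: start + needle.len() },
        None => Citation::Interpretation,
    }
}
```

Because the offsets index into the original text, slicing `source[start..end]` returns the evidence exactly as the source wrote it, original casing included.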
The gleaning loop
Models stop early by choice, not capacity. The extraction harness re-prompts the same source until saturation — one article went from 511 facts to 3,227 — and stops only after consecutive dry passes, because a count floor makes models pad with garbage. Saturation decides done; meaningful coverage is the goal.
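The stopping rule is simple enough to write down. Below is a sketch under assumptions: `glean`, `patience` and `max_passes` are hypothetical names, and the `extract` closure stands in for a model call that receives the pass number and the facts found so far. The loop ends after `patience` consecutive passes that add nothing new, with `max_passes` as a safety bound; there is no minimum count to hit.

```rust
use std::collections::HashSet;

/// Re-run `extract` over the same source until `patience` consecutive passes add nothing new.
/// `extract` is a stand-in for a model call, not a real client.
pub fn glean<F>(mut extract: F, patience: usize, max_passes: usize) -> Vec<String>
where
    F: FnMut(usize, &[String]) -> Vec<String>,
{
    let mut seen: HashSet<String> = HashSet::new();
    let mut facts: Vec<String> = Vec::new();
    let mut dry_passes = 0;

    for pass in 0..max_passes {
        let candidates = extract(pass, &facts);
        let mut added = 0;
        for fact in candidates {
            // Deduplicate so that a model repeating itself counts as a dry pass.
            if seen.insert(fact.clone()) {
                facts.push(fact);
                added += 1;
            }
        }
        // Any productive pass resets the counter: only consecutive dry passes end the loop.
        dry_passes = if added == 0 { dry_passes + 1 } else { 0 };
        if dry_passes >= patience {
            break;
        }
    }
    facts
}
```

Passing the known facts back into `extract` lets the prompt ask for what is still missing, and deduplication keeps padding from counting as progress.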
One engine, many lanes
donto-extract is a single extraction engine with eight swappable model lanes — a declarative registry where each lane's caps and failure signatures are data, not if/else. Capped lanes rotate out automatically, pool-aware. The substrate doesn't care which model emitted a claim; the citer and the lifecycle hold every lane to the same evidence standard.
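"Caps and failure signatures are data" might look something like the sketch below. Everything here is assumed for illustration: the `Lane` fields, the `Registry` type, and the substring-based signature matching are not taken from donto-extract, and pool awareness is left out. The point is only that rotating a lane out needs no per-model branch, just a lookup against that lane's declared signatures.

```rust
/// One model lane, described as data. Field names are illustrative.
#[derive(Debug, Clone)]
pub struct Lane {
    pub name: &'static str,
    pub daily_cap: u32,
    pub used_today: u32,
    /// Substrings that, when seen in an error message, mean the lane hit its cap.
    pub cap_signatures: &'static [&'static str],
}

pub struct Registry {
    lanes: Vec<Lane>,
    capped: Vec<bool>,
}

impl Registry {
    pub fn new(lanes: Vec<Lane>) -> Self {
        let capped = vec![false; lanes.len()];
        Self { lanes, capped }
    }

    /// First lane in registry order that still has budget and is not marked capped.
    pub fn pick(&self) -> Option<&Lane> {
        self.lanes
            .iter()
            .zip(&self.capped)
            .find(|(lane, capped)| !**capped && lane.used_today < lane.daily_cap)
            .map(|(lane, _)| lane)
    }

    /// Record an error from a lane; rotate it out if the message matches a cap signature.
    pub fn report_error(&mut self, lane_name: &str, message: &str) {
        for (lane, capped) in self.lanes.iter().zip(self.capped.iter_mut()) {
            if lane.name == lane_name
                && lane.cap_signatures.iter().any(|s| message.contains(*s))
            {
                *capped = true;
            }
        }
    }
}
```

Adding a ninth lane in this shape means adding one more `Lane` value to the list; `pick` and `report_error` stay untouched.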
A query language with dimensions SQL doesn't have
Querying contested knowledge needs more than triples. DontoQL — implemented, with a SPARQL 1.1 subset compiling to the same engine — makes the substrate's dimensions first-class:
- Predicate expansion: PREDICATES EXPAND folds the learned alignment closure, so a question asked in your vocabulary finds claims minted in any other.
- Identity lenses: query under strict, cluster or transitive same-as identity; the merge is a per-query choice, never a destructive write.
- Bitemporal travel: AS_OF and TRANSACTION_TIME AS_OF reconstruct what was true, and what the system believed, at any moment.
- Policy-aware: POLICY ALLOWS filters by governance before content is ever touched.
- Contradiction-ordered: sort by contradiction pressure to surface exactly where sources disagree.
MATCH ?person ex:diedAt ?place
SCOPE include ctx:genealogy
PREDICATES EXPAND
IDENTITY_LENS clusters
POLICY ALLOWS read_content
ORDER_BY contradiction_pressure DESC
LIMIT 25

One query: scoped to a context forest, predicate-expanded through the alignment closure, identity resolved under a chosen lens, policy-filtered, and ordered by where the evidence fights itself.
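The two time axes behind bitemporal travel are easy to confuse, so here is a sketch of the semantics only. It is not DontoQL's engine: `Interval`, `Row` and `as_of` are hypothetical names, and plain integers stand in for timestamps. Each row carries a valid-time interval (when the statement holds in the world) and a transaction-time interval (when the system held that row as current); a query fixes a point on each axis.

```rust
/// Half-open interval [from, to); `to = None` means still open.
#[derive(Debug, Clone, Copy)]
pub struct Interval {
    pub from: i64,
    pub to: Option<i64>,
}

impl Interval {
    pub fn contains(&self, t: i64) -> bool {
        self.from <= t && self.to.map_or(true, |end| t < end)
    }
}

#[derive(Debug, Clone)]
pub struct Row {
    pub statement: &'static str,
    /// When the statement holds in the world.
    pub valid: Interval,
    /// When the system held this row as its current belief.
    pub transaction: Interval,
}

/// Rows valid at `valid_at`, according to what the system believed at `known_at`.
pub fn as_of(rows: &[Row], valid_at: i64, known_at: i64) -> Vec<&Row> {
    rows.iter()
        .filter(|r| r.valid.contains(valid_at) && r.transaction.contains(known_at))
        .collect()
}
```

A correction never edits a row in place: it closes the old row's transaction interval and opens new rows, so asking with an earlier `known_at` still returns the belief as it stood then.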
Benchmarked on LongMemEval — reported honestly
donto-memory, the agent-memory consumer built on the substrate, was run through LongMemEval (ICLR 2025), the standard long-term-memory benchmark, under audited no-leakage conditions. The honest headline: where a whole history fits in a frontier model's context, raw accuracy ties a full-context reader. The substrate earns its keep on retrieval quality, token cost, knowledge-update and abstention, the things that survive when histories outgrow any context window.
retrieval hit@10 on LongMemEval_s — up from 0.85 lexical-only; the hybrid vector arm is load-bearing
answer accuracy on a stratified LongMemEval_s sample — within a point of the 0.946 oracle ceiling
lower token cost than handing the reader the full history
abstention on unanswerable questions — evidence-first means knowing when not to answer
Full methodology, baselines and the uncomfortable parts in the LongMemEval study.
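For readers unfamiliar with the retrieval metric, hit@k asks whether at least one relevant item appears among the top k retrieved results, averaged over questions. The sketch below shows that definition as commonly used; it is not the study's evaluation harness, and the function names and toy ids in the usage are hypothetical.

```rust
use std::collections::HashSet;

/// hit@k for one question: 1.0 if any of the top-k retrieved ids is a gold id, else 0.0.
pub fn hit_at_k(retrieved: &[&str], gold: &HashSet<&str>, k: usize) -> f64 {
    let hit = retrieved.iter().take(k).any(|id| gold.contains(*id));
    if hit {
        1.0
    } else {
        0.0
    }
}

/// Mean hit@k over a set of (retrieved, gold) pairs; None for an empty set.
pub fn mean_hit_at_k(runs: &[(Vec<&str>, HashSet<&str>)], k: usize) -> Option<f64> {
    if runs.is_empty() {
        return None;
    }
    let total: f64 = runs.iter().map(|(r, g)| hit_at_k(r, g, k)).sum();
    Some(total / runs.len() as f64)
}
```

The metric is binary per question, so it rewards getting any relevant evidence into the window and says nothing about rank within it.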
Relationships no one ever thought to type
Point a model at the same entity through ten different lenses — philosophical, linguistic, temporal, causal, social, material — and it will emit properties and edges a hand-built schema would never have anticipated. You don't pre-type them. You let them accumulate, and resolve the joins when a question needs them.
One substrate, many consumers
donto stays infrastructure. Everything else is an example of binding a domain to it — proof that the same substrate serves wildly different consumers.
Persistent memory for agents
Every message becomes anchored, recallable claims. Hybrid lexical + vector recall, bitemporal knowledge-update, evidence-first abstention — benchmarked on LongMemEval. Speaks MCP, so any agent can plug in.
Evidence-first family research
The hardest test of a claim substrate: contradictory sources, century-old records, identity that is itself a hypothesis. Every fact retains its source snippet and the full resource behind it.
The substrate, watched live
A real-time monitor of the substrate itself: contexts as sectors, claims arriving as packets, contradictions surfacing as they're detected. The firehose, visible.
Give any agent a memory that cites its sources
donto-memory ships an MCP server — three tools that turn any MCP-capable agent into one whose memory is anchored, recallable and substrate-wide. Install instructions, agent docs and the manifest live at mcp.donto.org.
- Recall — holder-scoped memory with hybrid lexical + vector retrieval
- Search — full-text over the entire substrate, all consumers
- Memorize — text in, anchored claims out, evidence spans attached
# remember something: bitemporal from day one
curl -X POST https://memories.apexpots.com/memorize \
  -H 'content-type: application/json' \
  -d '{"holder": "agent:you",
       "text": "Ada moved the API to Rust in March.",
       "valid_from": "2026-03-01"}'

# recall it: hybrid lexical + vector, holder-scoped
curl -X POST https://memories.apexpots.com/recall \
  -H 'content-type: application/json' \
  -d '{"holder": "agent:you", "query": "what runs the API?"}'

Built like infrastructure, because it is
A Rust workspace over Postgres: bitemporal ranges and content-hash idempotency at the schema layer, content-addressed blobs (SHA-256, GCS-backed) behind every document, a Trust Kernel for policy and attestations, Lean-backed shape validation, and importers for five linguistic corpus formats that report exactly what they couldn't carry.
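Content addressing and idempotent writes fit in a short sketch. This one is illustrative only: `content_key` and `BlobStore` are hypothetical names, the store is an in-memory map rather than GCS, and the standard library's 64-bit `DefaultHasher` stands in for SHA-256 purely to keep the example dependency-free. It is not collision resistant and should not be used as a real content address.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Content address of a blob.
/// Stand-in only: a real store would use SHA-256, as described in the text.
pub fn content_key(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

#[derive(Default)]
pub struct BlobStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl BlobStore {
    /// Idempotent put: storing the same bytes twice yields the same key and one stored copy.
    /// Returns the key and whether the blob was newly written.
    pub fn put(&mut self, bytes: &[u8]) -> (u64, bool) {
        let key = content_key(bytes);
        let is_new = !self.blobs.contains_key(&key);
        if is_new {
            self.blobs.insert(key, bytes.to_vec());
        }
        (key, is_new)
    }

    pub fn get(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(Vec::as_slice)
    }

    pub fn len(&self) -> usize {
        self.blobs.len()
    }
}
```

Because the key is derived from the bytes, re-ingesting a document is a no-op, and every claim anchored to that key keeps pointing at exactly the content it was extracted from.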
The research behind donto
The Abundance Substrate
The canonical thesis: emit freely, defer to query time, prune by reality.
Read →
LongMemEval Study
A no-shortcuts benchmark of donto-memory — wins, ties and gaps reported plainly.
Read →
Extraction Engineering
Gleaning loops, the always-on citer, and why count is the wrong target.
Read →
More at the research index: bakeoffs, deep-extraction studies, the substrate PRD.
Bind your domain to the substrate.
One read-only discovery surface is all it takes to bind a new consumer. No SQL, no schema migration — just claims, evidence, and the lifecycle.