The embedding fabric

donto holds ~1M freely-minted predicates and ~4.1M distinct subjects, and reconciles none of them at write time. The embedding fabric is what makes that affordable: it turns predicates, entity fingerprints and memory text into 384-dimensional vectors so that killedBy can find assassinatedBy, and two differently-spelled records of the same person can find each other — by meaning, at query time.

predicate vectors: 994K (98.6%)
entity vectors: 385K
memory vectors: 175K (+1.2M backlog)
model: bge-small-en-v1.5 · 384d
index: pgvector 0.8.2 HNSW

Two rules shape everything

rule	status	meaning
additive only	I3-safe	The fabric writes only to its own derived cache tables, never to donto_statement. Vectors propose similarity — alignment candidates, identity candidates, recall rankings — they never merge, retract or overwrite anything.
no brittle logic	non-negotiable	No synonym tables, no predicate-variant dicts, no string-case ladders anywhere. The code does structural tokenization of IRIs; the model supplies all semantics; salience weights are learned from the corpus (IDF), not curated.

The model: BAAI/bge-small-en-v1.5

Every component pins the same model — 384 dimensions, fastembed/ONNX runtime (no PyTorch, ~130MB), normalized geometry matching vector_cosine_ops. Why this one: 384d keeps million-row pgvector tables + HNSW indexes viable on a single Postgres box (the predicate table is 3.6GB including its index), and it's strong enough for synonym folding — measured: murdered → killed at cosine 0.947.

The model is pinned

Changing it requires re-embedding the entire fabric — every stored vector and every query vector must come from the same model or cosine distances are meaningless.

Throughput realities (all measured): the 4-core box does ~100 vectors/min; speed scales inversely with text length (400 chars → ~21/s, 3,000 → ~2.7/s), which is why every text builder truncates aggressively. There is no free 10× on CPU — a single GPU does ~2,000–6,000/min ≈ a 50–100-core fleet, which is why the distributed fabric exists.
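To put those rates in perspective, here is a back-of-envelope calculation that uses only figures quoted on this page: the 1.2M memory-chunk backlog and the measured per-minute rates. The helper name `eta_hours` is illustrative, not part of donto.

```python
def eta_hours(backlog: int, vectors_per_min: float) -> float:
    """Hours needed to embed `backlog` texts at a steady rate."""
    return backlog / vectors_per_min / 60


BACKLOG = 1_200_000  # memory-chunk backlog quoted in the stats above

for label, rate in [
    ("4-core CPU box", 100),
    ("GPU, low end", 2_000),
    ("GPU, high end", 6_000),
]:
    hours = eta_hours(BACKLOG, rate)
    print(f"{label:>15}: {hours:7.1f} h ({hours / 24:.1f} days)")
```

At these rates the backlog takes about 200 hours (over eight days) on the CPU box, against roughly 3 to 10 hours on a single GPU.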

What exactly gets embedded

The cardinal detail: vectors are only as good as the text fed to the model, and each of the three targets builds that text differently. All three share one primitive — humanize(): strip the CURIE/URL prefix, split camelCase/kebab/snake/digit boundaries, lowercase. Pure structure, no domain knowledge:

'ex:assassinatedBy'                -> 'assassinated by'
'partOfColonialFrontierMassacres'  -> 'part of colonial frontier massacres'
'flesch-kincaid_grade'             -> 'flesch kincaid grade'
'ex:caroline-rose-brown'           -> 'caroline rose brown'
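
A minimal sketch of such a function, assuming plain regular expressions are enough. This is not the daemon's actual code, and the exact boundary rules are an assumption; it is only meant to reproduce the four examples above.

```python
import re


def humanize(iri: str) -> str:
    """Structural tokenization only: no synonym tables, no domain knowledge."""
    # Strip a URL prefix (keep what follows the last '/' or '#') ...
    local = re.split(r"[/#]", iri.rstrip("/#"))[-1]
    # ... then a CURIE prefix (keep what follows the last ':').
    local = local.rsplit(":", 1)[-1]
    # camelCase boundary: lower-case letter or digit followed by upper-case.
    local = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local)
    # Acronym boundary: 'HTTPServer' -> 'HTTP Server'.
    local = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", local)
    # Letter/digit boundaries in both directions.
    local = re.sub(r"(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])", " ", local)
    # kebab-case, snake_case and stray whitespace collapse to single spaces.
    local = re.sub(r"[-_\s]+", " ", local)
    return local.strip().lower()


for iri in (
    "ex:assassinatedBy",
    "partOfColonialFrontierMassacres",
    "flesch-kincaid_grade",
    "ex:caroline-rose-brown",
):
    print(f"{iri!r:36} -> {humanize(iri)!r}")
```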

1 · Predicates — the meaning of a relation

build_embed_text(): prefer the human-authored label/description when present, then always append the humanized IRI so the structural signal survives when they're NULL — the overwhelming case for LLM-minted predicates.
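
The rule reads roughly as follows in code. This is a sketch under assumptions: the real `build_embed_text()` presumably takes the IRI itself, while this version takes the already-humanized form to stay self-contained, and the `". "` separator is a guess.

```python
from typing import Optional


def build_embed_text(
    humanized_iri: str,
    label: Optional[str] = None,
    description: Optional[str] = None,
) -> str:
    """Human-authored text first (when present), humanized IRI always last."""
    # Keep only non-empty label/description values.
    parts = [p.strip() for p in (label, description) if p and p.strip()]
    # The structural signal is appended unconditionally, so a predicate with
    # NULL label and description still yields usable text.
    parts.append(humanized_iri)
    return ". ".join(parts)


print(build_embed_text("assassinated by"))
print(build_embed_text("assassinated by", label="Assassinated by"))
```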

2 · Entities — a fingerprint of what is said about them

An entity is embedded by what is said about it, not just its name — so two records of the same person with different spellings land close because their described relations overlap. The signature: the name line, up to 3 pinned rdf:type lines, then up to 14 salient relations ranked by data-driven IDF (from the materialized view donto_predicate_idf): a predicate on nearly every subject carries no information and sinks; a rare discriminating relation floats up. Predicates minted once for one subject are demoted — they describe but can never match. Capped at 600 chars (short text embeds ~3× faster with no measured loss of matching power). A real live signature:

[SIG] ex:caroline-rose-brown
  Brown | needs cross reference with caroline birth registration |
  possible father brown roberts | previous identified as Caroline Molloy |
  previous identified as Caroline Davis | is progeny of brown roberts |
  has death date 1975-07-06 | has birth date 1887-06-10 |
  is sister of david molloy | has alias kitchay | has birth year 1887
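
The ranking behind a signature like this can be sketched as below. Everything here is illustrative: the IDF formula `log(N / df)`, the choice to sort singleton predicates to the bottom, and all function and parameter names are assumptions. The real builder reads its weights from donto_predicate_idf.

```python
import math
from typing import Dict, List, Tuple

MAX_TYPES = 3        # pinned rdf:type lines
MAX_RELATIONS = 14   # salient relations kept
SIG_CHAR_CAP = 600   # hard cap on signature length


def build_signature(
    name: str,
    types: List[str],
    relations: List[Tuple[str, str]],   # (humanized predicate, object text)
    subject_counts: Dict[str, int],     # predicate -> distinct subjects using it
    n_subjects: int,
) -> str:
    def salience(relation: Tuple[str, str]) -> float:
        predicate, _ = relation
        count = subject_counts.get(predicate, 1)
        if count <= 1:
            # Minted once for one subject: it describes but can never match.
            return float("-inf")
        # Ubiquitous predicates score near 0; rare ones score high.
        return math.log(n_subjects / count)

    ranked = sorted(relations, key=salience, reverse=True)[:MAX_RELATIONS]
    lines = [name, *types[:MAX_TYPES], *(f"{p} {o}" for p, o in ranked)]
    return " | ".join(lines)[:SIG_CHAR_CAP]


print(build_signature(
    "Brown",
    ["person"],
    [("has name", "caroline"), ("once only", "x"), ("is sister of", "david molloy")],
    {"has name": 990, "is sister of": 12, "once only": 1},
    n_subjects=1000,
))
```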

Embedding is tiered by richness (most of the 4.1M subjects are 1-statement singletons with near-zero identity value): the daemon backfills richest-first through descending minimum-statement tiers (50 → 20 → 10 → 5). A sig_hash per entity means re-runs only re-embed entities whose signature actually changed.
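
A compact model of that ladder and the hash check, assuming an in-memory dict in place of the database. The function names and data shapes are hypothetical; only the tier thresholds and the use of sha256 come from the text.

```python
import hashlib
from typing import Dict, List, Tuple

TIERS = (50, 20, 10, 5)  # descending minimum-statement tiers


def sig_hash(signature: str) -> str:
    """sha256 of the signature text, hex-encoded."""
    return hashlib.sha256(signature.encode("utf-8")).hexdigest()


def plan_backfill(
    entities: Dict[str, Tuple[int, str]],   # iri -> (statement count, signature)
    stored_hashes: Dict[str, str],          # iri -> sig_hash already in the table
) -> List[Tuple[int, str]]:
    """Return (tier, iri) pairs to embed, richest tier first."""
    plan, seen = [], set()
    for tier in TIERS:
        for iri, (count, signature) in sorted(entities.items()):
            if iri in seen or count < tier:
                continue
            seen.add(iri)
            # Skip entities whose signature text has not changed.
            if stored_hashes.get(iri) != sig_hash(signature):
                plan.append((tier, iri))
    return plan


entities = {"ex:a": (60, "A"), "ex:b": (7, "B"), "ex:c": (1, "C"), "ex:d": (25, "D")}
stored = {"ex:d": sig_hash("D")}  # ex:d is already embedded and unchanged
print(plan_backfill(entities, stored))
```

In this toy run `ex:c` is never scheduled because a single statement falls below the lowest tier, and `ex:d` is skipped because its stored hash still matches.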

3 · Memory chunks and semantic claims

Episodic chunks embed their raw prose, truncated to CHUNK_TEXT_CAP = 1200 chars (~300 tokens) — a measured cap: raising it to 2048 gave zero recall gain on LongMemEval while costing ~33% throughput; the signal sits in the first ~300 tokens. Semantic claims embed a humanized rendering of the triple — "ajax occupation software engineer" — which is what lets "where does he work?" reach an ex:occupation engineer claim that shares no substring with the query.

Storage — three derived-cache tables

All three are recomputable from donto_statement, and each carries an HNSW index (vector_cosine_ops) plus a BRIN index on updated_at.

table	size	columns and notes
donto_predicate_embedding	994K rows · 3.6GB	`iri PK · embedding vector(384) · model · updated_at`. No staleness hash needed — predicate text effectively never mutates.
donto_entity_embedding	385K rows · 1.7GB	Adds `sig_hash` — sha256 of the signature text, so re-runs skip unchanged entities.
donto_x_memory_chunk_embedding	175K rows · 644MB	Keyed by `statement_id` (FK, no ON DELETE: the substrate never deletes, and retracted statements' vectors are excluded by the `upper(tx_time) IS NULL` join every reader does). Also carries `record_iri` and the denormalized `holder_iri`. The latter was added after holder-scoping the vector arm via a join into the HNSW index blew the 10s timeout at 41K-chunk-holder scale; with a btree index on `holder_iri`, the planner does an exact distance sort over just that holder's vectors.

HNSW indexes use pgvector defaults (m=16, ef_construction=64). Query-time recall breadth: hnsw.ef_search defaults to 40; the alignment batch jobs set 100 at session level — deliberately not inside the SQL functions, because a composable function must not mutate GUCs.

Who computes the vectors

Two cooperating producers write the same tables with byte-identical text builders (the coordinator literally imports the daemon's modules — single source of truth, zero drift):

component	where	what it does
the alignment daemon	local, continuous	Every ~5min tick: embed missing predicates + entities (tier ladder), propose alignment/identity candidates, LLM-adjudicate small cap-aware batches, periodically rebuild the closure. Load-gated (skips when the box is busy); single-instance via flock + a Postgres advisory lock; upsert batches sorted by IRI so concurrent `ON CONFLICT` writers can't deadlock.
the distributed coordinator	:7930 · donto.org/embed	Remote workers (any machine, CPU or GPU) lease texts, embed locally, submit vectors — the worker is dumb and never touches the DB. `FOR UPDATE SKIP LOCKED` leases (no double work, ~15min stale-lease reclaim), a self-topping queue (refills toward 100K every 2min), dimension-validated submits, and worker error telemetry (`/embed/report`, surfaced on the admin console). Anyone can contribute via donto.org/help.
the query-time embedder	:7902	A tiny service that loads bge-small once and turns recall queries into vectors in milliseconds. Strictly optional: any failure → recall degrades gracefully to FTS-only.
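
The coordinator's lease mechanics can be modelled in a few lines. This is an in-memory sketch only: the real coordinator uses `FOR UPDATE SKIP LOCKED` in Postgres, and the class, method names and return values here are hypothetical. Only the 384-dimension check and the ~15min stale-lease window come from the table above.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

DIM = 384                # every vector must match the pinned model
STALE_AFTER_S = 15 * 60  # leases older than this may be reclaimed


@dataclass
class Job:
    text: str
    leased_by: Optional[str] = None
    leased_at: float = 0.0
    vector: Optional[List[float]] = None


class LeaseQueue:
    def __init__(self, texts: List[str]) -> None:
        self.jobs: Dict[int, Job] = {i: Job(t) for i, t in enumerate(texts)}

    def lease(self, worker: str, n: int, now: float) -> List[Tuple[int, str]]:
        """Hand out up to n texts that are unleased or whose lease went stale."""
        out = []
        for job_id, job in self.jobs.items():
            if len(out) == n:
                break
            if job.vector is not None:
                continue  # already embedded
            stale = job.leased_by is not None and now - job.leased_at >= STALE_AFTER_S
            if job.leased_by is None or stale:
                job.leased_by, job.leased_at = worker, now
                out.append((job_id, job.text))
        return out

    def submit(self, worker: str, job_id: int, vector: List[float]) -> bool:
        """Accept a vector only from the current lease holder and only if well-formed."""
        job = self.jobs[job_id]
        if job.leased_by != worker:
            return False
        if len(vector) != DIM or not all(math.isfinite(x) for x in vector):
            return False
        job.vector = list(vector)
        return True


q = LeaseQueue(["a", "b", "c"])
print(q.lease("w1", 2, now=0))       # w1 takes the first two texts
print(q.lease("w2", 2, now=10))      # w2 only sees the remaining one
print(q.submit("w1", 0, [0.0] * DIM))
print(q.lease("w2", 5, now=900))     # w1's unfinished lease is now stale
```

The worker side stays dumb in this model too: it only ever sees `(id, text)` pairs and hands back vectors.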

How the vectors are actually used

Predicate alignment — candidates → adjudication → closure → folding

donto_suggest_alignments_semantic() runs an HNSW-accelerated nearest-neighbour scan. Real live output:

SET hnsw.ef_search = 100;
SELECT target_iri, sim FROM donto_suggest_alignments_semantic('someMembersMurderedBy', 0.80, 8);

 hadMembersKilled    0.9322
 murderedByGroup     0.9042
 wasMurderedByGroup  0.8991
 someMembersDied     0.8990
 killedOneMemberOf   0.8850 …

A hybrid generator combines this with trigram-lexical similarity (catching morphological variants like bornIn/wasBornIn that embeddings alone might rank lower). Production floors to register a candidate: semantic ≥ 0.82, combined ≥ 0.88. The daemon's LLM step adjudicates candidates into accepted alignments, which are flattened into donto_predicate_closure (~1.01M rows; ~9.8K real expansion edges: close_match, exact_equivalent, inverse_equivalent, sub_property_of). At query time, donto_match_aligned() returns direct matches plus everything reachable through the closure, re-orienting inverse-swapped rows, so a claim stored under ex:workplace answers a query about ex:employer. This is the emit-free / defer-joining promise made real. Details in alignment & identity.
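
The closure lookup with inverse re-orientation can be illustrated in a few lines. This is a toy model, not donto_match_aligned() itself, which is a SQL function over donto_predicate_closure. The `CLOSURE` dict shape, the `Claim` type, the returned tuple layout and the predicate `ex:employs` are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Claim(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Queried predicate -> [(stored predicate, relation)], already flattened.
CLOSURE: Dict[str, List[Tuple[str, str]]] = {
    "ex:employer": [
        ("ex:workplace", "close_match"),
        ("ex:employs", "inverse_equivalent"),
    ],
}


def match_aligned(
    claims: List[Claim], predicate: str
) -> List[Tuple[str, str, str]]:
    """Return (subject, object, stored_predicate) for direct and aligned matches."""
    expansions = dict(CLOSURE.get(predicate, []))
    expansions.setdefault(predicate, "self")  # direct matches always count
    out = []
    for claim in claims:
        relation = expansions.get(claim.predicate)
        if relation is None:
            continue
        if relation == "inverse_equivalent":
            # Swap the ends so the row reads in the queried direction.
            out.append((claim.obj, claim.subject, claim.predicate))
        else:
            out.append((claim.subject, claim.obj, claim.predicate))
    return out


claims = [
    Claim("ex:ajax", "ex:workplace", "ex:acme"),
    Claim("ex:acme", "ex:employs", "ex:bo"),
    Claim("ex:ajax", "ex:hobby", "ex:chess"),
]
print(match_aligned(claims, "ex:employer"))
```

Nothing in the stored claims is rewritten; the expansion happens purely at read time, which is the point of deferring the join.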

Entity identity candidates — never merges

SELECT target_iri, sim FROM donto_suggest_entity_matches('ex:caroline-rose-brown', 0.75, 6);

 ex:caroline-rose-davis   0.8483   <- same person, married name
 ex:caroline              0.8227
 ctx:genealogy/…/f7bc7b…  0.8222 …

The fingerprint approach working as designed: differently-spelled records of one person surface because what is said about them overlaps. Accepted pairs become governed identity proposals — identity stays a reversible hypothesis; nothing is ever merged.

Memory hybrid recall — the load-bearing arm

Recall fuses a lexical arm (FTS) and a vector arm (these embeddings, holder-scoped) with Reciprocal Rank Fusion (k=60). Measured on LongMemEval: adding the vector arm lifted overall hit@10 from 0.85 → 0.98 — and rescued exactly the question types where lexical recall is blind (preference questions 0.38 → 0.88, assistant-knowledge 0.62 → 1.00). The full recall pipeline is in how it solves things.
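
Reciprocal Rank Fusion itself is small enough to show in full: each arm contributes 1 / (k + rank) per result, and the sums are sorted. The sketch below uses k=60 as stated above; the function name, the chunk ids and the tie-break by id are illustrative assumptions, not donto's recall code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

RRF_K = 60


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = RRF_K) -> List[str]:
    """Fuse several ranked id lists; ranks are 1-based, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["c1", "c2", "c3"]   # FTS arm
vector = ["c4", "c2", "c1"]    # embedding arm
print(rrf_fuse([lexical, vector]))
```

Because only ranks enter the formula, the two arms never need their raw scores (FTS rank versus cosine similarity) put on a common scale. A chunk found by only one arm, like `c4` here, still makes the fused list.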

Raw access

-- nearest entities, cosine distance via pgvector
SELECT e.iri, 1 - (e.embedding <=> (SELECT embedding FROM donto_entity_embedding
                                    WHERE iri = 'ex:caroline-rose-brown')) AS cos_sim
FROM donto_entity_embedding e
ORDER BY e.embedding <=> (SELECT embedding FROM donto_entity_embedding
                          WHERE iri = 'ex:caroline-rose-brown')
LIMIT 5;
-- the operand must be a scalar subquery/constant for the HNSW index to be used
