The embedding fabric
donto holds ~1M freely-minted predicates and ~4.1M distinct subjects, and reconciles none of them at write time. The embedding fabric is what makes that affordable: it turns predicates, entity fingerprints and memory text into 384-dimensional vectors so that killedBy can find assassinatedBy, and two differently-spelled records of the same person can find each other — by meaning, at query time.
Two rules shape everything
| rule | status | meaning |
|---|---|---|
| additive only | I3-safe | The fabric writes only to its own derived cache tables, never to donto_statement. Vectors propose similarity — alignment candidates, identity candidates, recall rankings — they never merge, retract or overwrite anything. |
| no brittle logic | non-negotiable | No synonym tables, no predicate-variant dicts, no string-case ladders anywhere. The code does structural tokenization of IRIs; the model supplies all semantics; salience weights are learned from the corpus (IDF), not curated. |
The model: BAAI/bge-small-en-v1.5
Every component pins the same model — 384 dimensions, fastembed/ONNX runtime (no PyTorch, ~130MB), normalized geometry matching vector_cosine_ops. Why this one: 384d keeps million-row pgvector tables + HNSW indexes viable on a single Postgres box (the predicate table is 3.6GB including its index), and it's strong enough for synonym folding — measured: murdered → killed at cosine 0.947.
Throughput realities (all measured): the 4-core box does ~100 vectors/min; speed scales inversely with text length (400 chars → ~21/s, 3,000 → ~2.7/s), which is why every text builder truncates aggressively. There is no free 10× on CPU — a single GPU does ~2,000–6,000/min ≈ a 50–100-core fleet, which is why the distributed fabric exists.
What exactly gets embedded
The cardinal detail: vectors are only as good as the text fed to the model, and each of the three targets builds that text differently. All three share one primitive — humanize(): strip the CURIE/URL prefix, split camelCase/kebab/snake/digit boundaries, lowercase. Pure structure, no domain knowledge:
'ex:assassinatedBy' -> 'assassinated by'
'partOfColonialFrontierMassacres' -> 'part of colonial frontier massacres'
'flesch-kincaid_grade' -> 'flesch kincaid grade'
'ex:caroline-rose-brown' -> 'caroline rose brown'
1 · Predicates — the meaning of a relation
build_embed_text(): prefer the human-authored label/description when present, then always append the humanized IRI so the structural signal survives when they're NULL — the overwhelming case for LLM-minted predicates.
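A minimal sketch of the two text builders described so far, assuming nothing beyond the prose above. The function names come from the text, but the bodies, the ` | ` separator and the 400-character cap are illustrative assumptions, not the production code. It reproduces the four humanize() examples shown earlier and keeps the humanized IRI intact when truncating.

```python
import re


def humanize(iri: str) -> str:
    """Structural tokenization only: no synonym tables, no domain knowledge."""
    # Strip a URL prefix (keep what follows the last '/' or '#'),
    # then a CURIE prefix (keep what follows the last ':').
    local = re.split(r"[/#]", iri)[-1].rsplit(":", 1)[-1]
    # camelCase boundary: a lowercase letter followed by an uppercase letter.
    local = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local)
    # Letter/digit boundaries in either direction.
    local = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", local)
    # kebab-case, snake_case and stray whitespace collapse to single spaces.
    return re.sub(r"[-_\s]+", " ", local).strip().lower()


def build_embed_text(iri, label=None, description=None, sep=" | ", cap=400):
    """Authored text first (when present), humanized IRI always appended.

    `sep` and `cap` are hypothetical values chosen for this sketch.
    """
    tail = humanize(iri)
    authored = " ".join(p.strip() for p in (label, description) if p and p.strip())
    if not authored:
        # The common case for LLM-minted predicates: label/description are NULL.
        return tail[:cap]
    # Truncate the authored part, never the structural tail.
    room = max(cap - len(tail) - len(sep), 0)
    head = authored[:room].rstrip()
    return f"{head}{sep}{tail}" if head else tail[:cap]
```

Truncating the authored text rather than the tail is a design choice of this sketch: it guarantees the structural signal survives even under a long description.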
2 · Entities — a fingerprint of what is said about them
An entity is embedded by what is said about it, not just its name — so two records of the same person with different spellings land close because their described relations overlap. The signature: the name line, up to 3 pinned rdf:type lines, then up to 14 salient relations ranked by data-driven IDF (from the materialized view donto_predicate_idf): a predicate on nearly every subject carries no information and sinks; a rare discriminating relation floats up. Predicates minted once for one subject are demoted — they describe but can never match. Capped at 600 chars (short text embeds ~3× faster with no measured loss of matching power). A real live signature:
[SIG] ex:caroline-rose-brown
Brown | needs cross reference with caroline birth registration |
possible father brown roberts | previous identified as Caroline Molloy |
previous identified as Caroline Davis | is progeny of brown roberts |
has death date 1975-07-06 | has birth date 1887-06-10 |
is sister of david molloy | has alias kitchay | has birth year 1887
Embedding is tiered by richness (most of the 4.1M subjects are 1-statement singletons with near-zero identity value): the daemon backfills richest-first through descending minimum-statement tiers (50 → 20 → 10 → 5). A sig_hash per entity means re-runs only re-embed entities whose signature actually changed.
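The signature assembly and the sig_hash skip can be sketched as follows. This is a hedged illustration: the limits (3 types, 14 relations, 600 chars) and the sha256 hash come from the text, while the function names, the `log(N / df)` form of IDF, the negative score used to demote single-use predicates, and the ` | ` separator are assumptions. In production the document frequencies would come from donto_predicate_idf; here they are passed in as a plain dict.

```python
import hashlib
import math


def build_signature(name, types, relations, subject_counts, n_subjects,
                    max_types=3, max_relations=14, cap=600):
    """Return (signature_text, sig_hash).

    relations: list of (predicate_text, object_text), already humanized.
    subject_counts: predicate_text -> number of distinct subjects using it.
    """
    def salience(pred):
        df = subject_counts.get(pred, 1)
        if df <= 1:
            # Minted once for one subject: it describes but can never match,
            # so it sinks below every shared predicate (assumed demotion rule).
            return -1.0
        # Near-universal predicates score ~0; rare shared ones score high.
        return math.log(n_subjects / df)

    # sorted() is stable, so ties keep their input order.
    ranked = sorted(relations, key=lambda r: salience(r[0]), reverse=True)
    lines = [name, *types[:max_types]]
    lines += [f"{pred} {obj}" for pred, obj in ranked[:max_relations]]
    text = " | ".join(lines)[:cap]
    return text, hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reembed(new_hash, stored_hash):
    """Re-embed only when the signature text actually changed."""
    return new_hash != stored_hash
```

Because the hash is taken over the final, truncated text, a change that falls outside the 600-character window does not trigger a re-embed in this sketch.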
3 · Memory chunks and semantic claims
Episodic chunks embed their raw prose, truncated to CHUNK_TEXT_CAP = 1200 chars (~300 tokens) — a measured cap: raising it to 2048 gave zero recall gain on LongMemEval while costing ~33% throughput; the signal sits in the first ~300 tokens. Semantic claims embed a humanized rendering of the triple — "ajax occupation software engineer" — which is what lets "where does he work?" reach an ex:occupation engineer claim that shares no substring with the query.
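A rough sketch of the two memory text builders, under stated assumptions: CHUNK_TEXT_CAP and the example rendering are from the text, but the function names and the `_words` helper (a simplified stand-in for humanize()) are hypothetical.

```python
import re

CHUNK_TEXT_CAP = 1200  # chars, roughly 300 tokens (value from the text)


def chunk_embed_text(prose: str) -> str:
    """Episodic chunks embed raw prose, truncated to the measured cap."""
    return prose[:CHUNK_TEXT_CAP]


def _words(term: str) -> str:
    # Simplified stand-in for humanize(): drop the prefix, split camelCase
    # and kebab/snake boundaries, lowercase. It assumes literal objects
    # contain no ':' or '/' characters, which real data may violate.
    local = re.split(r"[/#:]", term)[-1]
    local = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local)
    return re.sub(r"[-_\s]+", " ", local).strip().lower()


def claim_embed_text(subject: str, predicate: str, obj: str) -> str:
    """Semantic claims embed a humanized rendering of the whole triple."""
    return " ".join(_words(t) for t in (subject, predicate, obj))
```

For example, `claim_embed_text("ex:ajax", "ex:occupation", "software engineer")` yields `"ajax occupation software engineer"`, the rendering quoted above.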
Storage — three derived-cache tables
| table | size | notes |
|---|---|---|
| donto_predicate_embedding | 994K rows · 3.6GB | iri PK · embedding vector(384) · model · updated_at. No staleness hash needed — predicate text effectively never mutates. |
| donto_entity_embedding | 385K rows · 1.7GB | Adds sig_hash — sha256 of the signature text, so re-runs skip unchanged entities. |
| donto_x_memory_chunk_embedding | 175K rows · 644MB | Keyed by statement_id (FK, no ON DELETE: the substrate never deletes; retracted statements' vectors are excluded by the upper(tx_time) IS NULL join every reader does). Also carries record_iri and a denormalized holder_iri. The latter was added when holder-scoping the vector arm via a join into the HNSW index blew the 10s timeout at 41K-chunk-holder scale; with a btree on holder_iri, the planner does an exact distance sort over just the holder's vectors. |
HNSW indexes use pgvector defaults (m=16, ef_construction=64). Query-time recall breadth: hnsw.ef_search defaults to 40; the alignment batch jobs set 100 at session level — deliberately not inside the SQL functions, because a composable function must not mutate GUCs.
Who computes the vectors
Two cooperating producers write the same tables with byte-identical text builders (the coordinator literally imports the daemon's modules: single source of truth, zero drift). A third component, the query-time embedder, turns recall queries into vectors at read time:
| component | runs at | role |
|---|---|---|
| the alignment daemon | local, continuous | Every ~5min tick: embed missing predicates + entities (tier ladder), propose alignment/identity candidates, LLM-adjudicate small cap-aware batches, periodically rebuild the closure. Load-gated (skips when the box is busy); single-instance via flock + a Postgres advisory lock; upsert batches sorted by IRI so concurrent ON CONFLICT writers can't deadlock. |
| the distributed coordinator | :7930 · donto.org/embed | Remote workers (any machine, CPU or GPU) lease texts, embed locally, submit vectors — the worker is dumb and never touches the DB. FOR UPDATE SKIP LOCKED leases (no double work, ~15min stale-lease reclaim), a self-topping queue (refills toward 100K every 2min), dimension-validated submits, and worker error telemetry (/embed/report, surfaced on the admin console). Anyone can contribute via donto.org/help. |
| the query-time embedder | :7902 | A tiny service that loads bge-small once and turns recall queries into vectors in milliseconds. Strictly optional: any failure → recall degrades gracefully to FTS-only. |
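
The coordinator's lease semantics can be illustrated with an in-memory stand-in. This is not the coordinator's code: the real service uses FOR UPDATE SKIP LOCKED against Postgres, whereas this sketch swaps in a plain dict to show the same behaviour (no double work, stale-lease reclaim after about 15 minutes, dimension-validated submits). The class and method names are hypothetical.

```python
import math

DIM = 384                 # bge-small-en-v1.5 output width (from the text)
STALE_AFTER_S = 15 * 60   # approximate stale-lease reclaim window (from the text)


class LeaseQueue:
    """In-memory stand-in for the coordinator's lease table."""

    def __init__(self, texts):
        # key -> record; leased_at is None until a worker takes the item.
        self.items = {
            key: {"text": text, "leased_at": None, "vector": None}
            for key, text in texts.items()
        }

    def lease(self, n, now):
        """Hand out up to n items that are neither done nor freshly leased."""
        out = []
        if n <= 0:
            return out
        for key in sorted(self.items):  # deterministic order
            item = self.items[key]
            if item["vector"] is not None:
                continue  # already embedded
            leased_at = item["leased_at"]
            if leased_at is not None and now - leased_at < STALE_AFTER_S:
                continue  # another worker holds a live lease: skip, don't wait
            item["leased_at"] = now
            out.append((key, item["text"]))
            if len(out) == n:
                break
        return out

    def submit(self, key, vector):
        """Accept a vector only if it has the pinned dimension and finite values."""
        if key not in self.items:
            return False
        if len(vector) != DIM or not all(math.isfinite(x) for x in vector):
            return False
        self.items[key]["vector"] = list(vector)
        return True
```

A worker that crashes mid-lease simply never submits; once the window elapses, its items become leasable again.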
How the vectors are actually used
Predicate alignment — candidates → adjudication → closure → folding
donto_suggest_alignments_semantic() runs an HNSW-accelerated nearest-neighbour scan. Real live output:
SET hnsw.ef_search = 100;
SELECT target_iri, sim FROM donto_suggest_alignments_semantic('someMembersMurderedBy', 0.80, 8);
hadMembersKilled 0.9322
murderedByGroup 0.9042
wasMurderedByGroup 0.8991
someMembersDied 0.8990
killedOneMemberOf 0.8850 …
A hybrid generator combines this with trigram-lexical similarity (catching morphological variants like bornIn/wasBornIn that embeddings alone might rank lower); production floors: semantic ≥ 0.82, combined ≥ 0.88 to register a candidate. The daemon's LLM step adjudicates candidates into accepted alignments, which are flattened into donto_predicate_closure (~1.01M rows; ~9.8K real expansion edges: close_match, exact_equivalent, inverse_equivalent, sub_property_of). At query time, donto_match_aligned() returns direct matches plus everything reachable through the closure, re-orienting inverse-swapped rows, so a claim stored under ex:workplace answers a query about ex:employer. The emit-free / defer-joining promise, made real. Details in alignment & identity.
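The query-time expansion through the closure can be sketched in a few lines. This is an illustrative model only, not donto_match_aligned() itself: the closure is represented as a dict rather than the donto_predicate_closure table, the relation names are the ones listed above, and ex:employs is a hypothetical inverse predicate invented for the example.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match_aligned(statements, closure, query_predicate):
    """Return (statement, relation) pairs answering query_predicate.

    closure: {query_predicate: [(stored_predicate, relation), ...]}
    Direct matches are tagged "direct"; rows reached through an
    inverse_equivalent edge come back with subject and object swapped.
    """
    expansions = {query_predicate: "direct"}
    for stored, relation in closure.get(query_predicate, []):
        expansions.setdefault(stored, relation)

    results = []
    for st in statements:
        relation = expansions.get(st.predicate)
        if relation is None:
            continue  # predicate not reachable through the closure
        if relation == "inverse_equivalent":
            # Re-orient so the row reads in the query predicate's direction.
            st = Statement(st.obj, st.predicate, st.subject)
        results.append((st, relation))
    return results


statements = [
    Statement("ex:ajax", "ex:workplace", "ex:acme"),
    Statement("ex:acme", "ex:employs", "ex:bob"),   # hypothetical inverse
    Statement("ex:ajax", "ex:likes", "ex:tea"),
]
closure = {"ex:employer": [("ex:workplace", "close_match"),
                           ("ex:employs", "inverse_equivalent")]}
hits = match_aligned(statements, closure, "ex:employer")
```

Nothing in the stored statements is rewritten; the alignment is applied only while reading, which is what keeps the approach additive.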
Entity identity candidates — never merges
SELECT target_iri, sim FROM donto_suggest_entity_matches('ex:caroline-rose-brown', 0.75, 6);
ex:caroline-rose-davis 0.8483 <- same person, married name
ex:caroline 0.8227
ctx:genealogy/…/f7bc7b… 0.8222 …
The fingerprint approach working as designed: differently-spelled records of one person surface because what is said about them overlaps. Accepted pairs become governed identity proposals; identity stays a reversible hypothesis, and nothing is ever merged.
Memory hybrid recall — the load-bearing arm
Recall fuses a lexical arm (FTS) and a vector arm (these embeddings, holder-scoped) with Reciprocal Rank Fusion (k=60). Measured on LongMemEval: adding the vector arm lifted overall hit@10 from 0.85 → 0.98 — and rescued exactly the question types where lexical recall is blind (preference questions 0.38 → 0.88, assistant-knowledge 0.62 → 1.00). The full recall pipeline is in how it solves things.
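Reciprocal Rank Fusion itself is small enough to show in full. The sketch below is the standard formulation, score(d) = Σ 1 / (k + rank), with k=60 as stated above; the function name and the alphabetical tie-break are assumptions of this example, not details taken from the recall pipeline.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    An id missing from a list simply gets no contribution from it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical_arm = ["a", "b", "c"]   # FTS order
vector_arm = ["c", "a", "d"]    # embedding order
fused = reciprocal_rank_fusion([lexical_arm, vector_arm])
# fused == ['a', 'c', 'b', 'd']
```

Because only ranks enter the formula, the two arms never need comparable score scales, which is why FTS scores and cosine similarities can be fused directly.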
Raw access
-- nearest entities, cosine distance via pgvector
SELECT e.iri, 1 - (e.embedding <=> (SELECT embedding FROM donto_entity_embedding
WHERE iri = 'ex:caroline-rose-brown')) AS cos_sim
FROM donto_entity_embedding e
ORDER BY e.embedding <=> (SELECT embedding FROM donto_entity_embedding
WHERE iri = 'ex:caroline-rose-brown')
LIMIT 5;
-- the operand must be a scalar subquery/constant for the HNSW index to be used
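For readers checking results outside the database, the relationship used in the query above (`1 - (a <=> b)` as cosine similarity, since pgvector's `<=>` is cosine distance) can be reproduced in plain Python. The helper names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, as the pinned model's output is."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Same quantity as pgvector's <=> operator: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
cos_sim = 1.0 - cosine_distance(a, b)   # 24 / (5 * 5) = 0.96

# On unit vectors the dot product alone already equals the cosine similarity.
ua, ub = l2_normalize(a), l2_normalize(b)
dot_unit = sum(x * y for x, y in zip(ua, ub))
```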