Truth at query time
Extractors freely mint predicates (bornIn, wasBornIn, birthplaceOf…) and entity IRIs. Nothing is deduplicated at write time. Four cooperating subsystems reconcile the abundance later — reversibly, and as data: a continuous daemon, a predicate closure that queries fold through, identity as a per-hypothesis resolution, and a contradiction layer where both sides stay live.
The continuous alignment daemon
donto-align-daemon.service runs a self-pacing tick (~10 min): embed missing predicates and entities (richest tier first) → propose alignment candidates (cheap, no LLM) → adjudicate small LLM batches (cap-aware, on the flat GLM lane) → periodically rebuild the closure and identity clusters → record a heartbeat and an append-only tick row.
| property | mechanism | meaning |
|---|---|---|
| single instance, twice over | flock + advisory lock | An exclusive flock plus a Postgres session advisory lock — even a deleted lock file can't yield two daemons. |
| load-aware | skips when busy | Checks system load, active backends, and whether an extraction is running; records load_skip and yields rather than competing. |
| cap-aware | structured backoff | LLM-quota caps are signalled by dedicated exit codes — never by substring-scanning logs (a confidence value containing '429' once false-tripped that). |
| DB-truth rebuild trigger | no in-memory counters | Pending-accept counts come from committed rows, so accepts from before a crash are never stranded. |
| I3-safe | additive only | Writes embeddings, candidates, reversible proposals, derived caches, and its own telemetry. Never touches donto_statement. |
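The flock half of the "single instance, twice over" row can be sketched as follows. This is a minimal illustration in Python, not the daemon's own code: the function name and the lock path are hypothetical, and the Postgres session advisory lock is deliberately left out because it needs a live database.

```python
import fcntl
import os
import tempfile


def try_acquire_single_instance(lock_path):
    """Return an open file holding an exclusive flock, or None if it is already held."""
    handle = open(lock_path, "a+")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The caller must keep this handle open; closing it releases the lock.
    return handle


if __name__ == "__main__":
    # Hypothetical lock path, used only for this demonstration.
    path = os.path.join(tempfile.gettempdir(), "align-daemon-demo.lock")
    first = try_acquire_single_instance(path)
    second = try_acquire_single_instance(path)
    print("first attempt got the lock:", first is not None)
    print("second attempt got the lock:", second is not None)
```

A flock is tied to the file it was taken on, so if the lock file is deleted, a new process can create a fresh file at the same path and lock that one successfully. That is the gap the table says the database-side advisory lock closes.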
Predicate alignment → closure → folding
Propose: hybrid lexical OR semantic
A candidate qualifies by clearing either the trigram bar (morphological variants: bornIn/wasBornIn) or the embedding-cosine bar (true synonyms with no shared characters: killedBy/murderedBy). The target must be more popular than the source — the rare freshly-minted variant aligns to the established term (e.g. rdfType at 6.7K uses → rdf:type at 4.1M), which also kills cycles. Production floors: semantic ≥ 0.82, combined ≥ 0.88.
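The "either bar" rule and the popularity direction can be sketched as below. This is an illustration under stated assumptions, not donto's proposer: the trigram function is a simplified set-overlap version in the spirit of pg_trgm, the 0.40 trigram bar is a made-up value (the source gives only the semantic floor of 0.82 and a combined floor of 0.88, which is not modelled here), and the usage counts and embeddings are toy data.

```python
import math


def trigrams(text):
    """Character trigrams of the lowercased string, padded the way pg_trgm pads a word."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def propose(source, target, usage, embedding, trigram_bar=0.40, semantic_bar=0.82):
    """Return a candidate dict if source should fold toward target, else None.

    trigram_bar is an assumed value; semantic_bar is the floor quoted in the text.
    """
    # Direction rule: only align toward the strictly more popular predicate.
    if usage[target] <= usage[source]:
        return None
    lexical = trigram_similarity(source, target)
    semantic = cosine(embedding[source], embedding[target])
    # Hybrid rule: clearing either bar is enough.
    if lexical >= trigram_bar or semantic >= semantic_bar:
        return {"source": source, "target": target,
                "lexical": lexical, "semantic": semantic}
    return None


# Toy data, invented for the example.
USAGE = {"wasBornIn": 12, "bornIn": 900, "murderedBy": 7, "killedBy": 40}
EMBEDDING = {
    "wasBornIn": [1.0, 0.0, 0.0],
    "bornIn": [0.0, 1.0, 0.0],
    "murderedBy": [0.90, 0.10, 0.40],
    "killedBy": [0.88, 0.15, 0.42],
}

if __name__ == "__main__":
    print(propose("wasBornIn", "bornIn", USAGE, EMBEDDING))    # qualifies lexically
    print(propose("murderedBy", "killedBy", USAGE, EMBEDDING))  # qualifies semantically
    print(propose("bornIn", "wasBornIn", USAGE, EMBEDDING))    # wrong direction: None
```

Because every proposed edge points from a less-used predicate to a strictly more-used one, following edges always increases usage, so no chain of proposals can loop back on itself.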
Adjudicate: an LLM types the relation, grounded in usage
Similarity can say two predicates are related; it cannot say which relation holds: direction, containment, or whether near-identical strings are in fact not equivalent. The adjudicator shows the LLM each predicate's labels plus up to 6 real (subject, object) usage pairs from the substrate, and records a verdict from a closed vocabulary: exact_equivalent · inverse_equivalent · sub_property_of · close_match · not_equivalent. Verdicts whose confidence falls below the 0.80 floor stay candidates; explicit negatives are recorded too, so look-alikes are never re-proposed. The full audit trail (similarities, generator, the model's reasoning) lives in the alignment row's provenance.
The ledger and its safety flags
donto_predicate_alignment (24,228 rows; bitemporal, append-only) carries three independent safety flags: safe_for_query_expansion (default true), safe_for_export (false), safe_for_logical_inference (false) — an alignment good enough to widen recall is not automatically good enough to export or reason over. Only accepted rows reach the closure.
The closure: a flat table queries can join
donto_predicate_closure (1.01M rows = one self row per predicate + ~9.8K real expansion edges) is rebuilt atomically — staged in a temp table, then swapped in one transaction, so readers never see a half-built closure. Each row says: a query for predicate A should also match statements stored under B, via this relation, at this confidence, swapping subject/object if it's an inverse.
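The shape of a closure row can be sketched as below: one self row per predicate plus expansion rows derived from accepted alignments. The field names are hypothetical, the sketch is single-hop only, and treating equivalences as matching in both directions is an assumption rather than something the text states.

```python
from collections import namedtuple

# Hypothetical field names: "a query for query_predicate should also match
# statements stored under stored_predicate".
ClosureRow = namedtuple(
    "ClosureRow", "query_predicate stored_predicate relation confidence swap")


def build_closure(predicates, accepted):
    """Build closure rows from accepted (source, target, relation, confidence) alignments."""
    # One self row per predicate, so a plain match is just the 'direct' case.
    rows = [ClosureRow(p, p, "direct", 1.0, False) for p in sorted(predicates)]
    for source, target, relation, confidence in accepted:
        if relation == "exact_equivalent":
            # Assumed symmetric: either name finds statements stored under the other.
            rows.append(ClosureRow(target, source, relation, confidence, False))
            rows.append(ClosureRow(source, target, relation, confidence, False))
        elif relation == "inverse_equivalent":
            # Same, but subject and object must be swapped when matching.
            rows.append(ClosureRow(target, source, relation, confidence, True))
            rows.append(ClosureRow(source, target, relation, confidence, True))
        elif relation == "sub_property_of":
            # source is the specific predicate; only the general query widens.
            rows.append(ClosureRow(target, source, relation, confidence, False))
        # Other verdicts add no expansion rows in this sketch.
    return rows


if __name__ == "__main__":
    preds = {"bornIn", "wasBornIn", "killed", "killedBy", "murdered"}
    accepted = [
        ("wasBornIn", "bornIn", "exact_equivalent", 0.95),
        ("killedBy", "killed", "inverse_equivalent", 0.95),
        ("murdered", "killed", "sub_property_of", 0.90),
    ]
    for row in build_closure(preds, accepted):
        print(row)
```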
Query-time folding — run live
donto_match_aligned() is the standard matcher plus closure expansion. A real fold, executed on the live database: ex:mrs-e-e-brackenridge has her birthplace stored under the freely-minted wasBornIn; querying the established bornIn returns both — nobody maintains a synonym table:
SELECT subject, predicate, object_iri, matched_via, alignment_confidence
FROM donto_match_aligned(p_subject := 'ex:mrs-e-e-brackenridge', p_predicate := 'bornIn');
ex:mrs-e-e-brackenridge | bornIn | ex:adelaide | direct | 1.00
ex:mrs-e-e-brackenridge | wasBornIn | ex:adelaide | exact_equivalent | 0.95
And the inverse-swap branch, asking a question in the opposite orientation to how the fact was stored:
-- stored: ex:robert-dawson killedBy ex:unruly-horse
SELECT * FROM donto_match_aligned(p_subject := 'ex:unruly-horse', p_predicate := 'killed');
ex:unruly-horse | killedBy | ex:robert-dawson | inverse_equivalent | 0.95
-- subject/object swapped back so the row reads in the caller's orientation
Other live folds: birthplaceOf ↔ bornIn (inverse), diedOf ↔ causeOfDeath (inverse), killed → murdered (sub-property: querying the general term also returns the specific), affiliated-with ↔ affiliatedWith. Alignment widens matching, never visibility; consumers still post-filter by their own scope.
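The two folds shown above, including the inverse swap, can be modelled in memory as follows. This is a sketch of the idea behind donto_match_aligned(), not its implementation: the function name, the tuple layouts, and the min_confidence parameter are hypothetical, and the closure rows are written by hand.

```python
# Statements as (subject, predicate, object); values taken from the examples in the text.
STATEMENTS = [
    ("ex:mrs-e-e-brackenridge", "bornIn", "ex:adelaide"),
    ("ex:mrs-e-e-brackenridge", "wasBornIn", "ex:adelaide"),
    ("ex:robert-dawson", "killedBy", "ex:unruly-horse"),
]

# Closure rows as (query_predicate, stored_predicate, relation, confidence, swap).
CLOSURE = [
    ("bornIn", "bornIn", "direct", 1.0, False),
    ("bornIn", "wasBornIn", "exact_equivalent", 0.95, False),
    ("killed", "killed", "direct", 1.0, False),
    ("killed", "killedBy", "inverse_equivalent", 0.95, True),
]


def match_aligned(statements, closure, subject, predicate, min_confidence=0.0):
    """Match (subject, predicate) directly and through every closure row for predicate."""
    results = []
    for query_pred, stored_pred, relation, confidence, swap in closure:
        if query_pred != predicate or confidence < min_confidence:
            continue
        for s, p, o in statements:
            if p != stored_pred:
                continue
            if swap:
                # Inverse relation: flip the stored row into the caller's orientation.
                s, o = o, s
            if s == subject:
                results.append((s, stored_pred, o, relation, confidence))
    return results


if __name__ == "__main__":
    for row in match_aligned(STATEMENTS, CLOSURE, "ex:mrs-e-e-brackenridge", "bornIn"):
        print(row)
    for row in match_aligned(STATEMENTS, CLOSURE, "ex:unruly-horse", "killed"):
        print(row)
```

Raising min_confidence to 0.99 in the first call would keep only the direct row, which is one way a caller could trade recall for certainty.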
Identity as hypothesis — never a merge
donto never merges entities. "Same referent" is data: a reversible proposal → a governed pairwise edge → a cluster cache keyed by hypothesis. Which entities exist depends on which identity policy you query under.
| table | rows | role |
|---|---|---|
| donto_identity_proposal | 221 rows | The reversible front door: same_as, different_from, merge_candidate, split_candidate, alias_of… with method (human/rule/model/registry/cross-source), confidence, and a status history. |
| donto_identity_edge | 124 rows | The asserted pairwise layer: same_referent · possibly_same_referent · distinct_referent · not_enough_information. Bitemporal; retraction closes the range. |
| donto_identity_hypothesis | 13 rows | Named identity policies with clustering thresholds — live: strict (0.98), likely (0.85), exploratory (0.60), plus human curation hypotheses for specific genealogy disambiguations. |
| donto_identity_cluster_cache | 1,026 rows | The derived per-hypothesis resolution: connected components over same_referent edges above the hypothesis's threshold; rep = min symbol id; invalidated by trigger on any edge change. |
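The cluster cache row describes connected components over same_referent edges, with the smallest symbol id as representative. A minimal union-find sketch of that idea follows. The integer symbol ids are invented for the example, and treating an edge exactly at the threshold as qualifying is an assumption.

```python
def resolve_clusters(edges, threshold):
    """Map every symbol id to its cluster representative under one hypothesis.

    edges: iterable of (id_a, id_b, relation, confidence).
    Only same_referent edges at or above the threshold join clusters.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving; roots are unchanged
            x = parent[x]
        return x

    for a, b, relation, confidence in edges:
        root_a, root_b = find(a), find(b)
        if relation == "same_referent" and confidence >= threshold and root_a != root_b:
            # Always keep the smaller id as root, so the root is the component minimum.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {node: find(node) for node in parent}


if __name__ == "__main__":
    # Hypothetical symbol ids: 101 = ex:traves-theberge, 102 = ex:traves_theberge.
    edges = [(101, 102, "same_referent", 0.95)]
    for name, threshold in [("strict", 0.98), ("likely", 0.85), ("exploratory", 0.60)]:
        print(name, resolve_clusters(edges, threshold)[102])
```

The same edge list yields a different set of entities under each threshold, and no stored row changes between runs: only the derived mapping does.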
A real identity, resolved three ways (live)
Discord ingestion minted both ex:traves-theberge and ex:traves_theberge (hyphen vs underscore). Fingerprint embeddings put the pair at cosine 0.84; LLM adjudication judged same_referent @ 0.95; the proposal was accepted and became identity edge #124. Resolving the underscore IRI under each hypothesis:
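SELECT h.name, h.threshold_same,
       donto_identity_resolve_iri(h.hypothesis_id, 'ex:traves_theberge')
FROM donto_identity_hypothesis h  -- FROM clause restored; assumes h is the hypothesis table named above
WHERE h.name IN ('strict', 'likely', 'exploratory');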
strict | 0.98 | ex:traves_theberge -- 0.95 < 0.98: NOT merged under strict
likely | 0.85 | ex:traves-theberge -- clustered: resolves to the representative
exploratory | 0.60 | ex:traves-theberge
[…] strict) restores two entities.
The contradiction machinery
Paraconsistency is three mechanics, none of which delete anything:
| mechanic | rather than | meaning |
|---|---|---|
| polarity is data | not deletion | A claim and its negation are two live rows — flags carry asserted/negated/absent/unknown. |
| conflict is an edge | not an invalidation | donto_argument (2,433 rows; 2,225 rebuts) links incompatible claims with typed edges: supports, rebuts, undercuts, qualifies, explains, alternative_analysis_of, same_evidence_different_analysis, supersedes… Both sides keep matching queries. |
| re-ranking | not resolution | donto_paraconsistency_density (140,675 rows) pre-aggregates contestedness per subject/predicate window — a Shannon-entropy conflict score — so read paths can rank by it without an O(N²) scan. Retraction exists, but it's an explicit governance act, never an automatic consequence of conflict. |
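A Shannon-entropy conflict score can be illustrated as below. The source does not give the exact formula donto uses, so this is one plausible reading: the entropy, in bits, of the distribution of asserted values within a single subject/predicate window. The function name is hypothetical.

```python
import math
from collections import Counter


def conflict_entropy(values):
    """Shannon entropy (bits) of the asserted values in one subject/predicate window.

    0.0 means every assertion agrees; higher means more evenly split disagreement.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probabilities = [count / total for count in counts.values()]
    return sum(p * math.log2(1 / p) for p in probabilities)


if __name__ == "__main__":
    print(conflict_entropy(["1850", "1850"]))          # full agreement
    print(conflict_entropy(["1850", "1852"]))          # an even two-way split
    print(conflict_entropy(["1850", "1850", "1852"]))  # a lopsided split scores lower
```

Storing such a score per window is what lets a read path sort by contestedness without comparing every pair of claims at query time.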
Where the rebuts edges come from
An epistemic sweep marks genealogy predicates functional (one true value: ex:birthYear, ex:gender, ex:birthPlace…), finds subjects with multiple distinct asserted values, and creates rebuts pairs in a dedicated context — bounded scans, idempotent inserts. Some detected conflicts (gender female vs Female vs F) are really value-normalization gaps — exactly what the value-mapping and literal-canonicalization tables are built for as they come online.
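The sweep described above can be sketched as follows. This is a simplified illustration, not donto's sweep: statement ids and the function name are hypothetical, each distinct value is represented by the first statement that asserted it, and a Python set stands in for idempotent inserts.

```python
from itertools import combinations


def sweep_functional_conflicts(statements, functional_predicates, existing_pairs=()):
    """Return the set of rebuts pairs after one sweep.

    statements: iterable of (statement_id, subject, predicate, value).
    Pairs are stored as (lower_id, higher_id), so re-running adds nothing new.
    """
    pairs = set(existing_pairs)
    first_statement = {}  # (subject, predicate) -> {value: first statement id}
    for stmt_id, subject, predicate, value in statements:
        if predicate in functional_predicates:
            first_statement.setdefault((subject, predicate), {}).setdefault(value, stmt_id)
    for by_value in first_statement.values():
        # More than one distinct value for a functional predicate is a conflict.
        for a, b in combinations(sorted(by_value.values()), 2):
            pairs.add((a, b))
    return pairs


# Toy statements, invented for the example.
STATEMENTS = [
    (1, "ex:alice", "ex:birthYear", "1850"),
    (2, "ex:alice", "ex:birthYear", "1852"),
    (3, "ex:alice", "ex:gender", "female"),
    (4, "ex:alice", "ex:gender", "Female"),
    (5, "ex:alice", "ex:gender", "F"),
    (6, "ex:alice", "ex:knows", "ex:bob"),
    (7, "ex:alice", "ex:knows", "ex:carol"),
]
FUNCTIONAL = {"ex:birthYear", "ex:gender"}

if __name__ == "__main__":
    found = sweep_functional_conflicts(STATEMENTS, FUNCTIONAL)
    print(sorted(found))
    # Running again over the same data leaves the set unchanged.
    print(sweep_functional_conflicts(STATEMENTS, FUNCTIONAL, found) == found)
```

The three gender spellings produce three rebuts pairs even though they agree in meaning, which is the value-normalization gap the paragraph mentions.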
SELECT predicate, total_score, windows
FROM donto_v_top_contested_predicates ORDER BY total_score DESC LIMIT 5;
rdf:type | 49715.9 | 53713
rdfs:label | 14155.4 | 16182
rdfType | 1774.2 | 1999
locatedIn | 1246.2 | 1367
interrogationInterrogator | 842.6 | 846
Contestedness is the steering wheel: it tells the discovery lenses where reality is disputed and worth another look.
The whole loop in one sentence
Extractors emit free-typed claims → the fabric puts every predicate and salient entity in vector space → the hybrid proposer nominates folds toward the more-established term → LLM adjudication types the relation against real usage → accepted alignments compile into the closure by atomic swap → donto_match_aligned folds synonyms and inverses at read time at the caller's confidence floor — while identity stays a per-hypothesis resolution and contradictions stay live, linked, and measured rather than resolved. See it feed real answers in how it solves things.