
Truth at query time

Extractors freely mint predicates (bornIn, wasBornIn, birthplaceOf…) and entity IRIs. Nothing is deduplicated at write time. Four cooperating subsystems reconcile the abundance later — reversibly, and as data: a continuous daemon, a predicate closure that queries fold through, identity as a per-hypothesis resolution, and a contradiction layer where both sides stay live.

alignments accepted: 18,983 · closure rows: 1.01M (~9.8K folds) · identity proposals: 221 · identity hypotheses: 13 · argument edges: 2,433 · contestedness windows: 140K

The continuous alignment daemon

donto-align-daemon.service runs a self-pacing tick (~10min): embed missing predicates + entities (richest tier first) → propose alignment candidates (cheap, no LLM) → adjudicate small LLM batches (cap-aware, on the flat GLM lane) → periodically rebuild the closure + identity clusters → record a heartbeat and an append-only tick row.

what makes it rock-solid

property	mechanism	meaning
single instance, twice over	flock + advisory lock	An exclusive flock plus a Postgres session advisory lock — even a deleted lock file can't yield two daemons.
load-aware	skips when busy	Checks system load, active backends, and whether an extraction is running; records load_skip and yields rather than competing.
cap-aware	structured backoff	LLM-quota caps are signalled by dedicated exit codes, never by substring-scanning logs (a confidence value containing '429' once tripped the log scanner by mistake).
DB-truth rebuild trigger	no in-memory counters	Pending-accept counts come from committed rows, so accepts from before a crash are never stranded.
I3-safe	additive only	Writes embeddings, candidates, reversible proposals, derived caches, and its own telemetry. Never touches donto_statement.
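
The cap-aware row is easy to show in miniature. The sketch below is not the daemon's code: `BatchResult`, `EXIT_QUOTA_CAP` and its value 75 are hypothetical stand-ins, since the real exit codes are not listed here. It only contrasts a substring scan with a dedicated exit code on a log line that happens to contain the digits 429.

```python
from dataclasses import dataclass

EXIT_OK = 0
EXIT_QUOTA_CAP = 75  # hypothetical value; the daemon's real codes are not documented here


@dataclass
class BatchResult:
    """Stand-in for the outcome of one adjudication batch (hypothetical type)."""
    returncode: int
    log: str


def capped_by_log_scan(result: BatchResult) -> bool:
    """Fragile: any '429' anywhere in the log looks like a rate limit."""
    return "429" in result.log


def capped_by_exit_code(result: BatchResult) -> bool:
    """Structured: only the dedicated exit code signals a quota cap."""
    return result.returncode == EXIT_QUOTA_CAP


# A healthy batch whose log contains a confidence value with the digits 429.
healthy = BatchResult(EXIT_OK, "verdict=close_match confidence=0.9429")
# A batch that really hit the quota and reports it through its exit code.
capped = BatchResult(EXIT_QUOTA_CAP, "provider quota reached, backing off")

for name, result in (("healthy", healthy), ("capped", capped)):
    print(name, capped_by_log_scan(result), capped_by_exit_code(result))
# healthy True False   <- the log scan false-trips, the exit code does not
# capped False True
```

The exit code carries the signal on its own channel, so nothing a model writes into its reasoning or scores can imitate it.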

Predicate alignment → closure → folding

Propose: hybrid lexical OR semantic

A candidate qualifies by clearing either the trigram bar (morphological variants: bornIn/wasBornIn) or the embedding-cosine bar (true synonyms with little lexical overlap: killedBy/murderedBy). The target must be more popular than the source, so the rare freshly-minted variant aligns to the established term (e.g. rdfType at 6.7K uses → rdf:type at 4.1M). Because every alignment points toward a strictly more popular predicate, no cycle can form. Production floors: semantic ≥ 0.82, combined ≥ 0.88.
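
A minimal sketch of the either-bar rule with the popularity guard. Assumptions: the trigram measure is a plain Jaccard overlap of padded character trigrams, `TRIGRAM_FLOOR` is a made-up value (the lexical bar is not stated here), and the usage counts and three-dimensional "embeddings" are toy data. Only the 0.82 semantic floor comes from the text above.

```python
import math

TRIGRAM_FLOOR = 0.40   # hypothetical; the lexical bar is not stated in this document
SEMANTIC_FLOOR = 0.82  # production semantic floor quoted above


def trigrams(name: str) -> set[str]:
    """Character trigrams of a lowercased, space-padded name."""
    padded = f"  {name.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def propose(source: str, target: str, usage: dict, embedding: dict):
    """Return the generator that qualified the pair ('lexical' or 'semantic'), or None."""
    # Direction guard: only fold a rarer predicate into a strictly more popular one.
    if usage[target] <= usage[source]:
        return None
    if trigram_similarity(source, target) >= TRIGRAM_FLOOR:
        return "lexical"
    if cosine(embedding[source], embedding[target]) >= SEMANTIC_FLOOR:
        return "semantic"
    return None


# Toy data, invented for illustration only.
usage = {"wasBornIn": 120, "bornIn": 5000, "killedBy": 40, "murderedBy": 300}
embedding = {
    "wasBornIn": [0.1, 0.9, 0.2], "bornIn": [0.1, 0.8, 0.3],
    "killedBy": [0.9, 0.1, 0.4], "murderedBy": [0.8, 0.2, 0.5],
}

print(propose("wasBornIn", "bornIn", usage, embedding))     # lexical
print(propose("killedBy", "murderedBy", usage, embedding))  # semantic
print(propose("bornIn", "wasBornIn", usage, embedding))     # None (wrong direction)
```

In this toy, killedBy/murderedBy share only the trailing "edBy" trigrams (similarity about 0.18), so the pair reaches the proposer through the embedding bar alone.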

Adjudicate: an LLM types the relation, grounded in usage

Similarity can say two predicates are related; it cannot say the relation type: direction, containment, or that near-identical strings are not equivalent. The adjudicator shows the LLM each predicate's labels plus up to 6 real (subject, object) usage pairs from the substrate, and records a verdict from a closed vocabulary: exact_equivalent · inverse_equivalent · sub_property_of · close_match · not_equivalent. Verdicts whose confidence falls below the 0.80 floor stay candidates; explicit negatives are recorded too, so look-alikes are never re-proposed. The full audit trail (similarities, generator, the model's reasoning) lives in the alignment row's provenance.
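
One way to picture the verdict handling, as a sketch under stated assumptions: the `Verdict` type, the `triage` function and the status name `rejected` are hypothetical, and treating a low-confidence `not_equivalent` as a candidate rather than a recorded negative is a guess. The closed vocabulary and the 0.80 floor are taken from the paragraph above.

```python
from dataclasses import dataclass

# Closed verdict vocabulary, as listed above.
VERDICTS = {
    "exact_equivalent", "inverse_equivalent", "sub_property_of",
    "close_match", "not_equivalent",
}
CONFIDENCE_FLOOR = 0.80  # floor quoted above


@dataclass(frozen=True)
class Verdict:
    source: str
    target: str
    relation: str
    confidence: float


def triage(v: Verdict) -> str:
    """Map an adjudicator verdict to a ledger status (status names are illustrative)."""
    if v.relation not in VERDICTS:
        # Anything outside the closed vocabulary is refused rather than coerced.
        raise ValueError(f"unknown relation: {v.relation!r}")
    if v.confidence < CONFIDENCE_FLOOR:
        return "candidate"   # not confident enough to act on either way
    if v.relation == "not_equivalent":
        return "rejected"    # recorded negative: the pair is not proposed again
    return "accepted"


print(triage(Verdict("wasBornIn", "bornIn", "exact_equivalent", 0.95)))  # accepted
print(triage(Verdict("locatedIn", "bornIn", "not_equivalent", 0.92)))    # rejected
print(triage(Verdict("livedIn", "bornIn", "close_match", 0.61)))         # candidate
```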

The ledger and its safety flags

donto_predicate_alignment (24,228 rows; bitemporal, append-only) carries three independent safety flags: safe_for_query_expansion (default true), safe_for_export (false), safe_for_logical_inference (false) — an alignment good enough to widen recall is not automatically good enough to export or reason over. Only accepted rows reach the closure.

The closure: a flat table queries can join

donto_predicate_closure (1.01M rows = one self row per predicate + ~9.8K real expansion edges) is rebuilt atomically — staged in a temp table, then swapped in one transaction, so readers never see a half-built closure. Each row says: a query for predicate A should also match statements stored under B, via this relation, at this confidence, swapping subject/object if it's an inverse.
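
A small in-memory sketch of what a rebuild might compute. Everything here is an assumption made for illustration: the `ClosureRow` field names, the choice to expand equivalences in both directions, the omission of `close_match` and of transitive chains, and the `ClosureStore` class, which imitates the staged swap by rebinding a single reference.

```python
from typing import NamedTuple


class ClosureRow(NamedTuple):
    query_predicate: str   # what the caller asks for
    stored_predicate: str  # what matching statements may be stored under
    relation: str
    confidence: float
    swap: bool             # True: swap subject/object when matching


def build_closure(predicates, accepted):
    """accepted: iterable of (source, target, relation, confidence) rows."""
    # One self row per predicate, so a plain query is just the trivial fold.
    rows = {ClosureRow(p, p, "direct", 1.0, False) for p in predicates}
    for source, target, relation, conf in accepted:
        if relation == "exact_equivalent":
            rows.add(ClosureRow(target, source, relation, conf, False))
            rows.add(ClosureRow(source, target, relation, conf, False))
        elif relation == "inverse_equivalent":
            rows.add(ClosureRow(target, source, relation, conf, True))
            rows.add(ClosureRow(source, target, relation, conf, True))
        elif relation == "sub_property_of":
            # Only the general term expands to the specific one, never the reverse.
            rows.add(ClosureRow(target, source, relation, conf, False))
    return frozenset(rows)


class ClosureStore:
    """Readers see either the old closure or the new one, never a partial build."""

    def __init__(self):
        self.rows = frozenset()

    def rebuild(self, predicates, accepted):
        staged = build_closure(predicates, accepted)  # built off to the side
        self.rows = staged                            # single rebinding = the swap


store = ClosureStore()
store.rebuild(
    ["bornIn", "wasBornIn", "killed", "killedBy", "murdered"],
    [
        ("wasBornIn", "bornIn", "exact_equivalent", 0.95),
        ("killedBy", "killed", "inverse_equivalent", 0.95),
        ("murdered", "killed", "sub_property_of", 0.90),
    ],
)
print(len(store.rows))  # 10: five self rows plus five expansion edges
```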

Query-time folding — run live

donto_match_aligned() is the standard matcher plus closure expansion. A real fold, executed on the live database: ex:mrs-e-e-brackenridge has her birthplace stored under the freely-minted wasBornIn; querying the established bornIn returns both — nobody maintains a synonym table:

SELECT subject, predicate, object_iri, matched_via, alignment_confidence
FROM donto_match_aligned(p_subject := 'ex:mrs-e-e-brackenridge', p_predicate := 'bornIn');

 ex:mrs-e-e-brackenridge | bornIn    | ex:adelaide | direct           | 1.00
 ex:mrs-e-e-brackenridge | wasBornIn | ex:adelaide | exact_equivalent | 0.95

And the inverse-swap branch — asking a question in the opposite orientation to how the fact was stored:

-- stored: ex:robert-dawson killedBy ex:unruly-horse
SELECT * FROM donto_match_aligned(p_subject := 'ex:unruly-horse', p_predicate := 'killed');

 ex:unruly-horse | killedBy | ex:robert-dawson | inverse_equivalent | 0.95
 -- subject/object swapped back so the row reads in the caller's orientation

Other live folds: birthplaceOf ↔ bornIn (inverse), diedOf ↔ causeOfDeath (inverse), killed → murdered (sub-property: querying the general term also returns the specific), affiliated-with ↔ affiliatedWith. Alignment widens matching, never visibility — consumers still post-filter by their own scope.
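
The two folds above can be replayed over in-memory tuples. This is a toy stand-in, not the real matcher: `match_aligned` and its row shapes are hypothetical, and the statements and closure rows are copied from the examples in this section.

```python
def match_aligned(statements, closure, subject, predicate):
    """statements: (subject, predicate, object) triples.
    closure: (query_pred, stored_pred, relation, confidence, swap) rows."""
    hits = []
    for query_pred, stored_pred, relation, conf, swap in closure:
        if query_pred != predicate:
            continue
        for s, p, o in statements:
            if p != stored_pred:
                continue
            if swap:
                # Inverse relation: present the row in the caller's orientation.
                s, o = o, s
            if s == subject:
                hits.append((s, p, o, relation, conf))
    # Direct matches first, then folds by descending confidence.
    return sorted(hits, key=lambda h: -h[4])


statements = [
    ("ex:mrs-e-e-brackenridge", "bornIn", "ex:adelaide"),
    ("ex:mrs-e-e-brackenridge", "wasBornIn", "ex:adelaide"),
    ("ex:robert-dawson", "killedBy", "ex:unruly-horse"),
]
closure = [
    ("bornIn", "bornIn", "direct", 1.00, False),
    ("bornIn", "wasBornIn", "exact_equivalent", 0.95, False),
    ("killed", "killed", "direct", 1.00, False),
    ("killed", "killedBy", "inverse_equivalent", 0.95, True),
]

for row in match_aligned(statements, closure, "ex:mrs-e-e-brackenridge", "bornIn"):
    print(row)
for row in match_aligned(statements, closure, "ex:unruly-horse", "killed"):
    print(row)
```

The matcher never consults a synonym list; every expansion comes from a closure row, so retracting an alignment and rebuilding the closure removes the fold.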

Identity as hypothesis — never a merge

donto never merges entities. "Same referent" is data: a reversible proposal → a governed pairwise edge → a cluster cache keyed by hypothesis. Which entities exist depends on which identity policy you query under.

the identity stack

table	rows	role
donto_identity_proposal	221 rows	The reversible front door: `same_as, different_from, merge_candidate, split_candidate, alias_of…` with method (human/rule/model/registry/cross-source), confidence, and a status history.
donto_identity_edge	124 rows	The asserted pairwise layer: `same_referent · possibly_same_referent · distinct_referent · not_enough_information`. Bitemporal; retraction closes the range.
donto_identity_hypothesis	13 rows	Named identity policies with clustering thresholds — live: `strict` (0.98), `likely` (0.85), `exploratory` (0.60), plus human curation hypotheses for specific genealogy disambiguations.
donto_identity_cluster_cache	1,026 rows	The derived per-hypothesis resolution: connected components over same_referent edges above the hypothesis's threshold; rep = min symbol id; invalidated by trigger on any edge change.

A real identity, resolved three ways (live)

Discord ingestion minted both ex:traves-theberge and ex:traves_theberge (hyphen vs underscore). Fingerprint embeddings put the pair at cosine 0.84; LLM adjudication judged same_referent @ 0.95; the proposal was accepted and became identity edge #124. Resolving the underscore IRI under each hypothesis:

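SELECT h.name, h.threshold_same, donto_identity_resolve_iri(h.hypothesis_id, 'ex:traves_theberge')
FROM donto_identity_hypothesis h
WHERE h.name IN ('strict', 'likely', 'exploratory')
ORDER BY h.threshold_same DESC;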

 strict      | 0.98 | ex:traves_theberge   -- 0.95 < 0.98: NOT merged under strict
 likely      | 0.85 | ex:traves-theberge   -- clustered: resolves to the representative
 exploratory | 0.60 | ex:traves-theberge

That is identity-as-hypothesis, operationalized

The same edge set yields different entity universes under different policies. No statement was rewritten; retracting the edge (or querying strict) restores two entities.
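
The per-hypothesis resolution can be sketched as a union-find over qualifying edges. Assumptions: the numeric symbol ids (101, 102) are invented, "above the threshold" is read as `>=`, and the `resolve` function is a hypothetical stand-in for the cluster cache. The edge, its 0.95 confidence and the three thresholds come from the example in this section.

```python
def resolve(edges, symbol_id, threshold):
    """Connected components over same_referent edges at or above the threshold.

    edges: (iri_a, iri_b, relation, confidence) tuples.
    symbol_id: IRI -> integer id; the smallest id in a cluster is its representative.
    Returns a mapping IRI -> representative IRI.
    """
    parent = {iri: iri for iri in symbol_id}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, relation, confidence in edges:
        if relation != "same_referent" or confidence < threshold:
            continue
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Keep the root with the smaller symbol id, so the root is always the minimum.
        if symbol_id[ra] > symbol_id[rb]:
            ra, rb = rb, ra
        parent[rb] = ra

    return {iri: find(iri) for iri in symbol_id}


# Invented symbol ids; the edge itself is the one described above.
symbol_id = {"ex:traves-theberge": 101, "ex:traves_theberge": 102}
edges = [("ex:traves-theberge", "ex:traves_theberge", "same_referent", 0.95)]

for name, threshold in (("strict", 0.98), ("likely", 0.85), ("exploratory", 0.60)):
    print(name, resolve(edges, symbol_id, threshold)["ex:traves_theberge"])
# strict ex:traves_theberge
# likely ex:traves-theberge
# exploratory ex:traves-theberge
```

The edge list is identical in all three calls; only the threshold changes, which is why retracting an edge or switching hypothesis needs no statement rewrite.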

The contradiction machinery

Paraconsistency is three mechanics, none of which delete anything:

mechanic	instead of	how it works
polarity is data	not deletion	A claim and its negation are two live rows — flags carry asserted/negated/absent/unknown.
conflict is an edge	not an invalidation	`donto_argument` (2,433 rows; 2,225 `rebuts`) links incompatible claims with typed edges: `supports, rebuts, undercuts, qualifies, explains, alternative_analysis_of, same_evidence_different_analysis, supersedes…` Both sides keep matching queries.
re-ranking	not resolution	`donto_paraconsistency_density` (140,675 rows) pre-aggregates contestedness per subject/predicate window — a Shannon-entropy conflict score — so read paths can rank by it without an O(N²) scan. Retraction exists, but it's an explicit governance act, never an automatic consequence of conflict.
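
For intuition about a Shannon-entropy conflict score, here is a sketch over the asserted values in one subject/predicate window. The exact formula and windowing behind `donto_paraconsistency_density` are not given here, so the base-2 entropy over the value distribution and the `contestedness` function are assumptions.

```python
import math
from collections import Counter


def contestedness(values) -> float:
    """Shannon entropy (in bits) of the asserted-value distribution in one window."""
    counts = Counter(values)
    if len(counts) <= 1:
        return 0.0  # zero or one distinct value: nothing is contested
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(contestedness(["1850", "1850", "1850"]))          # 0.0  (unanimous)
print(contestedness(["1850", "1852"]))                  # 1.0  (evenly split two ways)
print(contestedness(["1850", "1850", "1852", "1853"]))  # 1.5  (three competing values)
```

A score like this rises with both the number of competing values and how evenly the claims are split, and it can be summed per predicate without comparing statements pairwise.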

Where the rebuts edges come from

An epistemic sweep marks genealogy predicates functional (one true value: ex:birthYear, ex:gender, ex:birthPlace…), finds subjects with multiple distinct asserted values, and creates rebuts pairs in a dedicated context — bounded scans, idempotent inserts. Some detected conflicts (gender female vs Female vs F) are really value-normalization gaps — exactly what the value-mapping and literal-canonicalization tables are built for as they come online.
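
The sweep can be sketched over in-memory tuples. Assumptions: the statement ids, the subject `ex:person-a`, the `sweep` function, and the choice to pair one representative statement per distinct value are all illustrative; the three functional predicates and the gender example come from the paragraph above.

```python
from collections import defaultdict
from itertools import combinations

# Predicates treated as functional (one true value), as named above.
FUNCTIONAL = {"ex:birthYear", "ex:gender", "ex:birthPlace"}


def sweep(statements, existing=frozenset()):
    """Return the set of rebuts pairs after one sweep.

    statements: (statement_id, subject, predicate, value) tuples.
    existing: pairs already recorded; adding to a set keeps re-runs idempotent.
    """
    pairs = set(existing)
    first_seen = defaultdict(dict)  # (subject, predicate) -> {value: statement_id}
    for sid, subject, predicate, value in statements:
        if predicate in FUNCTIONAL:
            first_seen[(subject, predicate)].setdefault(value, sid)
    for by_value in first_seen.values():
        # One pair per two distinct values; a single value yields no pairs.
        for a, b in combinations(sorted(by_value.values()), 2):
            pairs.add((a, b))
    return pairs


statements = [
    (1, "ex:person-a", "ex:birthYear", "1850"),
    (2, "ex:person-a", "ex:birthYear", "1852"),
    (3, "ex:person-a", "ex:gender", "female"),
    (4, "ex:person-a", "ex:gender", "Female"),
    (5, "ex:person-a", "ex:gender", "F"),
    (6, "ex:person-a", "ex:knows", "ex:person-b"),  # not functional: ignored
    (7, "ex:person-a", "ex:knows", "ex:person-c"),
]

pairs = sweep(statements)
print(sorted(pairs))                      # [(1, 2), (3, 4), (3, 5), (4, 5)]
print(sweep(statements, pairs) == pairs)  # True: a second sweep adds nothing
```

Three of the four pairs come from female/Female/F, which is the value-normalization gap described above rather than a real dispute.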

SELECT predicate, total_score, windows
FROM donto_v_top_contested_predicates ORDER BY total_score DESC LIMIT 5;

 rdf:type                  | 49715.9 | 53713
 rdfs:label                | 14155.4 | 16182
 rdfType                   |  1774.2 |  1999
 locatedIn                 |  1246.2 |  1367
 interrogationInterrogator |   842.6 |   846

Contestedness is the steering wheel: it tells the discovery lenses where reality is disputed and worth another look.

The whole loop in one sentence

Extractors emit free-typed claims → the fabric puts every predicate and salient entity in vector space → the hybrid proposer nominates folds toward the more-established term → LLM adjudication types the relation against real usage → accepted alignments compile into the closure by atomic swap → donto_match_aligned folds synonyms and inverses at read time at the caller's confidence floor — while identity stays a per-hypothesis resolution and contradictions stay live, linked, and measured rather than resolved. See it feed real answers in how it solves things.
