The evidence chain
donto is evidence-first: a claim is not just a triple, it is a triple plus a resolvable pointer to the exact place in a source where it was (or wasn't) stated. From any statement UUID you can walk to a quoted snippet, its character offsets, the full source text, and the tamper-evident canonical bytes of the original resource.
The chain at a glance
donto_statement the claim (bitemporal, paraconsistent)
│ statement_id
▼
donto_evidence_link typed link: extracted_from / anchored_at / derived_from / …
│ target_span_id
▼
donto_span char offsets + verbatim surface_text (the snippet)
│ revision_id
▼
donto_document_revision immutable content snapshot (body, SHA-256 content_hash)
│ document_id                      │ blob_hash
▼                                  ▼
donto_document                     donto_blob
stable source identity             content-addressed bytes
(IRI, source_url, policy)          (SHA-256 PK, local-FS or GCS)
donto_evidence_link — the typed edge
| column | type | meaning |
|---|---|---|
| statement_id | uuid FK | The claim being evidenced. |
| link_type | text | extracted_from (write-time anchor) · anchored_at (the 1.78M-link legacy backfill) · derived_from (statement→statement, e.g. a claim derived from a memory chunk) · produced_by (→ extraction run) · cited_in / supported_by / contradicted_by. |
| target_* | uuid (6 columns) | Exactly ONE of: span, revision, document, annotation, extraction run, or another statement — enforced by a check constraint. Spans are by far the most common target. |
| confidence | float8 | The extractor's or citer's confidence in this anchor. |
| tx_time | tstzrange | Bitemporal like statements — retracting an evidence link closes the range, never deletes. |
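The exactly-one-target rule on the target_* columns can be pictured as a small validator. This is a minimal Python sketch, not donto's actual check constraint: apart from target_span_id, the field names below are hypothetical stand-ins for the six target columns.

```python
from dataclasses import dataclass, fields
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class EvidenceTargets:
    # Hypothetical names for the six target_* columns; only
    # target_span_id is named in the text above.
    target_span_id: Optional[UUID] = None
    target_revision_id: Optional[UUID] = None
    target_document_id: Optional[UUID] = None
    target_annotation_id: Optional[UUID] = None
    target_run_id: Optional[UUID] = None
    target_statement_id: Optional[UUID] = None


def count_targets(t: EvidenceTargets) -> int:
    """Number of target columns that are set (non-null)."""
    return sum(getattr(t, f.name) is not None for f in fields(t))


def is_valid_link(t: EvidenceTargets) -> bool:
    """Mirror of the constraint: exactly one target must be set."""
    return count_targets(t) == 1


# A span-targeted link passes; an empty or doubly-targeted link does not.
ok = EvidenceTargets(target_span_id=uuid4())
empty = EvidenceTargets()
double = EvidenceTargets(target_span_id=uuid4(), target_statement_id=uuid4())
print(is_valid_link(ok), is_valid_link(empty), is_valid_link(double))
```

In a database the same rule would normally live in the schema rather than in application code, so that no writer can bypass it.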
donto_span — the universal 'place in a source'
A span pins a region of an immutable revision, never the mutable document. It records span_type (in the live data, 100% of spans are char_offset), start_offset / end_offset, and surface_text, the verbatim quoted snippet, which is trigram-indexed for substring search. Spans are the anchor primitive for far more than evidence: mentions, annotations, tables, temporal expressions, and document sections all FK onto span_id.
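Because a span stores both offsets and the verbatim snippet, any reader can re-check it against the revision body. The sketch below assumes 0-based, end-exclusive character offsets, which is consistent with the resolved chain later in this document (offsets 55→95 checked with substring(body from 56 for 40)); the function name is illustrative, not part of donto.

```python
def span_matches(body: str, start_offset: int, end_offset: int,
                 surface_text: str) -> bool:
    """True when slicing the revision body at the span's offsets
    reproduces the stored snippet exactly.

    Assumes 0-based, end-exclusive character offsets.
    """
    if not (0 <= start_offset <= end_offset <= len(body)):
        return False  # offsets fall outside the revision body
    return body[start_offset:end_offset] == surface_text


body = "Revenue grew 12% in 2021."
print(span_matches(body, 13, 16, "12%"))   # offsets agree with the snippet
print(span_matches(body, 12, 16, "12%"))   # off by one: rejected
```

The check only works because the span points at an immutable revision; against a mutable document the offsets could silently drift.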
Beyond text — the anchor-kind registry
Char spans are today's workhorse, but donto_anchor_kind declares 13 anchor vocabularies with per-kind locator schemas, so non-text sources anchor the same way. Beyond character offsets, the kinds are: page_box (PDF), image_box, media_time (audio/video), table_cell, csv_row, json_pointer, xml_xpath, html_css, token_range, annotation_id, archive_field, whole_source.
Documents, revisions, blobs
| table | role | contents |
|---|---|---|
| donto_document | stable identity | The source's IRI, media_type, source_url, source_kind (pdf/image/audio/webpage/archive_record/…), EDTF-shaped source_date (uncertainty-aware, e.g. {"value":"1860..1862"}), and a mandatory policy_id — every document is policy-governed with a fail-closed default (restricted_pending_review). |
| donto_document_revision | immutable snapshot | The actual content: body text (FTS-indexed), content_hash (SHA-256 — re-adding identical content returns the existing revision), version_kind (raw/ocr/transcript/parsed/translated/redacted), revision lineage (derived_from_versions), and the blob link (blob_hash, body_uri, body_storage). |
| donto_blob | canonical bytes | Content-addressed: sha256 IS the primary key — one row per unique byte sequence, automatic dedup. bucket_uri points at local FS (file:///mnt/donto-data/blobs/sha256/…) or GCS. Plus the tombstone columns (below). |
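Content addressing is what gives both deduplication and tamper evidence: the key is derived from the bytes, so identical bytes collapse to one row and any change to the bytes breaks the key. A minimal in-memory sketch using Python's hashlib follows; the BlobStore class and its method names are hypothetical and stand in for the donto_blob table and its storage backend.

```python
import hashlib
from typing import Dict


class BlobStore:
    """Toy content-addressed store: the SHA-256 hex digest is the key."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the first copy, so re-adding identical
        # bytes is a no-op that returns the same address.
        self._blobs.setdefault(digest, data)
        return digest

    def verify(self, digest: str) -> bool:
        """Recompute the hash of the stored bytes and compare it to the key."""
        data = self._blobs.get(digest)
        return data is not None and hashlib.sha256(data).hexdigest() == digest

    def __len__(self) -> int:
        return len(self._blobs)


store = BlobStore()
first = store.put(b"example capture")
second = store.put(b"example capture")   # same bytes, same address
print(first == second, len(store), store.verify(first))
```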
Tombstoning — how 'never delete' meets 'right to be forgotten'
Invariant I3 says never destroy; GDPR and Indigenous cultural-material protocols sometimes require true deletion. donto splits the two: the content is stored encrypted with an external key reference (encryption_key_iri — donto holds no key material); tombstoning drops the key reference and zeroes the backend bytes, making the content cryptographically unreachable. The fact of deletion — who, when, under what authority (tombstone_authority, e.g. GDPR-Art-17 or community-resolution-7), citing which donto_attestation — stays queryable forever. Evidence links to tombstoned blobs keep existing; reads return a redaction marker.
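The split between destroying content and preserving the record of deletion can be sketched as below. This is an illustration under stated assumptions, not donto's implementation: encryption_key_iri and tombstone_authority come from the text above, while the other field names, the REDACTED marker, and the function names are hypothetical. No real encryption is performed here.

```python
from dataclasses import dataclass
from typing import Optional, Union

REDACTED = "[redacted: tombstoned]"  # hypothetical redaction marker


@dataclass
class BlobRecord:
    sha256: str
    ciphertext: bytes                      # encrypted content at rest
    encryption_key_iri: Optional[str]      # external key reference only
    tombstone_authority: Optional[str] = None
    tombstone_attestation: Optional[str] = None  # hypothetical field
    tombstoned_by: Optional[str] = None          # hypothetical field
    tombstoned_at: Optional[str] = None          # hypothetical field


def tombstone(blob: BlobRecord, authority: str, attestation_id: str,
              actor: str, when: str) -> None:
    """Make the content unreachable while keeping the fact of deletion."""
    blob.ciphertext = bytes(len(blob.ciphertext))  # zero the stored bytes
    blob.encryption_key_iri = None                 # drop the key reference
    blob.tombstone_authority = authority
    blob.tombstone_attestation = attestation_id
    blob.tombstoned_by = actor
    blob.tombstoned_at = when


def read(blob: BlobRecord) -> Union[bytes, str]:
    """Return the ciphertext, or a redaction marker once tombstoned."""
    if blob.tombstone_authority is not None:
        return REDACTED
    return blob.ciphertext
```

The row itself is never deleted, so evidence links that point at it still resolve; they simply resolve to the marker and the recorded authority.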
The citer — deciding stated vs interpreted
Extraction and anchoring are separate stages. Extractors maximize recall; a mandatory post-processing citer then decides, for every fact, exactly where it was stated or, honestly, that it was stated nowhere. A wrong span is treated as worse than none; the citer never emits a bogus anchor.
| lane | handles | how it anchors |
|---|---|---|
| content lane | literal objects | Flexible substring matching (whitespace/case-tolerant). ~81% of literal facts anchor lexically, with zero bogus spans on the calibration corpus. |
| relational lane | IRI objects | Three stacked gates: (1) co-location — the window must contain the subject's distinguishing tokens AND the object's (distinguishing = IDF learned from the document's own entities); (2) structural exclusion of titles/dense headers; (3) a predicate-direction check — embedding cosine between the window and the humanized claim must clear a data-calibrated threshold. |
| semantic layer | unplaced facts | bge-small cosine argmax over source windows, calibrated threshold + margin guard. |
| unanchorable | the honest bucket | Everything else: anchor=None, hypothesis_only=true, confidence ≤ 0.4. Never dropped, never given a plausible-looking neighbour span. This is what separates STATED from INTERPRETED — and doubles as a hallucination filter. |
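The content lane and the unanchorable bucket can be illustrated together. The sketch below is one plausible way to do whitespace- and case-tolerant matching (a regular expression built from the literal's tokens); it is an assumption for illustration, not the citer's actual algorithm, and the Anchor type and anchor_literal function are hypothetical names. It takes the first match and does not handle ambiguity between several candidate locations.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Anchor:
    span: Optional[Tuple[int, int]]   # 0-based, end-exclusive offsets
    surface_text: Optional[str]       # verbatim text found in the source
    hypothesis_only: bool
    confidence: float


def anchor_literal(source: str, literal: str,
                   confidence: float = 1.0) -> Anchor:
    """Anchor a literal in the source, tolerating case and whitespace.

    If no match exists, return the honest 'unanchorable' result:
    no span, hypothesis_only set, confidence capped at 0.4.
    """
    tokens = literal.split()
    if tokens:
        # Any run of whitespace in the source may separate the tokens.
        pattern = r"\s+".join(re.escape(t) for t in tokens)
        m = re.search(pattern, source, flags=re.IGNORECASE)
        if m:
            return Anchor((m.start(), m.end()), m.group(0), False, confidence)
    return Anchor(None, None, True, min(confidence, 0.4))


source = "The archive holds  2 Million\ndocuments in total."
found = anchor_literal(source, "2 million documents")
missing = anchor_literal(source, "3 million documents")
print(found.span is not None, missing.span, missing.hypothesis_only)
```

Note that the returned surface_text is the source's own wording, not the extractor's literal, so the stored snippet stays verbatim.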
A real chain, resolved live
One genealogy claim, walked end to end on the live database — four indexed joins from a statement UUID to tamper-evident source bytes:
statement 677ab36e-… "4th cousin or half 3rd cousin 1x removed" (ctx:genes/lisa-raquel)
▼ evidence_link anchored_at, confidence 0.4
span f795c9e8-… char_offset 55→95, surface_text = the exact phrase
▼ revision_id
revision 6dc0bd52-… 115,034 bytes; substring(body from 56 for 40) == surface_text ✓
▼ document_id / blob_hash
document donto:blob/sha256/fff3242c… (an Ancestry DNA-match API capture, application/json)
blob sha256 fff3242c…, file on disk verified: sha256sum(file) == content address ✓
The same shape works for a current production row: a BEAM-10M claim (ex:large-volume isChallenge true, asserted at E2) resolves through an extracted_from link at confidence 1.0 to the span "2 million documents" at offsets 465–484 of its source chunk. 437K+ evidence links live under the BEAM contexts alone.
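The walk from statement to bytes can be mimicked with plain dictionaries standing in for the tables. Everything in this sketch is toy data invented for illustration (the identifiers, the JSON body, and the resolve function are hypothetical); only the hop order and the two integrity checks follow the chain described above.

```python
import hashlib

# Toy source bytes and the revision body derived from them.
raw = b'{"relationship": "4th cousin"}'
body = raw.decode("utf-8")
blob_hash = hashlib.sha256(raw).hexdigest()

snippet = "4th cousin"
start = body.index(snippet)

# In-memory stand-ins for the tables, keyed by made-up ids.
blobs = {blob_hash: raw}
documents = {"doc-1": {"iri": "donto:blob/sha256/" + blob_hash}}
revisions = {"rev-1": {"document_id": "doc-1", "blob_hash": blob_hash,
                       "body": body}}
spans = {"span-1": {"revision_id": "rev-1", "start_offset": start,
                    "end_offset": start + len(snippet),
                    "surface_text": snippet}}
links = [{"statement_id": "stmt-1", "link_type": "anchored_at",
          "target_span_id": "span-1"}]


def resolve(statement_id: str) -> dict:
    """Follow statement -> link -> span -> revision -> document/blob."""
    link = next(l for l in links if l["statement_id"] == statement_id)
    span = spans[link["target_span_id"]]
    rev = revisions[span["revision_id"]]
    doc = documents[rev["document_id"]]
    blob = blobs[rev["blob_hash"]]
    sliced = rev["body"][span["start_offset"]:span["end_offset"]]
    return {
        "snippet": span["surface_text"],
        "span_ok": sliced == span["surface_text"],
        "blob_ok": hashlib.sha256(blob).hexdigest() == rev["blob_hash"],
        "document_iri": doc["iri"],
    }


print(resolve("stmt-1"))
```

Each hop is a single key lookup, which is the in-memory analogue of the indexed joins mentioned above.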
Extraction bookkeeping
| table | size | purpose |
|---|---|---|
| donto_extract_queue | 360,899 rows | The work queue donto-agent leases from (FOR UPDATE SKIP LOCKED): 339K pending / 21K done at time of writing — the claims-coverage frontier. |
| donto_extraction_run | 93 rows | One row per extraction run: model identity, version, parameters — what produced_by links point at. |
| donto_trace_log | 2.5M rows | The legacy re-anchoring backfill's audit trail (matching textSpan literals to real spans across revisions). |