
The evidence chain

donto is evidence-first: a claim is not just a triple, it is a triple plus a resolvable pointer to the exact place in a source where it was (or wasn't) stated. From any statement UUID you can walk to a quoted snippet, its character offsets, the full source text, and the tamper-evident canonical bytes of the original resource.

evidence links: 2.59M · spans: 1.0M+ · documents: 66,440 · revisions: 68,378 · blobs: 50,606 (6.0 GB) · statements with evidence: 1.58M

The chain at a glance

donto_statement                     the claim (bitemporal, paraconsistent)
      │ statement_id
      ▼
donto_evidence_link                 typed link: extracted_from / anchored_at / derived_from / …
      │ target_span_id
      ▼
donto_span                          char offsets + verbatim surface_text (the snippet)
      │ revision_id
      ▼
donto_document_revision             immutable content snapshot (body, SHA-256 content_hash)
      │ document_id          │ blob_hash
      ▼                      ▼
donto_document               donto_blob          stable source identity · content-addressed bytes
(IRI, source_url, policy)    (SHA-256 PK, local-FS or GCS)

Two rules shape everything here

Every layer is insert-only (invariant I3): evidence links are bitemporally retracted, blobs are tombstoned, nothing is row-deleted. And an anchorless claim is never rejected (invariant I1): it is asserted with its maturity capped at E1 — honesty about what is stated vs interpreted, not gatekeeping.

donto_evidence_link — the typed edge

donto_evidence_link (2.59M rows)

column	type	meaning
statement_id	uuid FK	The claim being evidenced.
link_type	text	`extracted_from` (write-time anchor) · `anchored_at` (the 1.78M-link legacy backfill) · `derived_from` (statement→statement, e.g. a claim derived from a memory chunk) · `produced_by` (→ extraction run) · `cited_in` · `supported_by` · `contradicted_by`.
target_*	uuid (6 columns)	Exactly ONE of: span, revision, document, annotation, extraction run, or another statement — enforced by a check constraint. Spans are by far the most common target.
confidence	float8	The extractor's or citer's confidence in this anchor.
tx_time	tstzrange	Bitemporal like statements — retracting an evidence link closes the range, never deletes.
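
The two constraints in this table (exactly one target, and retraction by closing `tx_time`) can be sketched in a few lines. This is a minimal illustration, not donto's schema code: only `target_span_id` appears on this page, so the other five target field names are hypothetical, and the `tstzrange` is modelled as a simple `tx_lo` / `tx_hi` pair.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional
from uuid import UUID

# Only target_span_id is named in the source; the other five names are assumed.
TARGET_FIELDS = (
    "target_span_id", "target_revision_id", "target_document_id",
    "target_annotation_id", "target_run_id", "target_statement_id",
)


@dataclass(frozen=True)
class EvidenceLink:
    statement_id: UUID
    link_type: str
    confidence: float
    tx_lo: datetime
    tx_hi: Optional[datetime] = None  # None = open range, link is current
    target_span_id: Optional[UUID] = None
    target_revision_id: Optional[UUID] = None
    target_document_id: Optional[UUID] = None
    target_annotation_id: Optional[UUID] = None
    target_run_id: Optional[UUID] = None
    target_statement_id: Optional[UUID] = None

    def __post_init__(self):
        # Mirrors the check constraint: exactly one target column is set.
        n = sum(getattr(self, f) is not None for f in TARGET_FIELDS)
        if n != 1:
            raise ValueError(f"exactly one target required, got {n}")


def retract(link: EvidenceLink, at: Optional[datetime] = None) -> EvidenceLink:
    """Close the transaction-time range; the link itself is kept, not deleted."""
    if link.tx_hi is not None:
        raise ValueError("link is already retracted")
    return replace(link, tx_hi=at or datetime.now(timezone.utc))
```

A retracted link keeps every field it had; only the upper bound of its transaction-time range changes, so historical queries can still see it.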

donto_span — the universal 'place in a source'

A span pins a region of an immutable revision (never the mutable document): span_type (live data: 100% char_offset), start_offset / end_offset, and surface_text — the verbatim quoted snippet, trigram-indexed for substring search. Spans are the anchor primitive for far more than evidence: mentions, annotations, tables, temporal expressions, and document sections all FK onto span_id.
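
As a sketch of how the offsets relate to the snippet, the helpers below assume offsets are 0-based and end-exclusive. That convention is inferred from the worked example on this page, where offsets 55→95 correspond to `substring(body from 56 for 40)`; the function names are illustrative.

```python
def span_is_valid(body: str, start_offset: int, end_offset: int,
                  surface_text: str) -> bool:
    """True when the offsets lie inside the body and reproduce the snippet."""
    in_bounds = 0 <= start_offset <= end_offset <= len(body)
    return in_bounds and body[start_offset:end_offset] == surface_text


def sql_substring_args(start_offset: int, end_offset: int) -> tuple:
    """Convert 0-based, end-exclusive offsets to SQL substring(body from X for Y).

    SQL string positions are 1-based, so the start shifts by one and the
    second argument is a length rather than an end position.
    """
    return start_offset + 1, end_offset - start_offset
```

Because a span points at an immutable revision, this check can be re-run at any time and should keep giving the same answer.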

Beyond text — the anchor-kind registry

Char spans are today's workhorse, but donto_anchor_kind declares 13 anchor vocabularies with per-kind locator schemas, so non-text sources anchor the same way: page_box (PDF), image_box, media_time (audio/video), table_cell, csv_row, json_pointer, xml_xpath, html_css, token_range, annotation_id, archive_field, whole_source.

Documents, revisions, blobs

the three storage layers

table	role	contents
donto_document	stable identity	The source's IRI, `media_type`, `source_url`, `source_kind` (pdf/image/audio/webpage/archive_record/…), EDTF-shaped `source_date` (uncertainty-aware, e.g. `{"value":"1860..1862"}`), and a mandatory `policy_id` — every document is policy-governed with a fail-closed default (`restricted_pending_review`).
donto_document_revision	immutable snapshot	The actual content: `body` text (FTS-indexed), `content_hash` (SHA-256 — re-adding identical content returns the existing revision), `version_kind` (raw/ocr/transcript/parsed/translated/redacted), revision lineage (`derived_from_versions`), and the blob link (`blob_hash`, `body_uri`, `body_storage`).
donto_blob	canonical bytes	Content-addressed: `sha256` IS the primary key — one row per unique byte sequence, automatic dedup. `bucket_uri` points at local FS (`file:///mnt/donto-data/blobs/sha256/…`) or GCS. Plus the tombstone columns (below).
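
Content addressing is what makes the dedup automatic: the key is computed from the bytes, so identical bytes always land on the same row. A minimal in-memory sketch follows (the `BlobStore` class is illustrative; the real store writes to local FS or GCS):

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: the SHA-256 hex digest is the key."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing entry, so re-adding the same
        # bytes never creates a second copy.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def verify(self, digest: str) -> bool:
        """Recompute the hash; a mismatch means the stored bytes changed."""
        return hashlib.sha256(self._blobs[digest]).hexdigest() == digest

    def __len__(self) -> int:
        return len(self._blobs)
```

The `verify` step is the same check as running `sha256sum` on the stored file and comparing the result with the content address.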

Tombstoning — how 'never delete' meets 'right to be forgotten'

Invariant I3 says never destroy; GDPR and Indigenous cultural-material protocols sometimes require true deletion. donto splits the two: the content is stored encrypted with an external key reference (encryption_key_iri — donto holds no key material); tombstoning drops the key reference and zeroes the backend bytes, making the content cryptographically unreachable. The fact of deletion — who, when, under what authority (tombstone_authority, e.g. GDPR-Art-17 or community-resolution-7), citing which donto_attestation — stays queryable forever. Evidence links to tombstoned blobs keep existing; reads return a redaction marker.
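
The split between "content gone" and "fact of deletion kept" can be sketched as below. This is an illustration under assumptions: `encryption_key_iri` and `tombstone_authority` come from this page, while `ciphertext`, `tombstoned_at`, `tombstoned_by`, `attestation_id`, and the redaction marker string are hypothetical names. No real encryption is performed here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union

REDACTED = "[redacted: blob tombstoned]"  # hypothetical marker text


@dataclass
class BlobRecord:
    sha256: str                        # content address, never changes
    ciphertext: bytes                  # stand-in for the encrypted backend bytes
    encryption_key_iri: Optional[str]  # reference only; no key material held
    tombstoned_at: Optional[datetime] = None
    tombstoned_by: Optional[str] = None
    tombstone_authority: Optional[str] = None
    attestation_id: Optional[str] = None


def tombstone(blob: BlobRecord, *, by: str, authority: str,
              attestation_id: str, at: Optional[datetime] = None) -> None:
    """Make the content unreachable while keeping the row and its audit trail."""
    blob.encryption_key_iri = None                  # drop the key reference
    blob.ciphertext = bytes(len(blob.ciphertext))   # zero the stored bytes
    blob.tombstoned_at = at or datetime.now(timezone.utc)
    blob.tombstoned_by = by
    blob.tombstone_authority = authority
    blob.attestation_id = attestation_id


def read(blob: BlobRecord) -> Union[bytes, str]:
    """Reads of a tombstoned blob return a redaction marker, not an error."""
    return REDACTED if blob.tombstoned_at is not None else blob.ciphertext
```

The row, its `sha256`, and the who/when/authority fields survive, so anything that links to the blob still resolves; it resolves to a marker instead of content.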

The citer — deciding stated vs interpreted

Extraction and anchoring are separate stages. Extractors maximize recall; then a mandatory post-processing citer decides for every fact: where exactly was this stated — or honestly, nowhere. A wrong span is treated as worse than none; the citer never emits a bogus anchor.

the citer's routing (structural, never a hand-maintained list)

lane	handles	how it anchors
content lane	literal objects	Flexible substring matching (whitespace/case-tolerant). ~81% of literal facts anchor lexically, with zero bogus spans on the calibration corpus.
relational lane	IRI objects	Three stacked gates: (1) co-location — the window must contain the subject's distinguishing tokens AND the object's (distinguishing = IDF learned from the document's own entities); (2) structural exclusion of titles/dense headers; (3) a predicate-direction check — embedding cosine between the window and the humanized claim must clear a data-calibrated threshold.
semantic layer	unplaced facts	bge-small cosine argmax over source windows, calibrated threshold + margin guard.
unanchorable	the honest bucket	Everything else: `anchor=None, hypothesis_only=true`, confidence ≤ 0.4. Never dropped, never given a plausible-looking neighbour span. This is what separates STATED from INTERPRETED — and doubles as a hallucination filter.
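
The first and last rows of the table can be sketched together: try a whitespace- and case-tolerant literal match, and fall back to the honest bucket when nothing matches. The citer's actual matching algorithm is not specified here, so the regex approach, the function names, and the choice of the first match are all assumptions; only the `anchor=None, hypothesis_only=true`, confidence ≤ 0.4 outcome is taken from the table.

```python
import re
from typing import Optional, Tuple


def find_literal(body: str, literal: str) -> Optional[Tuple[int, int]]:
    """Locate `literal` in `body`, ignoring case and whitespace differences."""
    tokens = literal.split()
    if not tokens:
        return None
    # Any run of whitespace in the source may stand in for a single space.
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, body, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None


def cite_literal(body: str, literal: str, confidence: float) -> dict:
    """Return a real anchor, or an unanchored record with capped confidence."""
    hit = find_literal(body, literal)
    if hit is None:
        # Unanchorable: kept, flagged, and never given a neighbouring span.
        return {"anchor": None, "hypothesis_only": True,
                "confidence": min(confidence, 0.4)}
    start, end = hit
    return {
        "anchor": {"start_offset": start, "end_offset": end,
                   "surface_text": body[start:end]},  # verbatim source text
        "hypothesis_only": False,
        "confidence": confidence,
    }
```

Note that `surface_text` is sliced from the source rather than copied from the extractor's literal, so the stored snippet is always verbatim even when the match was tolerant.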

A real chain, resolved live

One genealogy claim, walked end to end on the live database — four indexed joins from a statement UUID to tamper-evident source bytes:

statement   677ab36e-…  "4th cousin or half 3rd cousin 1x removed"   (ctx:genes/lisa-raquel)
   ▼ evidence_link  anchored_at, confidence 0.4
span        f795c9e8-…  char_offset 55→95, surface_text = the exact phrase
   ▼ revision_id
revision    6dc0bd52-…  115,034 bytes; substring(body from 56 for 40) == surface_text ✓
   ▼ document_id / blob_hash
document    donto:blob/sha256/fff3242c…  (an Ancestry DNA-match API capture, application/json)
blob        sha256 fff3242c…  — file on disk verified: sha256sum(file) == content address ✓

The same shape works for a current production row: a BEAM-10M claim (ex:large-volume isChallenge true, asserted at E2) resolves through an extracted_from link at confidence 1.0 to the span "2 million documents" at offsets 465–484 of its source chunk. 437K+ evidence links live under the BEAM contexts alone.
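
The walk itself is a chain of key lookups with two integrity checks along the way. The sketch below uses plain dicts as stand-ins for the tables; the dict layout and the `resolve_chain` name are illustrative, and real resolution would be SQL joins against the live schema.

```python
import hashlib


def resolve_chain(statement_id, links, spans, revisions, blobs):
    """Walk statement -> evidence link -> span -> revision -> blob.

    `links` is a list of dicts; `spans`, `revisions`, and `blobs` are dicts
    keyed by span id, revision id, and SHA-256 hex digest respectively.
    Raises StopIteration if the statement has no span-targeted link.
    """
    # Take the first span-targeted link (an arbitrary choice for the sketch).
    link = next(
        l for l in links
        if l["statement_id"] == statement_id and l.get("target_span_id")
    )
    span = spans[link["target_span_id"]]
    revision = revisions[span["revision_id"]]
    blob_bytes = blobs[revision["blob_hash"]]

    snippet = revision["body"][span["start_offset"]:span["end_offset"]]
    return {
        "link_type": link["link_type"],
        "confidence": link["confidence"],
        "snippet": snippet,
        # Check 1: the offsets still reproduce the stored snippet.
        "span_ok": snippet == span["surface_text"],
        # Check 2: the stored bytes still hash to their content address.
        "blob_ok": hashlib.sha256(blob_bytes).hexdigest() == revision["blob_hash"],
    }
```

If either flag comes back false, the chain is broken at that hop: a failed `span_ok` means the offsets no longer match the revision body, and a failed `blob_ok` means the stored bytes no longer match their address.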

Extraction bookkeeping

table	size	purpose
donto_extract_queue	360,899 rows	The work queue donto-agent leases from (`FOR UPDATE SKIP LOCKED`): 339K pending / 21K done at time of writing — the claims-coverage frontier.
donto_extraction_run	93 rows	One row per extraction run: model identity, version, parameters — what produced_by links point at.
donto_trace_log	2.5M rows	The legacy re-anchoring backfill's audit trail (matching textSpan literals to real spans across revisions).
