The evidence chain

donto is evidence-first: a claim is not just a triple, it is a triple plus a resolvable pointer to the exact place in a source where it was (or wasn't) stated. From any statement UUID you can walk to a quoted snippet, its character offsets, the full source text, and the tamper-evident canonical bytes of the original resource.

evidence links 2.59M · spans 1.0M+ · documents 66,440 · revisions 68,378 · blobs 50,606 (6.0 GB) · statements with evidence 1.58M

The chain at a glance

donto_statement                     the claim (bitemporal, paraconsistent)
      │ statement_id
      ▼
donto_evidence_link                 typed link: extracted_from / anchored_at / derived_from / …
      │ target_span_id
      ▼
donto_span                          char offsets + verbatim surface_text (the snippet)
      │ revision_id
      ▼
donto_document_revision             immutable content snapshot (body, SHA-256 content_hash)
      │ document_id          │ blob_hash
      ▼                      ▼
donto_document               donto_blob          stable source identity · content-addressed bytes
(IRI, source_url, policy)    (SHA-256 PK, local-FS or GCS)

Two rules shape everything here

Every layer is insert-only (invariant I3): evidence links are bitemporally retracted, blobs are tombstoned, nothing is row-deleted. And an anchorless claim is never rejected (invariant I1): it is asserted with its maturity capped at E1 — honesty about what is stated vs interpreted, not gatekeeping.
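
The two rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the EvidenceLink class, the retracted_at field, and the idea that maturity labels are ordered by their numeric suffix are hypothetical, not donto's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class EvidenceLink:
    statement_id: str
    target_span_id: Optional[str]             # None models an anchorless claim
    retracted_at: Optional[datetime] = None   # hypothetical column name


def effective_maturity(requested: str, has_anchor: bool) -> str:
    """I1: an anchorless claim is accepted, but its maturity is capped at E1."""
    level = int(requested[1:])                # assumes labels look like "E<number>"
    if not has_anchor:
        level = min(level, 1)
    return f"E{level}"


def retract(link: EvidenceLink) -> None:
    """I3: record the retraction on the row; the row itself is never removed."""
    if link.retracted_at is None:
        link.retracted_at = datetime.now(timezone.utc)


def current_links(links: List[EvidenceLink]) -> List[EvidenceLink]:
    """The 'current' view simply filters out retracted rows."""
    return [link for link in links if link.retracted_at is None]


links = [EvidenceLink("s1", "span-1"), EvidenceLink("s2", None)]
retract(links[0])
```

After the retraction, links still holds both rows; only the current view shrinks.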

donto_span — the universal 'place in a source'

A span pins a region of an immutable revision (never the mutable document): span_type (live data: 100% char_offset), start_offset / end_offset, and surface_text — the verbatim quoted snippet, trigram-indexed for substring search. Spans are the anchor primitive for far more than evidence: mentions, annotations, tables, temporal expressions, and document sections all FK onto span_id.
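
A span's offsets and its surface_text describe the same region twice, so they can be checked against each other. The sketch below assumes offsets are 0-based with an exclusive end, which matches the resolved chain later on this page (offsets 55→95 read back with substring(body from 56 for 40)). The Span class and the sample body are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    revision_id: str
    start_offset: int   # assumed 0-based, inclusive
    end_offset: int     # assumed exclusive
    surface_text: str   # verbatim snippet stored with the span


def span_is_consistent(span: Span, revision_body: str) -> bool:
    """True when the stored snippet equals the slice the offsets point at."""
    in_bounds = 0 <= span.start_offset <= span.end_offset <= len(revision_body)
    if not in_bounds:
        return False
    return revision_body[span.start_offset:span.end_offset] == span.surface_text


# Made-up revision body for the example.
body = "The archive holds 2 million documents in total."
snippet = "2 million documents"
start = body.index(snippet)
good = Span("rev-1", start, start + len(snippet), snippet)
shifted = Span("rev-1", start + 1, start + 1 + len(snippet), snippet)
```

Here span_is_consistent(good, body) holds, while the span shifted by one character fails the check. Because a span pins an immutable revision, the check stays valid for as long as the revision exists.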

Beyond text — the anchor-kind registry

Char spans are today's workhorse, but donto_anchor_kind declares 13 anchor vocabularies with per-kind locator schemas, so non-text sources anchor the same way. Among them: page_box (PDF), image_box, media_time (audio/video), table_cell, csv_row, json_pointer, xml_xpath, html_css, token_range, annotation_id, archive_field, whole_source.
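
A per-kind locator schema can be pictured as a set of required keys for each anchor kind. The kind names below come from the registry above, but every locator key (page, x, start_ms, pointer, and so on) is a hypothetical placeholder; the real schemas stored in donto_anchor_kind are not shown on this page.

```python
from typing import Dict, List, Set

# Hypothetical required keys per anchor kind; only the kind names are donto's.
REQUIRED_LOCATOR_KEYS: Dict[str, Set[str]] = {
    "page_box": {"page", "x", "y", "width", "height"},
    "media_time": {"start_ms", "end_ms"},
    "json_pointer": {"pointer"},
    "csv_row": {"row"},
    "whole_source": set(),
}


def missing_locator_keys(kind: str, locator: dict) -> List[str]:
    """Return the required keys absent from a locator, sorted by name."""
    if kind not in REQUIRED_LOCATOR_KEYS:
        raise ValueError(f"unknown anchor kind: {kind}")
    return sorted(REQUIRED_LOCATOR_KEYS[kind] - set(locator))


complete = missing_locator_keys("media_time", {"start_ms": 1500, "end_ms": 4200})
partial = missing_locator_keys("page_box", {"page": 3, "x": 10, "y": 20})
```

In this sketch complete is an empty list and partial names the two missing box dimensions.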

Documents, revisions, blobs

The three storage layers:

table | role | contents
donto_document | stable identity | The source's IRI, media_type, source_url, source_kind (pdf/image/audio/webpage/archive_record/…), EDTF-shaped source_date (uncertainty-aware, e.g. {"value":"1860..1862"}), and a mandatory policy_id. Every document is policy-governed with a fail-closed default (restricted_pending_review).
donto_document_revision | immutable snapshot | The actual content: body text (FTS-indexed), content_hash (SHA-256; re-adding identical content returns the existing revision), version_kind (raw/ocr/transcript/parsed/translated/redacted), revision lineage (derived_from_versions), and the blob link (blob_hash, body_uri, body_storage).
donto_blob | canonical bytes | Content-addressed: sha256 IS the primary key, so there is one row per unique byte sequence and dedup is automatic. bucket_uri points at local FS (file:///mnt/donto-data/blobs/sha256/…) or GCS. Plus the tombstone columns (below).
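
Content addressing and revision dedup both fall out of hashing the content first. The sketch below is an in-memory stand-in, not donto's storage code: the class names are made up, and keying revision dedup by (document_id, content_hash) is an assumption, since the text only says that re-adding identical content returns the existing revision.

```python
import hashlib
from typing import Dict, Tuple


class BlobStore:
    """One entry per unique byte sequence; the SHA-256 digest is the key."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)   # identical bytes dedup automatically
        return digest

    def verify(self, digest: str) -> bool:
        """Tamper check: the stored bytes must still hash to their own key."""
        return hashlib.sha256(self._blobs[digest]).hexdigest() == digest

    def __len__(self) -> int:
        return len(self._blobs)


class RevisionStore:
    """Re-adding identical content returns the existing revision id."""

    def __init__(self) -> None:
        self._by_hash: Dict[Tuple[str, str], int] = {}

    def add(self, document_id: str, body: str) -> int:
        content_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
        key = (document_id, content_hash)      # assumed dedup scope
        if key not in self._by_hash:
            self._by_hash[key] = len(self._by_hash) + 1
        return self._by_hash[key]


blobs = BlobStore()
first = blobs.put(b"same bytes")
second = blobs.put(b"same bytes")
revisions = RevisionStore()
rev_a = revisions.add("doc-1", "hello")
rev_b = revisions.add("doc-1", "hello")
rev_c = revisions.add("doc-1", "hello, edited")
```

Both put calls return the same digest and the store holds a single blob. The two identical bodies share one revision id, while the edited body gets a new one.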

Tombstoning — how 'never delete' meets 'right to be forgotten'

Invariant I3 says never destroy; GDPR and Indigenous cultural-material protocols sometimes require true deletion. donto splits the two: the content is stored encrypted with an external key reference (encryption_key_iri — donto holds no key material); tombstoning drops the key reference and zeroes the backend bytes, making the content cryptographically unreachable. The fact of deletion — who, when, under what authority (tombstone_authority, e.g. GDPR-Art-17 or community-resolution-7), citing which donto_attestation — stays queryable forever. Evidence links to tombstoned blobs keep existing; reads return a redaction marker.
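
The split between destroying content and keeping the fact of deletion can be sketched as follows. This is a toy model under stated assumptions: encryption_key_iri and tombstone_authority are named in the text, while BlobRow, tombstoned_at, tombstone_attestation, the key IRI, and the redaction marker string are hypothetical. No real encryption happens here; the ciphertext field just stands in for the encrypted backend bytes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Union

REDACTED = "[redacted: blob tombstoned]"   # hypothetical marker


@dataclass
class BlobRow:
    sha256: str
    ciphertext: bytes                          # stands in for encrypted backend bytes
    encryption_key_iri: Optional[str]          # reference only, never key material
    tombstoned_at: Optional[datetime] = None   # hypothetical column name
    tombstone_authority: Optional[str] = None
    tombstone_attestation: Optional[str] = None  # hypothetical column name


def tombstone(row: BlobRow, authority: str, attestation_id: str) -> None:
    """Make the content unreachable while keeping the row and its audit facts."""
    row.encryption_key_iri = None                  # drop the key reference
    row.ciphertext = bytes(len(row.ciphertext))    # zero the stored bytes
    row.tombstoned_at = datetime.now(timezone.utc)
    row.tombstone_authority = authority
    row.tombstone_attestation = attestation_id


def read(row: BlobRow) -> Union[bytes, str]:
    """Reads of a tombstoned blob return a redaction marker, not an error."""
    return REDACTED if row.tombstoned_at is not None else row.ciphertext


row = BlobRow("example-digest", b"\x8f\x02\x11\x7a", "ex:keys/example")
before = read(row)
tombstone(row, "GDPR-Art-17", "attestation-1")
after = read(row)
```

The row survives the tombstone call, so evidence links that point at it keep resolving; they just resolve to the marker.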

The citer — deciding stated vs interpreted

Extraction and anchoring are separate stages. Extractors maximize recall; then a mandatory post-processing citer decides for every fact: where exactly was this stated — or honestly, nowhere. A wrong span is treated as worse than none; the citer never emits a bogus anchor.

The citer's routing (structural, never a hand-maintained list):

lane | applies to | behaviour
content lane | literal objects | Flexible substring matching (whitespace/case-tolerant). ~81% of literal facts anchor lexically, with zero bogus spans on the calibration corpus.
relational lane | IRI objects | Three stacked gates: (1) co-location: the window must contain the subject's distinguishing tokens AND the object's (distinguishing = IDF learned from the document's own entities); (2) structural exclusion of titles/dense headers; (3) a predicate-direction check: embedding cosine between the window and the humanized claim must clear a data-calibrated threshold.
semantic layer | unplaced facts | bge-small cosine argmax over source windows, calibrated threshold + margin guard.
unanchorable | the honest bucket | Everything else: anchor=None, hypothesis_only=true, confidence ≤ 0.4. Never dropped, never given a plausible-looking neighbour span. This is what separates STATED from INTERPRETED, and it doubles as a hallucination filter.
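
The content lane and the unanchorable fallback can be sketched with a whitespace- and case-tolerant regular expression. This is a minimal illustration, not the citer's implementation: the function names and the result dictionary shape are assumptions, the relational and semantic lanes are skipped entirely, and the fallback confidence is simply set to the 0.4 ceiling.

```python
import re
from typing import Optional, Tuple


def find_literal(source: str, literal: str) -> Optional[Tuple[int, int]]:
    """Locate a literal in the source, tolerating case and whitespace differences."""
    tokens = literal.split()
    if not tokens:
        return None
    # Any run of whitespace in the source may separate consecutive tokens.
    pattern = r"\s+".join(re.escape(token) for token in tokens)
    match = re.search(pattern, source, flags=re.IGNORECASE)
    return (match.start(), match.end()) if match else None


def cite(source: str, literal: str) -> dict:
    """Anchor the literal if it is really there; otherwise say so honestly."""
    hit = find_literal(source, literal)
    if hit is None:
        # Unanchorable: kept, flagged, and never given a neighbouring span.
        return {"anchor": None, "hypothesis_only": True, "confidence": 0.4}
    start, end = hit
    return {
        "anchor": {
            "start_offset": start,
            "end_offset": end,
            "surface_text": source[start:end],   # verbatim, as written in the source
        },
        "hypothesis_only": False,
    }


source = "Scale:  2 Million\ndocuments were indexed."
stated = cite(source, "2 million documents")
interpreted = cite(source, "3 billion records")
```

The anchored result quotes the source verbatim, including its original casing and line break, rather than the normalized literal. This sketch takes the first match; a literal that occurs several times would need a disambiguation step that is not shown here.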

A real chain, resolved live

One genealogy claim, walked end to end on the live database — four indexed joins from a statement UUID to tamper-evident source bytes:

statement   677ab36e-…  "4th cousin or half 3rd cousin 1x removed"   (ctx:genes/lisa-raquel)
   ▼ evidence_link  anchored_at, confidence 0.4
span        f795c9e8-…  char_offset 55→95, surface_text = the exact phrase
   ▼ revision_id
revision    6dc0bd52-…  115,034 bytes; substring(body from 56 for 40) == surface_text ✓
   ▼ document_id / blob_hash
document    donto:blob/sha256/fff3242c…  (an Ancestry DNA-match API capture, application/json)
blob        sha256 fff3242c…  — file on disk verified: sha256sum(file) == content address ✓

The same shape works for a current production row: a BEAM-10M claim (ex:large-volume isChallenge true, asserted at E2) resolves through an extracted_from link at confidence 1.0 to the span "2 million documents" at offsets 465–484 of its source chunk. 437K+ evidence links live under the BEAM contexts alone.
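
The walk itself can be reproduced on a toy database. The sketch below uses Python's built-in sqlite3 as a stand-in for the live database and takes table and key names from the chain diagram; the remaining columns (link_type, iri, content), the ids, and the sample text are made up. Storing blob bytes in a column is a simplification, since donto keeps them on local FS or GCS.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE donto_blob (sha256 TEXT PRIMARY KEY, content BLOB);
CREATE TABLE donto_document (document_id TEXT PRIMARY KEY, iri TEXT);
CREATE TABLE donto_document_revision (
    revision_id TEXT PRIMARY KEY, document_id TEXT, blob_hash TEXT, body TEXT);
CREATE TABLE donto_span (
    span_id TEXT PRIMARY KEY, revision_id TEXT,
    start_offset INTEGER, end_offset INTEGER, surface_text TEXT);
CREATE TABLE donto_evidence_link (
    statement_id TEXT, target_span_id TEXT, link_type TEXT, confidence REAL);
""")

# Made-up source text; offsets are 0-based with an exclusive end.
raw = b"Corpus note: the index covers 2 million documents."
body = raw.decode("utf-8")
snippet = "2 million documents"
start = body.index(snippet)
end = start + len(snippet)
blob_hash = hashlib.sha256(raw).hexdigest()

conn.execute("INSERT INTO donto_blob VALUES (?, ?)", (blob_hash, raw))
conn.execute("INSERT INTO donto_document VALUES ('doc-1', 'ex:doc/1')")
conn.execute("INSERT INTO donto_document_revision VALUES ('rev-1', 'doc-1', ?, ?)",
             (blob_hash, body))
conn.execute("INSERT INTO donto_span VALUES ('span-1', 'rev-1', ?, ?, ?)",
             (start, end, snippet))
conn.execute("INSERT INTO donto_evidence_link VALUES "
             "('stmt-1', 'span-1', 'extracted_from', 1.0)")

# Four joins from a statement id to the canonical bytes.
# SQL substr() is 1-based, hence start_offset + 1.
row = conn.execute("""
    SELECT s.surface_text,
           substr(r.body, s.start_offset + 1, s.end_offset - s.start_offset),
           d.iri, b.sha256, b.content
    FROM donto_evidence_link l
    JOIN donto_span s              ON s.span_id = l.target_span_id
    JOIN donto_document_revision r ON r.revision_id = s.revision_id
    JOIN donto_document d          ON d.document_id = r.document_id
    JOIN donto_blob b              ON b.sha256 = r.blob_hash
    WHERE l.statement_id = ?
""", ("stmt-1",)).fetchone()

surface_text, sliced, iri, sha256_key, content = row
snippet_ok = surface_text == sliced
bytes_ok = hashlib.sha256(content).hexdigest() == sha256_key
```

The two booleans correspond to the two checkmarks in the resolved chain: the snippet matches the slice of the revision body, and the bytes hash to their own content address.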

Extraction bookkeeping

table | size | purpose
donto_extract_queue | 360,899 rows | The work queue donto-agent leases from (FOR UPDATE SKIP LOCKED): 339K pending / 21K done at time of writing. This is the claims-coverage frontier.
donto_extraction_run | 93 rows | One row per extraction run: model identity, version, parameters. This is what produced_by links point at.
donto_trace_log | 2.5M rows | The legacy re-anchoring backfill's audit trail (matching textSpan literals to real spans across revisions).
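
The leasing behaviour of the extract queue can be modelled in memory. This sketch only imitates the effect of FOR UPDATE SKIP LOCKED, where a worker takes pending rows and silently skips any row another worker already holds; it is single-threaded, and the class and method names are hypothetical rather than donto-agent's.

```python
from typing import Dict, Iterable, List, Set


class ExtractQueue:
    def __init__(self, row_ids: Iterable[int]) -> None:
        self._status: Dict[int, str] = {row_id: "pending" for row_id in row_ids}
        self._leased: Set[int] = set()     # rows currently held by some worker

    def lease(self, limit: int) -> List[int]:
        """Take up to `limit` pending rows, skipping rows that are already held."""
        batch: List[int] = []
        for row_id, status in self._status.items():
            if len(batch) == limit:
                break
            if status == "pending" and row_id not in self._leased:
                self._leased.add(row_id)
                batch.append(row_id)
        return batch

    def complete(self, row_id: int) -> None:
        """Release the lease and mark the row done."""
        self._leased.discard(row_id)
        self._status[row_id] = "done"

    def counts(self) -> Dict[str, int]:
        pending = sum(1 for s in self._status.values() if s == "pending")
        return {"pending": pending, "done": len(self._status) - pending}


queue = ExtractQueue(range(5))
worker_a = queue.lease(2)
worker_b = queue.lease(2)   # skips the rows worker_a holds instead of waiting
queue.complete(worker_a[0])
```

The two workers never receive the same row, and neither blocks on the other, which is the property that lets many extraction workers drain one queue.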