Handbook · Chapter 12 of 12

Trust but verify: validators, diagnostics, and the lesson catalog

An atomic distro's promise — "the image you tested is the image you run" — is only as good as the testing. Margine validates at three altitudes: build time (CI file checks inside the candidate image, chapter 9), boot time (QEMU smoke gate before :stable promotion, chapter 9), and runtime (the margine-validate-* suite on the deployed system). This chapter covers the runtime layer, the diagnostics bundle, the manual QEMU/ISO workflow, and the catalog of real bugs the project hit — each reduced to symptom → root cause → fix → generalized rule.

12.1 The margine-validate-* suite

Design principles

The validators live in margine-fedora-atomic/scripts/ and are baked into the image as /usr/bin/margine-validate-*. Three deliberate constraints:

  • Read-only. A validator never mutates state. Repair is the job of configure-* scripts and ujust recipes.
  • On-demand, not hooks. They are not pre/post-update hooks (updates.validators_on_demand in margine-atomic.yaml). An update gate that can wedge updates is worse than drift.
  • warn vs fail discipline. Only hard contract violations fail (exit 1); environment-dependent findings warn and keep exit 0. The scripts use set -u and set -o pipefail but deliberately omit set -e: a probe returning non-zero is data, not a crash.

warn() {
  warnings=$((warnings + 1))
  printf 'WARN: %s\n' "$1"
}

fail() {
  failures=$((failures + 1))
  printf 'FAIL: %s\n' "$1"
}

margine-fedora-atomic/scripts/validate-atomic-layout:16-24 — shared counter pattern across the whole suite; the summary section exits 1 only if failures > 0.
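
The same discipline carries over to Python-based checks. A minimal sketch, assuming a hypothetical Report class that is not taken from the repository: probes record findings, and only failures change the exit status.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical collector mirroring the shell warn()/fail() counters."""
    warnings: list[str] = field(default_factory=list)
    failures: list[str] = field(default_factory=list)

    def warn(self, msg: str) -> None:
        # Environment-dependent finding: recorded, never fatal.
        self.warnings.append(msg)
        print(f"WARN: {msg}")

    def fail(self, msg: str) -> None:
        # Hard contract violation: the only thing that flips the exit status.
        self.failures.append(msg)
        print(f"FAIL: {msg}")

    def exit_code(self) -> int:
        return 1 if self.failures else 0


report = Report()
report.warn("vainfo not installed")
warn_only_status = report.exit_code()
report.fail("/ is not mounted read-only")
final_status = report.exit_code()
```

Keeping the exit decision in one place makes the contract easy to audit: adding a new probe cannot accidentally turn a warning into a gate.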

How they get into the image

The build fetches them from the spec repo at a pinned ref and installs them under a margine- prefix:

for s in \
    ... \
    validate-atomic-layout \
    validate-cachyos-kernel \
    validate-hardware-media-stack \
    validate-gaming-runtime \
    validate-margine-system \
    validate-declared-state \
    collect-diagnostics ; do
  retry_curl "${MARGINE_REPO}/${MARGINE_REF}/scripts/${s}" \
             "/usr/bin/margine-${s}"
  chmod 0755 "/usr/bin/margine-${s}"
done

margine-image/build_files/40-spec-scripts/install.sh:37-57 — preceded by a preflight curl --head against the repo: a 404 fails the build loudly instead of shipping an image with silently-missing tooling.

validate-atomic-layout

Checks the ostree/bootc contract: rpm-ostree status works, / is mounted ro, /home → /var/home, btrfs backs the layout, Secure Boot state, LUKS2/TPM2 enrollment in /etc/crypttab. One subtlety worth stealing — on composefs systems /usr has no separate mountpoint, and a naive "is /usr ro?" check false-positives:

# On Silverblue with composefs (Fedora 39+), /usr is embedded in the root
# overlay and has no separate mountpoint. This is expected and correct.
if findmnt /usr >/dev/null 2>&1; then
  ...
else
  root_fstype_inner=$(mount_field FSTYPE /)
  if [[ "$root_fstype_inner" == "overlay" ]]; then
    ok "/usr is embedded in the composefs root overlay (expected on Silverblue)"

margine-fedora-atomic/scripts/validate-atomic-layout:111-122

validate-cachyos-kernel

Confirms the kernel replacement (chapter 3) actually took: uname -a matches cachy, CachyOS RPMs are present, the COPR repo file is installed, and stock Fedora kernel packages that are still visible produce a warn (the check that catches half-applied deployments). The running-kernel check:

if printf '%s\n' "$kernel" | grep -Eiq 'cachy|cachyos'; then
  ok "running kernel appears to be CachyOS"
else
  fail "running kernel does not appear to be CachyOS"
fi

margine-fedora-atomic/scripts/validate-cachyos-kernel:34-38

The signature/MOK side lives in validate-margine-system §4, which goes beyond mokutil --sb-state to verify the actual trust anchor:

if [[ "$SB_STATE" =~ enabled ]]; then
  ok "Secure Boot is enabled"
  if "${SUDO[@]}" mokutil --list-enrolled 2>/dev/null | grep -qiE 'margine|daniel'; then
    ok "Margine MOK is enrolled"
  else
    bad "Margine MOK NOT enrolled — CachyOS kernel will not load on next boot if SB stays on"
  fi
else
  warn "Secure Boot is disabled — running unsigned kernel without verification"
fi

margine-fedora-atomic/scripts/validate-margine-system:214-222 — "SB on but MOK missing" is the one state that bricks the next boot, hence the only bad.

validate-hardware-media-stack

The codec/GPU chapter (10-hardware-media-stack.md) made claims; this validator proves them per-machine: PipeWire/WirePlumber user services active, glxinfo -B, vulkaninfo --summary, vainfo, ffmpeg -hwaccels, gst-inspect-1.0 va, clinfo/rocminfo, and an application-level probe (darktable-cltest must report "OpenCL AVAILABLE and ENABLED" — testing the consumer, not just the ICD). Everything is warn-only: hardware varies, so the script is a structured report, not a gate. The run_if_present helper prints the exact command before running it, so the output doubles as a reproduction script.

validate-gaming-runtime

Checks the opt-in gaming layer: Flatpak launchers (flatpak info per app-id), host helpers (gamescope, mangohud, vkbasalt, gamemoded), Vulkan layer files in all three search paths, Steam's sandbox permissions (flatpak info --show-permissions), controller udev packages, and a policy check that persistent kernel.split_lock_mitigate=0 hasn't been smuggled into /etc/sysctl.d (validate-gaming-runtime:161-168).

validate-declared-state — making the YAML load-bearing

The drift detector compares declarations/margine-atomic.yaml against the running system: every declared host package must resolve via rpm -q --whatprovides, every declared Flatpak must appear in flatpak list --system, and every declared extension UUID must be present in the system or user extension dir. The spec lookup order makes the same script work in the dev tree, in CI, and on-host:

def find_spec() -> Path:
    env = os.environ.get("MARGINE_SPEC")
    ...
    candidates = [
        Path("/usr/declarations/margine-atomic.yaml"),
        Path("/usr/share/margine/declarations.yaml"),
        Path(__file__).resolve().parent.parent / "declarations" / "margine-atomic.yaml",
    ]

margine-fedora-atomic/scripts/validate-declared-state:105-127 — the /usr/declarations symlink is created by 40-spec-scripts/install.sh:72-73 because six of seven configure-* scripts resolve the YAML relative to __file__, which from /usr/bin/ lands there.

Flatpak absences are warn, not fail — Margine's DEFER queue (chapter 6) means a declared app may legitimately not be installed yet. Direction matters in a drift detector: "spec says X, system lacks X" and "system has X, spec never mentioned it" are different bugs; this tool currently surfaces the first.
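
The two drift directions can be sketched with plain set arithmetic. This is an illustration only: the drift function and the app-ids are hypothetical, and the real validator reports just the first direction.

```python
def drift(declared: set[str], installed: set[str]) -> dict[str, list[str]]:
    """Split drift by direction; the two lists point at different bugs."""
    return {
        # Spec says X, system lacks X (the direction the validator reports today).
        "missing": sorted(declared - installed),
        # System has X, spec never mentioned it (not reported today).
        "undeclared": sorted(installed - declared),
    }


# Hypothetical app-ids, for illustration only.
declared = {"org.example.Editor", "org.example.Player"}
installed = {"org.example.Player", "org.example.Game"}
result = drift(declared, installed)
print(result)
```

Reporting "undeclared" would need an allow-list for user-installed apps, which is one plausible reason a tool might stop at the first direction.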

validate-margine-system

The comprehensive runtime acceptance test: identity (VARIANT_ID=margine), kernel, MOK, branding assets, GNOME settings as actually applied ("photograph the current state": GDM/Shell can shadow dconf defaults in ways file checks never see), Flatpaks, failed units. Expected values are hardcoded at the top under a comment instructing maintainers to "keep in sync with declarations/margine-atomic.yaml and build.sh"; see Lesson 10 for why that sync discipline is load-bearing.

12.2 margine-collect-diagnostics

When a validator fails on someone else's machine, you want one command that produces an attachable artifact. The collector runs ~60 captures into a timestamped directory and tars it:

capture() {
  local name=$1
  shift
  {
    printf '$'
    printf ' %q' "$@"
    printf '\n\n'
    "$@"
  } >"${out_dir}/${name}" 2>&1 || true
}

capture rpm-ostree-status-json.txt rpm-ostree status --json
capture journal-warnings.txt journalctl -b -p warning..alert --no-pager

margine-fedora-atomic/scripts/collect-diagnostics:14-23,39,52 — every file begins with the exact command that produced it; || true because a failing probe is itself a finding.
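
For tooling written in Python, the command-as-header convention might look like the sketch below. It is an analogue of the shell capture(), not code from the repository; the probe name margine-no-such-probe is deliberately nonexistent to show that a probe which cannot start still leaves a file behind.

```python
import shlex
import subprocess
import sys
import tempfile
from pathlib import Path


def capture(out_dir: Path, name: str, *cmd: str) -> Path:
    """Run cmd and write '$ <command>', a blank line, then its combined output."""
    target = out_dir / name
    header = "$ " + shlex.join(cmd) + "\n\n"
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True, check=False)
        body = proc.stdout
    except OSError as exc:
        # A probe that cannot even start (missing binary) is still a finding.
        body = f"{exc}\n"
    target.write_text(header + body)
    return target


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    ok_text = capture(out, "echo.txt", sys.executable, "-c", "print('hello')").read_text()
    missing_text = capture(out, "missing.txt", "margine-no-such-probe").read_text()
```

check=False plays the role of || true: a non-zero exit is written into the bundle instead of aborting the collection.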

Coverage mirrors the validators (mounts, crypttab, MOK, media stack, gaming, GNOME interface keys, fonts, repo files, btrfs subvolumes) so a bundle can answer any validator's question offline. The script sets umask 077 so the bundle is readable only by its owner, and an explicit trailer warns that the archive contains hostnames, usernames, and journal excerpts. Say this in the tool, not in docs nobody reads.

12.3 QEMU validation workflow for ISOs

CI's smoke gate (chapter 9) answers "does the qcow2 boot to GDM". Installer ISOs need a human: Anaconda flow, partitioning, MOK staging, first-boot UX. The lab workflow (docs/02b-lab-vm-setup.md) uses libvirt with real Secure Boot and a software TPM:

virt-install \
    --connect qemu:///system \
    --name margine-smoketest \
    --memory 8192 \
    --vcpus 4 \
    --disk size=64,format=qcow2 \
    --boot uefi,firmware.feature0.name=secure-boot,firmware.feature0.enabled=yes,loader.secure=yes \
    --tpm backend.type=emulator,backend.version=2.0,model=tpm-crb \
    --cdrom ~/data/inbox/10-downloads/bluefin-stable-x86_64.iso \
    --graphics spice \
    --network network=default \
    --noautoconsole

margine-fedora-atomic/docs/02b-lab-vm-setup.md:117-129 — OVMF secboot firmware + swtpm CRB device gives the full enrollment path: install → reboot → MokManager → enroll with the documented passphrase → CachyOS kernel boots under SB. virt-viewer --connect qemu:///system margine-smoketest opens the console.

Two operational details the doc captures. First, a session gotcha: virsh defaults to qemu:///session (per-user) while the NAT default network lives in qemu:///system, so either export LIBVIRT_DEFAULT_URI=qemu:///system or pass --connect everywhere. Second, the snapshot ladder (margine-stable-<date> after each verified milestone) makes rollback the recovery default instead of reinstalling. Test ISOs reach the lab via Internet Archive test_collection items (auto-expire ~30 days, publish-titanoboa-test-iso.yml) because GHA artifact egress to residential lines runs at ~1-1.5 MB/s: at that rate an 8 GB ISO would take roughly 1.5 to 2.2 hours.

For the automated half, the smoke gate's hard-won detail is what to grep for. "Reached target multi-user.target" is not reliably emitted on serial consoles on recent systemd:

# Multi-marker approach (2026-06-01): systemd recent does NOT
# always emit "Reached target multi-user.target" verbatim on
# the serial console (seen on Fedora 44 with CachyOS kernel ...)
for i in $(seq 1 1200); do
  if grep -qE "Started.*gdm\.service|Reached target graphical\.target|margine login:" serial.log; then
    echo "✓ Boot reached usable state at second $i"

margine-image/.github/workflows/smoke-boot.yml:176-188 — accept any of three equivalent "userspace is up" signals; a single-marker gate produced false boot failures.
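
The same multi-marker idea, sketched in Python for anyone post-processing a serial log outside the workflow. The regular expression is the one from the grep above; the function name and the sample log lines are illustrative and were not captured from a real run.

```python
import re
from typing import Optional

# The three markers from the smoke gate's grep, as one alternation.
BOOT_MARKERS = re.compile(
    r"Started.*gdm\.service|Reached target graphical\.target|margine login:"
)


def first_boot_marker(serial_log: str) -> Optional[str]:
    """Return the first matching line, or None if userspace never came up."""
    for line in serial_log.splitlines():
        if BOOT_MARKERS.search(line):
            return line.strip()
    return None


# Illustrative log excerpt, not captured from a real run.
sample = """\
[  OK  ] Reached target basic.target
[  OK  ] Started gdm.service - GNOME Display Manager.
"""
hit = first_boot_marker(sample)
miss = first_boot_marker("[  OK  ] Reached target basic.target\n")
```

Returning the matching line rather than a boolean keeps the evidence: the gate's log then shows which of the three signals fired.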

12.4 Lesson catalog

Every entry below is a real Margine incident with the fix in-tree. The generalized rules are the transferable part.

Lesson 1 — mksquashfs: everything after -e is an exclude

Symptom: Live ISOs larger and slower to boot than projected; the build log says Creating 4.0 filesystem ... gzip compressed despite -comp zstd -Xcompression-level 19 being passed. Root cause: mksquashfs treats every argument after the first -e as an exclude name. The invocation put the exclude list before the compressor flags, so -comp, zstd, -Xcompression-level, 19 were silently consumed as bogus excludes and the default gzip applied. Fix (upstreamed; recorded in the commit message):

fix: pass mksquashfs exclude list last so -comp zstd is honored

Moving the compressor options before -e (and passing both excludes to
a single -e, which has to be the last option per the man page) fixes
it. Tested with squashfs-tools 4.6.1: zstd is applied and sysroot and
ostree are still excluded.

margine-fedora-atomic commit 32afa48 (2026-06-10) — "every consumer is currently shipping gzip instead of the intended zstd-19." Rule: Greedy/positional CLI options invalidate "the flags are present, therefore they applied" reasoning. Assert outcomes in build logs (the compressor line, the final size), not invocations.
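
Both halves of the rule can be sketched in a few lines: build the argument list so the greedy option is structurally last, then assert on what the tool reported. The helper names and the rootfs/squashfs.img paths are hypothetical; the first log string is the one quoted in the symptom, the second is an assumed counterpart for the fixed build.

```python
import re
from typing import Optional


def squashfs_argv(src: str, dst: str, options: list[str], excludes: list[str]) -> list[str]:
    """Build an argv in which -e and its names are always the final arguments."""
    if "-e" in options:
        raise ValueError("pass excludes separately so they can be placed last")
    argv = ["mksquashfs", src, dst, *options]
    if excludes:
        argv += ["-e", *excludes]
    return argv


def compressor_in_log(build_log: str) -> Optional[str]:
    """Extract the compressor the tool reports, e.g. 'gzip' from '... gzip compressed'."""
    match = re.search(r"(\w+) compressed", build_log)
    return match.group(1) if match else None


argv = squashfs_argv(
    "rootfs", "squashfs.img",                      # hypothetical paths
    ["-comp", "zstd", "-Xcompression-level", "19"],
    ["sysroot", "ostree"],
)
bad_log = "Creating 4.0 filesystem ... gzip compressed"    # as quoted in the symptom
fixed_log = "Creating 4.0 filesystem ... zstd compressed"  # assumed form after the fix
```

A build step would then fail if compressor_in_log(log) differs from the requested compressor, which checks the outcome rather than the invocation.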

Lesson 2 — [ -f ] on a directory: the hybrid ISO that wasn't

Symptom: ISOs advertised as hybrid BIOS+UEFI never boot on BIOS; nothing in the build fails. Root cause: Titanoboa's build_iso.sh:32 guards the BIOS GRUB module copy with [ -f ] against /usr/lib/grub/i386-pc — a directory, so the test is always false and the copy never runs; its xorriso call also lacks an El Torito -b image. Found by a build-log scan, not by a failure. Fix: Margine ships the truth instead of the claim:

if [[ -d /usr/lib/grub/i386-pc ]]; then
  echo "NOTE: /usr/lib/grub/i386-pc present, but current Titanoboa produces a UEFI-only ISO (no BIOS El Torito; upstream build_iso.sh:32 -f-vs-directory bug)"

margine-image/live-env/src/build.sh:61-62 — BIOS stays non-gating per ADR-0008 §4 (all reference hardware is UEFI). Rule: A wrong file-test operator fails silently in guard position. Read your vendored dependencies' build logs once, end to end; every claim a pipeline makes ("hybrid", "compressed", "signed") needs one observable check.
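
The silent failure is easy to reproduce. The sketch below uses Python's os.path helpers as stand-ins for the shell tests (isfile for [ -f ], isdir for [ -d ]); the temporary directory plays the role of /usr/lib/grub/i386-pc.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as module_dir:
    # Analogue of [ -f ]: true only for regular files, so false for a directory.
    guard_as_written = os.path.isfile(module_dir)
    # Analogue of [ -d ]: the test the guard needed.
    guard_as_intended = os.path.isdir(module_dir)

print(guard_as_written, guard_as_intended)
```

Neither test raises an error on the "wrong" kind of path, which is why a guard like this can stay false for years without anything failing.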

Lesson 3 — ccache poisoning container builds

Symptom: Compiling wayland-scroll-factor inside the image build dies on every TU with ccache: error: File exists. Root cause: The Bluefin base puts /usr/lib64/ccache first in PATH, so cc is a ccache shim; in the build container ccache's cache dir isn't writable and every compile aborts. Fix:

# Margine's PATH puts /usr/lib64/ccache first; in the build container
# ccache's cache dir isn't writable and every compile dies with
# "ccache: error: File exists" ... Compile without it — a one-shot
# build gains nothing from a compiler cache anyway.
export CCACHE_DISABLE=1

margine-image/build_files/45-wsf/install.sh:36-41. Rule: Deriving from an opinionated base image means inheriting its developer-experience knobs in a context they were never tested in. Neutralize host-oriented toolchain shims (ccache, sccache, interactive PATH injection) in build sections; a one-shot layer build gains nothing from them.

Lesson 4 — SELinux xattrs vs rsync in kickstart %post

Symptom: BAKE Flatpaks present on disk after install but fail to launch — AVC denials in the journal. Root cause: ostree/bootc reset /var per deployment, so baked Flatpaks must be rsync'd from the installer rootfs into the target (%post --nochroot). A naive rsync -a drops POSIX xattrs; copying SELinux labels verbatim from the installer context is also wrong, because the target's labels belong to ostree-finalize. Fix: the original BIB kickstart preserved xattrs/ACLs/hardlinks:

rsync -aAXUHK --open-noatime /var/lib/flatpak "$DEPLOY_DIR/var/lib/"

(That BIB kickstart, iso-gnome.toml, has since been deleted.) The Titanoboa migration then pinned the production-verified refinement as the standing invariant in live-env/src/anaconda/post-scripts/install-flatpaks.ks: rsync -aAXUHKP --filter='-x security.selinux' — "preserves POSIX xattrs but strips SELinux labels which ostree-finalize restores. Flatpak directories have system_data_t/flatpak_t labels; if dropped or wrong, Flatpaks fail to launch with AVC denials" (ADR-0008 §4). Belt-and-suspenders: every BAKE app is also in /usr/share/flatpak/preinstall.d/margine-defaults.preinstall, so a silently failed rsync still self-heals at first boot. Rule: On SELinux systems "copied the bytes" ≠ "copied the file". Decide explicitly, per metadata class, whether to preserve or strip — and let the component that owns labeling (ostree-finalize, restorecon) do its job. Always pair an install-time copy with a first-boot fallback.

Lesson 5 — Clutter 18 unrealize assert: hide before detach

Symptom: Launching an app from the search-light overlay SIGABRTs the entire gnome-shell — on Wayland the session dies, and GNOME's crash protection then sets disable-user-extensions=true, knocking out all extensions. Root cause: extension.js _release_ui() calls remove_child() on the entry while the overlay is still mapped; Clutter 18's stricter unrealize path asserts !clutter_actor_is_mapped(self) and aborts. Verified by coredump + journal on the reference host; upstream had open reports and no fix. Fix: build-time patch of the baked extension — unmap before detaching:

new = """  _release_ui() {
    if (this._entry) {
      if (this._entry.get_parent()) {
        this._entry.hide(); // margine: unmap before detach (Clutter 18 unrealize assert)
        this._entry.get_parent().remove_child(this._entry);"""

margine-image/build_files/build-margine-extensions.sh:203-220 — applied via exact-match string replace, idempotent (greps for its own marker first), and soft-fail: if upstream's code changes the patch logs a WARN instead of failing the build, because a mitigation must not become load-bearing. Rule: Shell extensions are in-process patches to a moving target; when you bake them, you own their crashes. Detach-while-mapped is the canonical GNOME-major-bump breakage: hide/unmap actors before remove_child(). Build-time source patches beat forks for one-liners — but make them idempotent and soft-failing.
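
A minimal sketch of an idempotent, soft-failing string patch, under stated assumptions: apply_patch and the simplified JavaScript stand-ins are hypothetical, and the "exactly one match" rule is a choice made for this illustration, not a documented detail of the build script.

```python
def apply_patch(source: str, old: str, new: str, marker: str) -> tuple[str, str]:
    """Return (text, status); status is 'already-patched', 'patched', or 'skipped'."""
    if marker in source:
        # Idempotent: a second run finds its own marker and does nothing.
        return source, "already-patched"
    if source.count(old) != 1:
        # Soft-fail: upstream changed, so warn and ship the file untouched.
        print("WARN: patch target not found exactly once; leaving file unpatched")
        return source, "skipped"
    return source.replace(old, new), "patched"


# Simplified stand-ins for the extension source, for illustration only.
OLD = "parent.remove_child(entry);"
NEW = "entry.hide(); // margine: unmap before detach\n" + OLD
MARKER = "margine: unmap before detach"

upstream = "function release() {\n" + OLD + "\n}\n"
patched, first = apply_patch(upstream, OLD, NEW, MARKER)
_, second = apply_patch(patched, OLD, NEW, MARKER)
_, drifted = apply_patch("function release() {}\n", OLD, NEW, MARKER)
```

The marker check has to come first: the replacement text still contains the original line, so matching on the old string alone would patch the file again on every rebuild.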

Lesson 6 — dconf list replacement shadows distro keybindings

Symptom: Super+period (Smile emoji picker) works on a fresh install, then goes dead forever after the first ujust margine-bootstrap. Root cause: The image ships the binding at the distro dconf layer (/etc/dconf/db/distro.d/07-margine-custom-keybindings). configure-gnome-keybindings then writes the custom-keybindings path list at the user layer — and dconf lists replace, they don't merge. The user-layer list, which didn't contain the smile slot, shadowed the distro entry wholesale:

def apply_custom(custom_list: list[dict], dry: bool) -> None:
    paths = [f"{CUSTOM_BASE_PATH}{c['name']}/" for c in custom_list]
    run(["gsettings", "set", CUSTOM_LIST_SCHEMA, "custom-keybindings",
         gvariant_strings(paths)], dry)

margine-fedora-atomic/scripts/configure-gnome-keybindings:275-287 — REPLACES the whole list. Fix: declare the slot in the spec so bootstrap recreates it (commit 4ce4722):

- name: smile
  # NB: bootstrap REPLACES the whole custom-keybindings path list,
  # which used to shadow the distro-level margine-smile entry ...
  binding: "<Super>period"
  command: "flatpak run it.mijorus.smile"

margine-fedora-atomic/declarations/margine-atomic.yaml (keybindings.custom). Rule: dconf layering is per-key, and a list is one key. Any tool that writes a list key at the user layer silently shadows every distro-layer element not in its input. Either own the full list in one place (Margine's choice: the spec) or read-merge-write — never blind-set.
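
For comparison, the read-merge-write alternative that Margine did not choose could be sketched as follows. The merge_paths helper and the terminal slot are hypothetical, and BASE assumes the standard GNOME custom-keybindings location matches CUSTOM_BASE_PATH.

```python
# Assumption: the standard GNOME location for custom keybinding slots,
# taken here to match CUSTOM_BASE_PATH.
BASE = "/org/gnome/settings-daemon/plugins/media-keys/custom-keybindings/"


def merge_paths(effective: list[str], wanted: list[str]) -> list[str]:
    """Union of both lists, keeping first-seen order and dropping duplicates."""
    merged: list[str] = []
    for path in [*effective, *wanted]:
        if path not in merged:
            merged.append(path)
    return merged


effective = [BASE + "margine-smile/"]  # what the distro layer provides before bootstrap
wanted = [BASE + "terminal/"]          # hypothetical entry from the spec

blind_set = wanted                     # what a plain "gsettings set" would write
merged = merge_paths(effective, wanted)
```

The trade-off is that a merge never removes anything, so retired slots linger at the user layer. Owning the full list in the spec avoids that.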

Lesson 7 — dynamic workspaces make move-to-workspace-N a silent no-op

Symptom: SUPER+SHIFT+N (move window to workspace N) "feels broken" — sometimes works, usually does nothing, no error anywhere. Root cause: With GNOME dynamic workspaces, the native move-to-workspace-N and switch-to-workspace-N bindings only act on workspaces that already exist; they do not create workspace N the way Hyprland does. Margine binds Super+1..0 for Hyprland muscle memory, so most targets didn't exist yet. Fix: static workspace model, pre-created (commit 4ce4722, count later tuned 10→5 in 32afa48):

workspaces:
  dynamic: false
  # 5, not 10: static workspaces are all pre-created and always visible
  # in the overview/pager, and a permanent wall of 10 felt like clutter.
  # SUPER+[SHIFT+]6..0 bindings stay declared — harmless no-ops until
  # the count is raised again.
  count: 5
  names: ["1", "2", "3", "4", "5"]

margine-fedora-atomic/declarations/margine-atomic.yaml:795-803 — and validate-margine-system:470-481 asserts both num-workspaces and dynamic-workspaces=false ("SUPER+1..0 binds will misbehave"). Rule: Porting keybindings between WMs ports the keys, not the semantics. GNOME's numbered-workspace bindings presuppose static workspaces; flip org.gnome.mutter dynamic-workspaces off whenever you ship direct-jump binds, and validate the pair of settings, since either alone breaks the UX.

Lesson 8 — xdg-open toward a Flatpak exports one file, not a tree

Symptom: Offline docs open in the (Flatpak) default browser; the page renders, but every link to a sibling page is dead. Root cause: When a Flatpak app receives a file:// URI outside its permissions, the document portal exports only that single file into the sandbox (/run/user/.../doc/...). The HTML arrives; its relative CSS/links point at siblings that were never exported. Worse, Flatpak reserves /usr — no override can ever expose the immutable seed copy. Fix: serve from a /var mirror the sandbox is granted read access to:

# Why ALWAYS the /var copy and never the /usr seed directly: the
# default browser is a Flatpak (Zen). For a file:// URI outside its
# permissions the portal exports ONLY that single file — the page
# renders but every relative link to sibling pages is dead. ...
# Flatpak reserves /usr so no override can ever expose the seed.
if [[ -f "${VAR_DIR}/docs/index.html" ]]; then
  exec xdg-open "file://${VAR_DIR}/docs/index.html"

margine-image/build_files/system_files/usr/libexec/margine/docs-open:9-27, paired with docs-refresh:66: flatpak override --system --filesystem="${DOCS_DIR}:ro" (global, not per-app, so a browser switch keeps working). Corollary in the same component: refreshing the mirror by directory-swap broke already-running sandboxes (they bind-mount the dir at app start and end up staring at the emptied old inode) — docs-refresh's sync_in() rsyncs files in place instead (docs-refresh:38-51, commit b9208eb). Rule: xdg-open file://... toward sandboxed apps exports one file, not a tree. Multi-file local content must live on a path the sandbox holds a --filesystem grant for — which excludes /usr by design — and must be updated file-wise, never by replacing the granted directory.
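
The file-wise update can be sketched as below. This is a Python illustration of the idea, not the rsync-based sync_in() from docs-refresh; the seed and mirror directories are temporary stand-ins. The point it demonstrates is that the granted directory keeps its inode across refreshes.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_in(src: Path, dst: Path) -> None:
    """Mirror src into dst file by file; dst itself is never replaced."""
    dst.mkdir(parents=True, exist_ok=True)
    wanted = {p.relative_to(src) for p in src.rglob("*") if p.is_file()}
    for rel in wanted:
        target = dst / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src / rel, target)
    # Remove files that disappeared upstream, again without touching dst itself.
    for stale in [p for p in dst.rglob("*") if p.is_file()]:
        if stale.relative_to(dst) not in wanted:
            stale.unlink()


with tempfile.TemporaryDirectory() as tmp:
    seed, mirror = Path(tmp, "seed"), Path(tmp, "mirror")
    seed.mkdir()
    (seed / "index.html").write_text("v1")
    sync_in(seed, mirror)
    inode_before = os.stat(mirror).st_ino
    (seed / "index.html").write_text("v2")
    (seed / "guide.html").write_text("new page")
    sync_in(seed, mirror)
    inode_after = os.stat(mirror).st_ino
    refreshed = (mirror / "index.html").read_text()
    pages = sorted(p.name for p in mirror.iterdir())
```

A sandbox that bind-mounted the mirror before the refresh still sees the new files, because the directory it mounted is the same one that was updated.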

Lesson 9 — journald is the first victim of a host I/O stall

Symptom: Post-incident analysis of a build-host freeze finds the previous boot's journal ends 23 hours before the stall — zero hung-task, nvme, or zfs errors persisted. The postmortem cannot prove its own trigger hypothesis. Meanwhile HTTP uptime checks stayed green because the reverse proxy kept serving from page cache. Root cause: journald persists through the same I/O path that is stalling. It blocks (D state) or dies before the interesting kernel messages are written; everything after that exists only in a ring buffer that the power-cycle erases. Evidence collection and failure share a single point of failure. Fix/follow-ups from the incident note: ship kernel messages off-box (netconsole or remote syslog — "so the nvme/zfs messages survive the next stall; without it, every postmortem stays incomplete") and replace HTTP uptime checks with a write+fsync heartbeat probe (cron touches the pool and pings ntfy; silence = alarm).

proxmox-pve1/docs/notes/2026-06-11-pve1-io-stall-power-cycle.md — fourth storage incident in a month on the DRAM-less single-NVMe ZFS host; the same class of failure that earlier killed Margine's self-hosted runner (chapter 9). Rule: Telemetry that shares a failure domain with the thing it observes will be lost exactly when you need it. For storage incidents: off-box kernel logging, and probes that exercise the write path (fsync), not the cached read path.
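
A write+fsync probe can be very small. The sketch below is an assumption-laden illustration, not the incident note's script: write_probe and the .heartbeat file name are hypothetical, and the ntfy ping is left out.

```python
import os
import tempfile
import time
from pathlib import Path


def write_probe(directory: Path) -> float:
    """Write a timestamp, force it to stable storage, and return the seconds taken."""
    target = directory / ".heartbeat"
    started = time.monotonic()
    fd = os.open(target, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, f"{time.time()}\n".encode())
        os.fsync(fd)  # Blocks until the data is flushed, unlike a cached read.
    finally:
        os.close(fd)
    return time.monotonic() - started


with tempfile.TemporaryDirectory() as tmp:
    elapsed = write_probe(Path(tmp))
    heartbeat_written = (Path(tmp) / ".heartbeat").exists()
```

On a stalled pool the fsync call may never return, so the probe cannot report its own failure. That is why the alarm has to be raised by the receiver when pings stop arriving.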

Lesson 10 — validator sentinels must track shipped defaults

Symptom: CI run 27297409457 fails the first-boot asset validator right after a correct fix landed: the search-light border-radius default was repaired from 30 to 7 (the key is an index 0-7 into a px table [0,16,18,20,22,24,28,32], not pixels — 30 hit rads[30] = undefined and the rounding was silently skipped), but the validator still asserted the old value. Root cause: The sentinel encodes a copy of the shipped default. Two copies of one fact, changed in one place. Fix: update the sentinel in lock-step (commit b4e8680), and make it carry its own rationale:

# search-light rounded-corners daniel default: border-radius=7.0
# (the value is an INDEX 0-7 into the extension's px table, not
# pixels — 7 = 32px max rounding; the old 30 was out of range and
# silently ignored. See #94.)
grep -qE "^border-radius=7" "$DCONF_DIR/02-margine-search-light" || { echo "::error::A.3.bis search-light border-radius!=7 — daniel default lost"; fail=1; }

margine-image/.github/workflows/build.yml (A.3.bis section). Rule: A sentinel's failure mode is blocking good builds, not missing bad ones — budget for that. Default and sentinel must change in the same commit (grep CI for the old value before merging any default change), and each sentinel should cite why the value is what it is, so the next person edits it instead of deleting it. The underlying extension bug carries its own rule: schema types lie; read the consumer of a key before assuming units.
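
The units trap is clearer with the lookup written out. A sketch using the px table quoted above; radius_px is a hypothetical helper showing what a consumer that rejects out-of-range values would do, in contrast to the extension's silent skip.

```python
# Table as described in the text: the setting is an index into this list, not pixels.
RADIUS_PX = [0, 16, 18, 20, 22, 24, 28, 32]


def radius_px(index: int) -> int:
    """Resolve the border-radius index, rejecting values the table cannot serve."""
    if not 0 <= index < len(RADIUS_PX):
        raise ValueError(f"border-radius index {index} outside 0-{len(RADIUS_PX) - 1}")
    return RADIUS_PX[index]


max_rounding = radius_px(7)
try:
    radius_px(30)
    old_default_rejected = False
except ValueError:
    old_default_rejected = True
```

A sentinel could assert on the resolved pixel value as well as the raw key, which would have flagged the old default of 30 long before anyone noticed square corners.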

12.5 Alternatives & other distros

Runtime validation / drift detection

  • Margine: bespoke read-only validate-* bash/python suite + YAML drift detector — cheap, transparent, zero dependencies beyond PyYAML.
  • Bluefin/Bazzite (ublue): minimal on-host validation; rely on bootc container lint at build, huge :testing user base as the de-facto detector, and ujust doctor-style recipes. Less machinery, more community.
  • NixOS: the configuration is the system closure — drift between declaration and system is impossible by construction (only mutable state can drift); the validator equivalent is nixos-rebuild dry-activate + the module test framework.
  • openSUSE MicroOS/Aeon: transactional-update + health-checker run real boot-health checks and auto-rollback the snapshot on failure — stronger than Margine's report-only model, at the cost of surprise rollbacks.
  • Vanilla OS (ABRoot): A/B partition integrity checks before switching; drift detection scoped to the image diff.
  • Fedora Silverblue stock: rpm-ostree status + nothing — the deployment digest is the validation.

Diagnostics bundles

  • Margine collect-diagnostics: flat tarball of command outputs, command-as-header convention.
  • sos report (Fedora/RHEL): the industrial version — plugins, obfuscation profiles; heavyweight but standard for filing distro bugs.
  • Bazzite: ujust device-info / system info exporters tuned for Discord-based support.

Boot/ISO validation

  • Margine: CI QEMU serial-grep gate (qcow2) + manual virt-install lab with OVMF-SB+swtpm for ISOs and the MOK flow.
  • Fedora: openQA — screen-matching, full install matrices; the gold standard, and an entire service to operate.
  • NixOS: declarative QEMU VM tests gating channel advancement — most rigorous, Nix-only.
  • ublue: Titanoboa ISO pipelines smoke-tested mostly by maintainers + community; ADR-0008's research found Bluefin's ISO CI red for 3+ weeks from a silent action-input change — the cautionary tale for gateless pipelines.

Lessons-learned practice

  • Margine: dated docs/lessons-learned/*.md + ADRs + fix-carrying commit messages (the catalog above is assembled from them); validators grow a sentinel per regression.
  • Most small distros keep this in Discord/issue threads — unsearchable and unciteable. If a bug cost you a day, the write-up costs ten minutes and is the only artifact that compounds.

The meta-rule of the whole chapter: every lesson above was converted into either a validator check, a CI sentinel, or a comment at the exact line where the trap is — the knowledge lives where the next mistake would happen, not in a wiki.