
CI/CD for an OS: GitHub Actions as the build farm

An atomic distro's "release engineering" is a container pipeline. Margine ships from three workflows in margine-image/.github/workflows/:

Workflow | Job shape | Output
build.yml (672 lines) | build_push → sign → notify | OCI image → GHCR :candidate
smoke-boot.yml (271 lines) | smoke_boot (auto after build) | promotion :candidate → :stable
build-disk.yml (901 lines) | build_disk (qcow2) + build_iso_titanoboa → publish_ia → bump_site → notify | qcow2 + Titanoboa live ISO → Internet Archive

Everything runs on GitHub-hosted ubuntu-24.04. That was not the first choice.

9.1 Why GitHub-hosted (the PVE builder post-mortem)

Margine originally built on a self-hosted runner: a Proxmox VM (margine-builder, VM 170). It was decommissioned, and the workflow header preserves the reason:

# History (2026-06-01): we used to run this on a self-hosted PVE VM
# (margine-builder, VM 170). After two freezes — the second one
# taking the entire PVE host down with ZFS spacemap corruption (see
# proxmox-pve1/docs/operations/zfs-spacemap-corruption-recovery.md)
# — the self-hosted runner has been decommissioned. GitHub-hosted
# is exactly the "container that wakes up when a job arrives and
# shuts down after" model we wanted.

margine-image/.github/workflows/build.yml (header)

The trade: hosted runners give ~14 GiB free disk and 16 GB RAM, no persistence, no babysitting. A 14 GB bootc image build does not fit in 14 GiB — every job's first step reclaims space:

- name: Maximize build space
  # ubuntu-24.04 has ~14 GiB of free disk by default; we need
  # ~30+ GiB for the buildah cache + base image + Margine layers
  # + rechunk staging. This action removes Android/Haskell/.NET/
  # Swift/CodeQL/GHC pre-installed bundles, freeing ~30 GiB.
  uses: ublue-os/remove-unwanted-software@cc0becac701cf642c8f0a6613bbdaf5dc36b259e # v9
  with:
    remove-codeql: true

margine-image/.github/workflows/build.yml:100-107

Note the SHA-pinned action. Every third-party action in these workflows is pinned to a commit SHA with the version as a comment — the build.yml checkout step explicitly cites the tj-actions/changed-files compromise (2025-03) as the reason @vN floating tags are unsafe in a pipeline that holds kernel-signing keys.
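A pinning policy like this is easy to erode one PR at a time, so it can help to lint for it. The following is a minimal sketch, not something taken from Margine's workflows: the function name find_unpinned_actions is hypothetical, and it only understands the plain `uses: owner/repo@ref` form (local `./` actions are exempt; `docker://` references are not handled).

```shell
# Hypothetical lint: print every `uses:` reference in the given workflow
# files that is not pinned to a full 40-character commit SHA.
# Local actions (uses: ./path) are skipped because they carry no ref.
find_unpinned_actions() {
  grep -HnE '^[[:space:]]*(-[[:space:]]+)?uses:[[:space:]]' "$@" 2>/dev/null \
    | grep -vE 'uses:[[:space:]]*\./' \
    | grep -vE '@[0-9a-f]{40}([^0-9a-zA-Z]|$)'
}

# Usage (exit status 0 means at least one unpinned reference was printed):
#   if find_unpinned_actions .github/workflows/*.yml; then
#     echo "::error::unpinned third-party action found"; exit 1
#   fi
```

Because the last grep decides the exit status, the function succeeds exactly when it has something to report, which makes it convenient to use directly in an `if`.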

9.2 build.yml: triggers, concurrency, build

Triggers — four entry points

on:
  push:
    branches: [main]
    paths-ignore:
      - ".github/workflows/build-disk.yml"
      - "README.md"
      - "CHANGELOG.md"
      - "docs/**"
  pull_request:
    types: [labeled, synchronize]
    branches: [main]
  schedule:
    # Weekly nightly: Sunday 04:00 UTC = 06:00 CEST. Picks up upstream
    # Bluefin DX changes even if there are no commits to this repo.
    - cron: "0 4 * * 0"
  workflow_dispatch:

margine-image/.github/workflows/build.yml:43-69 (trimmed)

  • push with paths-ignore — docs commits don't burn a 25-minute build.
  • schedule — the security-critical one. A bootc image is a frozen snapshot: if you only build on commit, your users stop receiving upstream CVE fixes (Fedora → Bluefin DX → you) the moment you stop committing. The cron rebuild re-pulls ghcr.io/ublue-os/bluefin-dx:stable and republishes even with zero repo changes. Margine runs weekly; ublue-org images do this daily.
  • pull_request only when labeled vm-test — guarded at the job level:
build_push:
  # On pull_request, only build for PRs explicitly labeled `vm-test`.
  # All other PRs (docs, CI tweaks, etc.) skip the 30-min image build.
  if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'vm-test')

margine-image/.github/workflows/build.yml:83-87

Labeled PRs publish a transient :pr-N tag so a lab VM can bootc switch to the PR image before merge.

Concurrency — cancel superseded builds

concurrency:
  group: ${{ github.workflow }}-${{ github.ref || github.run_id }}
  cancel-in-progress: true

margine-image/.github/workflows/build.yml:78-80

Two pushes to main in quick succession: the first 25-minute build is dead weight (its output would be immediately superseded), so it is cancelled. The || github.run_id fallback only takes effect when github.ref is empty; push, schedule, and workflow_dispatch runs all carry a ref, so runs on the same ref share one group and a newer run cancels an older one that is still in progress.

The build step — raw buildah, no wrapper action

sudo -E buildah build \
--file ./Containerfile \
--format docker \
--layers \
--secret id=mok-key,src=/tmp/margine-secrets/MOK.key \
--secret id=mok-cert,src=/tmp/margine-secrets/MOK.pem \
"${TAG_ARGS[@]}" \
"${LABEL_ARGS[@]}" \
.

margine-image/.github/workflows/build.yml:238-247

Margine dropped redhat-actions/buildah-build for a direct shell call (the pattern Bazzite uses): no Node-runtime deprecation warnings, no waiting on the action repo for fixes, and the exact same command works on a laptop. Two side effects are worth copying. First, sudo buildah writes to root storage (/var/lib/containers), which is where rechunk's podman create looks; the old rootless action needed an extra oci-archive bounce. Second, buildah's --secret mounts (the same syntax BuildKit uses) keep the MOK private key out of every layer (chapter 4). The secrets are staged to /tmp/margine-secrets from GitHub Actions secrets and wiped in an if: always() step.

One more reproducibility trick: build scripts fetch validators and branding from the spec repo, so the workflow resolves that ref to a commit SHA at build start and passes it in as --build-arg MARGINE_REF=<sha> (the Containerfile's ARG MARGINE_REF=main is consumed by the fetch scripts). It also stamps the SHA as an OCI label (place.the-empty.margine.spec-ref). The fetch therefore hits the exact pinned SHA, and any image can be traced back to (and rebuilt from) the exact spec-repo state that produced it.
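The resolve-then-stamp step might look roughly like the sketch below. It is an illustration under assumptions, not the workflow's actual code: the helper names and SPEC_REPO_URL are hypothetical, and only the MARGINE_REF build arg and the place.the-empty.margine.spec-ref label come from the text above.

```shell
# Sketch: turn `git ls-remote` output into a pinned SHA and the matching
# buildah arguments. SPEC_REPO_URL is a placeholder, not the real spec repo.
sha_from_ls_remote() {
  # ls-remote prints "<sha>\t<ref>"; keep the SHA of the first line only.
  awk 'NR == 1 { print $1 }'
}

ref_build_args() {
  # $1 = resolved commit SHA. Refuse anything that is not 40 hex chars,
  # so a failed lookup can never fall back to a floating branch name.
  case "$1" in
    *[!0-9a-f]*|"") echo "::error::spec ref did not resolve to a SHA" >&2; return 1 ;;
  esac
  [ "${#1}" -eq 40 ] || { echo "::error::spec ref is not a full SHA" >&2; return 1; }
  printf '%s\n' "--build-arg" "MARGINE_REF=$1" \
                "--label" "place.the-empty.margine.spec-ref=$1"
}

# Usage (needs network access, so shown as a comment):
#   sha=$(git ls-remote "$SPEC_REPO_URL" refs/heads/main | sha_from_ls_remote)
#   ref_build_args "$sha"
```

Rejecting a non-SHA value is the important part: if the lookup fails, the build should stop rather than quietly build from whatever main points at.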

9.3 Validators as gates inside the build

Static checks ("Layer A") run between buildah build and rechunk/push. The technique: create a container without running it, export the filesystem to a directory, assert against files.

- name: Validate first-boot assets in built image (blocks rechunk)
  run: |
    sudo podman container create --replace --name validate-fs \
      --entrypoint /bin/true \
      "localhost/${{ env.IMAGE_NAME }}:${{ steps.metadata.outputs.version }}"
    ROOTFS=$(mktemp -d)
    sudo podman export validate-fs | sudo tar -C "$ROOTFS" -xf -

margine-image/.github/workflows/build.yml:273-291 (trimmed)

Six sections, each born from a real first-boot regression observed on a fresh install (2026-06-06):

  • A.1: About-panel logo (LOGO=margine-logo in os-release + pixmaps present).
  • A.2: welcome icon is a valid GTK4 symbolic SVG (no embedded raster).
  • A.3: all 10 enabled-extensions UUIDs exist under /usr/share/gnome-shell/extensions/.
  • A.3.bis: dconf keyfiles in /etc/dconf/db/distro.d/.
  • A.4: first-boot autostart files.
  • A.4.bis: offline-docs mirror completeness (≥14 index.html, no live JS/CSS references).
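The export-and-assert technique reduces to a handful of small helpers run against the extracted directory. This is a sketch of the shape only: require_file, require_line, and the FAIL accumulator are hypothetical names, and the os-release path in the usage comment is an assumption about where the file sits in the image.

```shell
# Sketch of file-level gates against an exported rootfs. FAIL accumulates
# so every problem is reported in one run instead of stopping at the first.
FAIL=0

require_file() {
  # $1 = rootfs dir, $2 = path inside the image (must exist, non-empty)
  if [ ! -s "$1/$2" ]; then
    echo "::error::missing or empty in image: $2"
    FAIL=1
  fi
}

require_line() {
  # $1 = rootfs dir, $2 = path inside the image, $3 = extended regex
  if ! grep -qE "$3" "$1/$2" 2>/dev/null; then
    echo "::error::$2 has no line matching: $3"
    FAIL=1
  fi
}

# Usage:
#   require_line "$ROOTFS" usr/lib/os-release '^LOGO=margine-logo$'
#   [ "$FAIL" -eq 0 ] || exit 1
```

Collecting failures and exiting once at the end means a single red run lists every regression, which matters when each build attempt costs tens of minutes.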

The dconf checks include sentinel values — a grep for one representative key per keyfile, proving the file content (not just its existence) survived the build:

grep -qE "^border-radius=7" "$DCONF_DIR/02-margine-search-light" || { echo "::error::A.3.bis search-light border-radius!=7 — daniel default lost"; fail=1; }
# dash-to-dock background customisation present (cosmetic regression sentinel)
grep -qE "^running-indicator-style='DOTS'" "$DCONF_DIR/01-margine-dash-to-dock" || { echo "::error::A.3.bis dash-to-dock running-indicator-style sentinel missing"; fail=1; }

margine-image/.github/workflows/build.yml:388-390

Placement matters: the gate runs at ~22 minutes in, before SBOM/rechunk/push/sign, so a regression fails fast and nothing broken ever reaches the registry — not even :candidate.

Lesson: the sentinel that broke the build

  • Symptom: build run 27297409457 failed in the first-boot asset validator: A.3.bis search-light border-radius!=30 — daniel default lost. No file was missing; the keyfile was present and correct.
  • Root cause: the default itself had just been fixed. PR #94 discovered that search-light's border-radius is not pixels but an index into rads = [0,16,18,20,22,24,28,32] — the old 30.0 hit rads[30] = undefined and was silently ignored at runtime. The keyfile was corrected to 7.0 (= 32 px), but the CI sentinel still asserted the old literal 30. A sentinel is a duplicated constant: change the source of truth, and the copy in the gate becomes a tripwire.
  • Fix: same-day commit b4e8680 (ci(validator): search-light border-radius sentinel 30 -> 7) updated the assertion and inlined the rationale so the next editor updates both:
# search-light rounded-corners daniel default: border-radius=7.0
# (the value is an INDEX 0-7 into the extension's px table, not
# pixels — 7 = 32px max rounding; the old 30 was out of range and
# silently ignored. See #94.)
grep -qE "^border-radius=7" "$DCONF_DIR/02-margine-search-light" || ...

margine-image/.github/workflows/build.yml:384-388

Takeaway: sentinel gates are worth the duplication (they catch silent file truncation and staging-order bugs that existence checks miss), but treat the sentinel as part of the change — "update default" PRs must touch the validator in the same commit, or generate the assertion from the keyfile itself.
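Generating the assertion from the keyfile could be as small as the sketch below, which compares the value in the source tree with the value in the built image instead of holding a literal. The function names are hypothetical and the paths are supplied by the caller; this is one possible shape, not the check Margine runs.

```shell
# Sketch: derive the expected value from the keyfile in the source tree
# and compare it with the keyfile inside the built image, so the gate
# holds no literal of its own.
keyfile_value() {
  # $1 = dconf keyfile, $2 = key name; prints the first value found.
  sed -n "s/^$2=//p" "$1" | head -n 1
}

assert_key_survived() {
  # $1 = source keyfile, $2 = keyfile from the built rootfs, $3 = key
  want=$(keyfile_value "$1" "$3")
  got=$(keyfile_value "$2" "$3")
  if [ -z "$want" ]; then
    echo "::error::$3 is not set in the source keyfile $1"
    return 1
  fi
  if [ "$want" != "$got" ]; then
    echo "::error::$3: source has '$want', image has '$got'"
    return 1
  fi
}
```

This still catches truncation and staging-order bugs (the image value goes missing or differs), but a deliberate change to the default no longer trips the gate.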

Validators as the single source of truth, run in-container

The second half of that takeaway, generating the assertion from the source of truth, is what finally retires the sentinel-duplication problem. The grep sentinels were duplicated constants: the keyfile said one thing, the CI step asserted another, and they drifted. The fix is to stop duplicating the check and instead run the real validator against the built image, the same binary the OS ships:

- name: Run image validators (single source of truth)
  run: |
    for v in margine-validate-margine-system margine-validate-branding; do
      sudo podman run --rm -e MARGINE_VALIDATE_CONTEXT=image \
        "localhost/${IMAGE_NAME}:${VERSION}" "$v"
    done

margine-image/.github/workflows/build.yml (Layer A validator step)

MARGINE_VALIDATE_CONTEXT=image tells the validator it is inspecting a built rootfs rather than a running system (so it skips checks that need a live session). The decisive property is that one validator now runs in three places: here in CI (Layer A), inside the Layer C GUI probe (below), and on a user's machine via ujust margine-doctor (which iterates every /usr/bin/margine-validate-*). There is no second copy of the assertion to drift from the default; if the keyfile and the check disagree, it is one bug in one file.

9.4 Push to GHCR and the job split

After SBOM generation (podman export + syft dir: — the rechunked-image-from-registry path OOMs a 16 GB runner; chapter 10) and hhd-dev/rechunk (repacks layers for OSTree delta efficiency; chapter 3), the push captures the manifest digest:

for tag in ${{ steps.metadata.outputs.tags }}; do
sudo skopeo copy --retry-times 3 \
--dest-creds="${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" \
--digestfile=/tmp/digest.txt \
"${{ steps.rechunk.outputs.ref }}" \
"docker://${IMG_FULL}:${tag}"
...
done
echo "digest=$DIGEST" >> "$GITHUB_OUTPUT"
echo "image_ref=${IMG_FULL}@${DIGEST}" >> "$GITHUB_OUTPUT"

margine-image/.github/workflows/build.yml:483-498 (trimmed)
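The elided part of that step has to turn the file written by --digestfile into the image_ref output. A minimal sketch of that conversion, with a sanity check so a malformed digest never reaches the signer, might look like this (the function name is hypothetical, and the validation is an addition for illustration):

```shell
# Sketch: read the digest skopeo wrote with --digestfile and build the
# immutable reference that a later job can sign.
image_ref_from_digestfile() {
  # $1 = image name without tag (e.g. ghcr.io/OWNER/IMAGE), $2 = digest file
  digest=$(tr -d '[:space:]' < "$2")
  case "$digest" in
    sha256:*) ;;
    *) echo "::error::unexpected digest: '$digest'" >&2; return 1 ;;
  esac
  hex=${digest#sha256:}
  case "$hex" in
    *[!0-9a-f]*) echo "::error::digest is not hex" >&2; return 1 ;;
  esac
  [ "${#hex}" -eq 64 ] || { echo "::error::digest is not 64 hex chars" >&2; return 1; }
  printf '%s@%s\n' "$1" "$digest"
}

# Usage:
#   image_ref_from_digestfile "$IMG_FULL" /tmp/digest.txt
```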

The digest is a job output consumed by a separate sign job, which cosign-signs image@sha256:... by digest (tag-based signing is racy — the tag can move between push and sign). Why a separate job at all? Failure economics, documented in the header:

# On a failed sign step, `gh run rerun --failed <run-id>` re-runs
# only the sign job (~1 min) instead of redoing the whole build.
# That's the whole point of the split — failure cost dominates
# the few seconds of cross-job overhead.

margine-image/.github/workflows/build.yml:27-30

A final notify job (if: always()) aggregates both results into an ntfy push, with the partial-success case (image pushed, sign failed) spelled out explicitly including the exact gh run rerun --failed command to recover.

9.5 The QEMU smoke gate and :stable promotion

build.yml publishes to :candidate + :candidate.YYYYMMDD — never directly to the tag users track. Layer A checks files; every bug from the 2026-05-28/29 smoke tests (dracut/initramfs, systemd ordering cycle → emergency.target) was a runtime bug Layer A could not see. smoke-boot.yml is Layer B: actually boot the thing.

It auto-triggers on every successful build via workflow_run (guarded so cancelled/failed builds don't waste a runner), builds a qcow2 from the candidate with bootc-image-builder, and boots it under QEMU — GHA ubuntu-24.04 runners have had /dev/kvm since 2024, so boot to desktop is minutes, not hours:

sudo qemu-system-x86_64 \
-enable-kvm \
-m 4096 -smp 4 \
-machine q35 \
-drive if=pflash,format=raw,readonly=on,file="$OVMF_CODE" \
-drive if=pflash,format=raw,file=ovmf_vars.fd \
-drive file="$QCOW",format=qcow2,if=virtio \
-serial file:serial.log \
-display none \
-no-reboot \
&

margine-image/.github/workflows/smoke-boot.yml:152-163 (trimmed)

Design decisions encoded here: no LUKS in the qcow2 (automation can't type a passphrase; encrypted boot is exercised in the manual VM lab), and Secure Boot intentionally off (SB would need the MOK pre-enrolled in the OVMF VARS file; the kernel signature is already asserted at build time — Layer B's question is "does ostree+composefs+systemd reach a usable state", not "is the SB chain intact").

Pass/fail is a grep loop over the serial log. The naive marker broke in practice — systemd on Fedora 44 doesn't reliably print Reached target Multi-User System on serial — so the gate accepts any of three equivalent signals:

for i in $(seq 1 1200); do
  if [[ -f serial.log ]] && grep -qE "Started.*gdm\.service|Reached target graphical\.target|margine login:" serial.log; then
    echo "✓ Boot reached usable state at second $i"
    echo "passed=true" >> $GITHUB_OUTPUT
    ...

margine-image/.github/workflows/smoke-boot.yml:184-191

The budget is 20 minutes, not 30 seconds: first boot pulls 2–3 GB of Flatpaks via flatpak-preinstall.service plus ostree-finalize-staged. On failure the serial log is uploaded as an artifact, with a pre-digested triage dump (Reached target vs Failed to start, most-restarted units) printed in the job log.
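Pulled out of the workflow, the watcher is a poll loop with a budget and a triage dump on timeout. The sketch below reuses the marker regex quoted above; the function name and the parameterised budget and interval are assumptions made so that the loop can be exercised in seconds rather than minutes.

```shell
# Sketch of a serial-log watcher with a time budget. -a makes grep treat
# the log as text, since serial output can contain control bytes.
wait_for_boot() {
  # $1 = serial log, $2 = budget in polls, $3 = seconds between polls
  markers='Started.*gdm\.service|Reached target graphical\.target|margine login:'
  i=1
  while [ "$i" -le "$2" ]; do
    if [ -f "$1" ] && grep -aqE "$markers" "$1"; then
      echo "boot reached usable state at poll $i"
      return 0
    fi
    sleep "$3"
    i=$((i + 1))
  done
  # Budget exhausted: print a short triage view before failing.
  echo "::error::no boot marker after $2 polls"
  [ -f "$1" ] && grep -aE 'Reached target|Failed to start' "$1" | tail -n 20
  return 1
}

# Usage with the 20-minute budget from the text: wait_for_boot serial.log 1200 1
```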

Promotion: same bytes, new name

A "Resolve image ref to digest" step runs first and pins the candidate to an immutable digest once, and every later step (the qcow2 build, the boot, the promotion) consumes that one ${PINNED} value:

- name: Resolve image ref to digest
  id: ref
  run: |
    DIGEST="$(sudo skopeo inspect --no-tags --format '{{.Digest}}' "docker://$REF")"
    echo "pinned=${BASE}@${DIGEST}" >> "$GITHUB_OUTPUT"

- name: Promote candidate → stable (only if boot passed)
  if: success() && steps.boot.outputs.passed == 'true'
  run: |
    for promo_tag in stable "stable.${DATE_TAG}" "${DATE_TAG}"; do
      sudo skopeo copy --retry-times 3 --preserve-digests \
        "docker://${PINNED}" \
        "docker://${REGISTRY_IMAGE}:${promo_tag}"
    done

margine-image/.github/workflows/smoke-boot.yml (resolve + promote steps).

skopeo copy --preserve-digests is a registry-side tag move: no rebuild, no re-rechunk — the digest promoted to :stable is byte-identical to the manifest that just booted, and the cosign signature made by digest stays valid. :stable.YYYYMMDD and :YYYYMMDD give users pinnable rollback targets. Policy in one sentence: no image reaches :stable without having booted in QEMU.

This closed a previously-void gate. The earlier version resolved :candidate independently in the qcow2-build step and again in the promotion step — so if a new build finished mid-smoke, the gate booted one digest and skopeo copy promoted whatever :candidate pointed at by then (a different, never-tested digest). Resolving to ${PINNED} once makes "booted" and "promoted" provably the same bytes (code-quality review finding A1; the per-tag re-resolve was A2). A guard was also added so the three stable tags can't split across two digests: this is the only workflow that mutates :stable, so it carries concurrency: { group: smoke-boot, cancel-in-progress: false } — concurrent promotions queue instead of racing, and a run is never cancelled mid-skopeo copy.
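The pin-once, promote-many rule can be made hard to violate by having the promotion helper refuse anything that is not a digest reference. The following is a sketch under assumptions: promote_pinned and the DRY_RUN switch are hypothetical, while the skopeo flags and the three tag names are the ones shown in the promote step.

```shell
# Sketch: emit (or run) the promotion commands for one pinned digest.
# With DRY_RUN=1 (the default here) the commands are only printed.
promote_pinned() {
  # $1 = pinned ref (name@sha256:...), $2 = registry image, $3 = YYYYMMDD
  case "$1" in
    *@sha256:*) ;;
    *) echo "::error::refusing to promote a floating ref: $1" >&2; return 1 ;;
  esac
  for promo_tag in stable "stable.$3" "$3"; do
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "skopeo copy --retry-times 3 --preserve-digests docker://$1 docker://$2:$promo_tag"
    else
      skopeo copy --retry-times 3 --preserve-digests \
        "docker://$1" "docker://$2:$promo_tag" || return 1
    fi
  done
}
```

Every copy reads from the same `$1`, so there is no second lookup of :candidate for a newer build to slip into.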

Layer C: a GUI smoke probe

Layer B answers "did userspace come up" — it greps the serial log for gdm.service/graphical.target. But a GNOME session can reach graphical.target with a gnome-shell that immediately crashes on a bad extension: the login screen appears, the user's session never does. To catch that class, a third layer boots the qcow2 with a throwaway autologin user and a root oneshot that interrogates the live session, printing its verdict to the same serial console the watcher already reads:

# .github/smoke/gui-probe.sh (run as margine-gui-smoke.service in the VM)
pgrep -u smoke -x gnome-shell >/dev/null || fail "gnome-shell never started"
sleep 30   # let extensions load
EXT=$(runuser -u smoke -- gnome-extensions list --enabled | wc -l)
[[ "$EXT" -ge 6 ]] || fail "only $EXT extensions enabled (expected >=6)"
pgrep -u smoke -x gnome-shell >/dev/null || fail "gnome-shell died during the probe"
coredumpctl -q list 2>/dev/null | grep -q gnome-shell && fail "gnome-shell dumped core"
out "MARGINE-GUI-SMOKE: PASS ext=$EXT"

margine-image/.github/smoke/gui-probe.sh + margine-gui-smoke.service, injected offline into the qcow2 by .github/scripts/inject-gui-probe.sh (GDM autologin + the oneshot + a permissive-SELinux karg for this one boot). Injection runs continue-on-error so a failed injection can never block the Layer B gate, and the unit is After=graphical.target with its wants-symlink in graphical.target.wants (the first deployment hooked it into multi-user.target.wants, creating an ordering cycle that made systemd silently skip it — a "no verdict" non-result). The verdict is warn-only until two consecutive green runs prove it isn't flaky (both achieved 2026-06-13); then it becomes gating.

Lesson — "reached graphical.target" is not "the desktop works". Symptom: a candidate passed Layer B (login screen reached) but the autologin session showed a black screen; gnome-shell was respawning. Root cause: a crashing GNOME extension took down the shell after graphical.target was reached. Layer B's grep can't see past the target; it never logs into a session. Fix: Layer C logs in as a disposable user and checks the things a human would notice — shell alive, ≥6 extensions enabled, no gnome-shell coredump, no Clutter Bail out! in the journal — and prints MARGINE-GUI-SMOKE: PASS/FAIL to serial. Catching a crashing-extension regression that Layer B passes is exactly the gap it exists to close.
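On the watcher side, the important detail is treating "no verdict" as its own outcome rather than folding it into pass or fail. A minimal sketch of that classification follows; the function name is hypothetical, and the exact text after FAIL is assumed to follow the same prefix as the PASS line.

```shell
# Sketch: classify the Layer C outcome from the serial log. "none" is a
# distinct result so a skipped unit is never mistaken for a pass.
gui_smoke_verdict() {
  # $1 = serial log; prints PASS, FAIL, or none (last verdict line wins)
  line=$(grep -a 'MARGINE-GUI-SMOKE:' "$1" 2>/dev/null | tail -n 1)
  case "$line" in
    *'MARGINE-GUI-SMOKE: PASS'*) echo PASS ;;
    *'MARGINE-GUI-SMOKE: FAIL'*) echo FAIL ;;
    *) echo none ;;
  esac
}
```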

Layer C, part two: a soft user-smoke gate

Layer C (above) asks "is the session alive?". A second injected oneshot asks a sharper question: "is this Margine, or just some GNOME?". inject-gui-probe.sh now stages a second payload alongside the GUI probe — .github/smoke/user-smoke-probe.sh + margine-user-smoke.service — and the extra injection is guarded, so a missing payload only warns (the GUI probe still goes in; you never lose the whole gate to a renamed file).

# margine-image/.github/smoke/user-smoke-probe.sh (shape — every check WARN-only)
check KERNEL      "uname -r | grep -q cachyos"
check GDM         "systemctl is-active --quiet gdm && pgrep -u smoke -x gnome-shell"
check OTILING     "enabled-extensions contains o-tiling@oliwebd.github.com"
check SEARCHLIGHT "enabled-extensions does NOT contain search-light@icedman.github.com"
check KEYBINDS    "Hyprland-style binds present in the booted user dconf"
check GAMING      "ujust --list | grep -q margine-gaming"
check GSCHEMA     "gsettings get org.gnome.desktop.interface accent-color == 'yellow'"

The probe asserts Margine identity: the signed CachyOS kernel actually booted, the session is up, o-tiling is enabled, the Hyprland-style binds are present, search-light is gone, the gaming recipe shipped, and the zz1-margine gschema override took. But it never fails: it always exits 0 and writes MARGINE-USER-SMOKE: <CHECK> <PASS|WARN> lines to the same serial console, which a smoke-boot.yml step parses into $GITHUB_STEP_SUMMARY. A regression shows up as a table on the run, not a red X.

It is non-blocking three ways on purpose — if: always() on the parse step, continue-on-error: true, and a trailing || true. Promotion to :stable still keys solely on steps.boot.outputs.passed (Layer B). The identity probe is a dashboard, not a veto: it tells you "this still looks like Margine" without ever standing between a booting image and :stable.
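The parse step amounts to filtering the serial log for the probe's prefix and reshaping each line into a table row. The sketch below assumes the line format described above and a Markdown table as the summary format; user_smoke_table is a hypothetical name, and the unconditional `return 0` mirrors the warn-only design.

```shell
# Sketch: turn "MARGINE-USER-SMOKE: <CHECK> <PASS|WARN>" serial lines
# into a Markdown table. Always succeeds, even if the log is missing.
user_smoke_table() {
  # $1 = serial log
  echo '| Check | Result |'
  echo '| --- | --- |'
  grep -a 'MARGINE-USER-SMOKE:' "$1" 2>/dev/null \
    | tr -d '\r' \
    | awk '{ for (i = 1; i < NF; i++)
               if ($i == "MARGINE-USER-SMOKE:")
                 print "| " $(i+1) " | " $(i+2) " |" }'
  return 0
}

# Usage: user_smoke_table serial.log >> "$GITHUB_STEP_SUMMARY"
```

Searching for the prefix as a field, rather than anchoring at the start of the line, keeps the parser working if the console prepends timestamps.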

The wants-symlink lives in graphical.target.wants, deliberately not in multi-user.target.wants. Hooking an After=graphical.target unit into multi-user.target.wants re-creates the ordering-cycle skip bug described for the GUI probe above (systemd silently drops the unit, and you get a "no verdict" non-result that reads as success).

9.6 Disk images and ISOs: build-disk.yml

The OCI image updates installed systems; the ISO/qcow2 pipeline creates new ones. It is triggered manually (workflow_dispatch), plus PR runs on disk_config/ or live-env/ path changes: ISOs are ~5–9 GB, built per release event, not per push. The ISO is built by a separate Titanoboa job (§10.2); the BIB-driven build_disk job now produces only the smoke-gate qcow2 (the anaconda-iso matrix entry was removed in ADR-0008 Phase 5/7):

matrix:
  image: ["margine"]
  disk-type: ["qcow2"]

margine-image/.github/workflows/build-disk.yml (build_disk matrix).

Notables in the build_disk job (and the retired anaconda-iso path it once carried):

  • BIB pinned by digest (quay.io/centos-bootc/bootc-image-builder@sha256:7ae88…) and pre-pulled with 8-attempt exponential backoff, because quay.io 5xx brownouts otherwise surface as a single opaque failed pull inside the action.
  • rootfs: btrfs is mandatory: Bluefin DX doesn't set the containers.bootc.rootfs OCI label, so BIB errors with "DefaultRootFs missing" without it.
  • Installer-image pattern (Bazzite) (historical — only the retired anaconda-iso path used it): a transient margine-installer:run-<run_id> image was built first — base image + ~29 Flatpaks baked into /var/lib/flatpak — and that fed to BIB, so the kickstart only rsynced Flatpaks instead of downloading them in the installer environment (which OOM'd /tmp and failed silently; chapter 8). The build needed --cap-add sys_admin --security-opt label=disable because flatpak install uses bwrap user namespaces inside the container. The Titanoboa path keeps the same trick in live-env (§10.2).
  • GHCR garbage collection: each ISO run pushes a new run-scoped tag and GHCR keeps everything forever, so an always() step prunes the package via gh api, keeping the newest 3 versions.
  • Checksums with relative paths: SHA256SUMS is written with paths relative to the output dir, because absolute build-side paths broke sha256sum -c after the artifact was re-unpacked in the publish job at a different root (run #26789024483).
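Two of the items above, the pre-pull backoff and the relative-path checksums, are small enough to sketch. Both helpers are illustrations with hypothetical names: the text gives 8 attempts for the pre-pull but not the base delay, so the delay is a parameter here, and the checksum layout is simply "paths as seen from inside the output dir".

```shell
# Sketch 1: retry a command with exponential backoff.
retry_backoff() {
  # $1 = max attempts, $2 = first delay in seconds, rest = command
  max=$1; delay=$2; shift 2
  n=1
  while :; do
    "$@" && return 0
    [ "$n" -ge "$max" ] && { echo "::error::gave up after $n attempts: $*" >&2; return 1; }
    echo "attempt $n failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}

# Sketch 2: write SHA256SUMS with paths relative to the output dir, so
# `sha256sum -c` still works after the directory is unpacked elsewhere.
write_relative_sums() {
  # $1 = output dir; hashes every regular file except SHA256SUMS itself
  ( cd "$1" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + \
      | sort -k 2 > SHA256SUMS )
}

# Usage:
#   retry_backoff 8 5 sudo podman pull "$BIB_IMAGE"   # BIB_IMAGE is a placeholder
#   write_relative_sums "$OUTDIR"
```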

BTRFS loopback: buying disk with compression

The Titanoboa live-ISO job (ADR-0008, now the default ISO build) squashes a ~14 GB rootfs at zstd-19 while also holding the base image — past what remove-unwanted-software can free on /. The fix, mirrored from Bazzite's workflow, is to back podman's storage with a compressed BTRFS loopback on the runner's ~70 GB ephemeral /mnt SSD:

- name: Mount container storage on a BTRFS loopback
  run: |
    sudo truncate -s 80G /mnt/podman-storage.img
    sudo mkfs.btrfs -f /mnt/podman-storage.img
    sudo podman system reset --force || true
    sudo systemctl stop podman.service podman.socket 2>/dev/null || true
    sudo mount -o compress-force=zstd:2 /mnt/podman-storage.img /var/lib/containers/storage

margine-image/.github/workflows/build-disk.yml:371-383 (trimmed)

The file is sparse (truncate -s 80G on a 70 GB disk is fine until actually filled) and compress-force=zstd:2 makes OS payloads occupy roughly half their nominal size — an 80 G logical budget on the cheap.
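The sparse-file claim is easy to check on any Linux box without root. This sketch (hypothetical function name, GNU coreutils assumed, and a filesystem that supports sparse files) creates a small sparse file and reports its apparent size next to what is actually allocated:

```shell
# Sketch: show that a sparse file's apparent size and its allocated
# size differ. 64M keeps the demo cheap.
sparse_report() {
  # $1 = path to create; prints "<apparent_bytes> <allocated_bytes>"
  truncate -s 64M "$1" || return 1
  apparent=$(stat -c %s "$1")
  # %b = allocated blocks, %B = bytes per block as reported by stat
  allocated=$(( $(stat -c %b "$1") * $(stat -c %B "$1") ))
  echo "$apparent $allocated"
}

# Usage: sparse_report /tmp/demo.img
```

The same gap between the two numbers is what lets an 80G image file live on a smaller disk until the data written into it actually approaches the disk's capacity.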

9.7 Artifact egress pain → Internet Archive

GitHub will happily store a 9 GB ISO as a workflow artifact — and then serve it to a residential connection at ~1–1.5 MB/s (2–4 hours for 8 GB, per the header of publish-titanoboa-test-iso.yml). GHA artifacts are a job-to-job handoff mechanism, not a distribution channel. Margine's answer is the Internet Archive:

- name: Upload to Internet Archive (torrent-first distribution)
  # IA auto-generates a BitTorrent .torrent + magnet + 3 HTTP
  # mirrors for everything we upload, and seeds it forever. This
  # is the same pattern Bluefin/Bazzite use to avoid Cloudflare
  # TOS (no large binary content served from origin) and to keep
  # our home-server upload bandwidth free.
  run: |
    ia --config-file "$IA_CONFIG_FILE" --debug upload "$IDENTIFIER" \
      "$ARTIFACT" "$OUTDIR/SHA256SUMS" \
      --retries 5 \
      --sleep 60 \
      --metadata="mediatype:software" \
      --metadata="collection:opensource" \
      ...

margine-image/.github/workflows/build-disk.yml:558-611 (trimmed)

publish_ia is a separate job downstream of build_disk (artifact handoff over GHA's fast internal CAS) for the same rerun-economics reason as sign: IA's S3 ingest is the flaky, slow step — when it fails, gh run rerun --failed redoes the upload in minutes instead of the 15–17 min BIB build. Its timeout is 350 minutes, bumped from 180 after a real run was killed mid-upload of a 9 GB ISO. After upload, the job polls up to 25 minutes for IA's derive process to produce the .torrent, regenerates SHA256SUMS for IA's flat published layout (files are siblings at the item root, not under bootiso/), and emits a static index.html with torrent/HTTP/IA links.
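For the flat layout, the hashes themselves do not change; only the recorded paths do. The job is described as regenerating SHA256SUMS, so the sketch below is an alternative shown for illustration, not the workflow's method: it rewrites existing entries down to their basenames (flatten_sums is a hypothetical name).

```shell
# Sketch: rewrite SHA256SUMS entries so each path is reduced to its
# basename, matching a layout where files sit at the item root.
flatten_sums() {
  # stdin: "<hash>  <path>" lines; stdout: "<hash>  <basename>" lines
  awk '{ hash = $1
         sub(/^[^ ]+ +[*]?/, "")        # drop hash and optional "*" marker
         n = split($0, p, "/")
         print hash "  " p[n] }'
}

# Usage: flatten_sums < "$OUTDIR/SHA256SUMS" > SHA256SUMS.flat
```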

Two satellites complete the release loop:

  • bump_site: bumps a single LATEST_ISO_DATE constant in the website repo, which drives all four download URLs on the site. It originally opened (and tried to auto-squash-merge) a PR; it now pushes straight to main (§9.12). It uses a fine-grained PAT (SITE_BUMP_TOKEN) scoped to that one repo; if the token is absent, the job no-ops with a warning instead of failing the release.
  • publish-titanoboa-test-iso.yml: pushes throwaway validation ISOs to IA's test_collection, which auto-expires items after ~30 days — fast downloads for hardware testing, zero cleanup.

9.8 Alternatives & other distros

Build platform

  • GitHub Actions, hosted runners (Margine, Bluefin, Bazzite, Aurora, most ublue customs): zero ops, free for public repos, KVM available; pain is the 14 GiB disk (hence remove-unwanted-software / BTRFS loopback) and 6 h job cap.
  • ublue-os main-org patterns: reusable/callable workflows + large matrices (image × flavor × Fedora version), org-wide cosign keys, just recipes so CI == laptop; the right model once you maintain >3 images — Margine's single-image repo inlines everything instead.
  • Self-hosted runners: unlimited disk/CPU, cache persistence — at the cost of patching, runner-token security on public repos (PR code execution!), and your hypervisor becoming a dependency; Margine's PVE builder took the whole host down with it (ZFS spacemap corruption) and was retired.
  • GitLab CI (used by Fedora project infra and many corporates): built-in registry, DAG via needs:, but shared SaaS runners lack KVM — a QEMU smoke gate needs self-hosted runners, recreating the babysitting problem.
  • Distro-scale build systems: Fedora Koji/Pungi + OSBuild (Silverblue stock), openSUSE OBS (MicroOS/Aeon), NixOS Hydra — reproducible, multi-arch, audited; massive operational footprint, wrong size for a one-person distro.
  • Vanilla OS: Vib build recipes on GitHub Actions producing ABRoot OCI images — same GHA+GHCR shape, different image format.

Gating before release

  • Margine: file validators in-build + QEMU serial-grep smoke boot, promotion by skopeo copy --preserve-digests. Cheap, catches "does it boot".
  • ublue-os: bootc container lint + image-level checks; Bazzite adds a large community of :testing-channel users as the de-facto smoke test.
  • Fedora: openQA — full GUI-driven install/boot test matrix; the gold standard, and a service to run, not a workflow step.
  • NixOS: NixOS test framework (declarative QEMU VM tests in Nix, gating Hydra channels) — the most rigorous; requires buying into Nix wholesale.
  • ChimeraOS: GitHub Releases + staged update channels; users are the gate.

Tag/promotion models

  • candidate → tested → stable retag (Margine): one build, promotion is metadata. ublue equivalents: :testing/:latest/:gts channels (Bluefin), date-pinned tags everywhere.
  • Rebuild-per-channel (some templates): simpler workflows, but the stable artifact is not the tested artifact — avoid.
  • NixOS channels: an entire package-set generation advances atomically when Hydra tests pass; same philosophy, different granularity.

Heavy-artifact distribution

  • Internet Archive, torrent-first (Margine): free, permanent, auto-mirrored; ingest is slow and occasionally 503s (hence retries + 350-min timeout).
  • CDN / object storage (Bazzite, Bluefin ISO endpoints; Cloudflare R2 / B2): fast and branded; egress cost or TOS exposure for multi-GB binaries.
  • GitHub Releases (ChimeraOS, Vanilla OS): simple, 2 GiB-per-file limit forces split archives for full ISOs.
  • GHA artifacts: job handoff only — throttled egress makes them unusable as a download channel.

9.9 The /status freshness dashboard

The website's /status page answers one question — "is the Margine you'd install today current with upstream, or stale/broken?" — from a single JSON document the CI produces. build-status-json.sh emits a schemaVersion: 2 doc describing the whole Fedora → Bluefin → Margine chain: it reads skopeo inspect of both bluefin-dx:stable and margine:stable (version/date/digest/labels) and the latest meaningful run conclusion via gh api.

# margine-image/.github/scripts/build-status-json.sh (shape)
skopeo inspect docker://ghcr.io/ublue-os/bluefin-dx:stable   # version/date/digest/labels
skopeo inspect docker://ghcr.io/daniel-g-carrasco/margine:stable
gh api .../actions/runs --jq 'first conclusion in success|failure|timed_out'

Two subtleties make it honest rather than merely green:

  • Meaningful runs only. cancel-in-progress (§9.2) leaves a trail of cancelled/skipped runs; reporting the latest run verbatim would make the page read "Unknown" whenever one of those is newest. Instead, the script walks back to the latest run whose conclusion is success, failure, or timed_out and reports that.
  • A health() map normalises raw conclusions to the page's vocabulary. The Margine layer is unknown when the image can't be inspected at all (a registry blip must never let the page assert green), behind when its org.opencontainers.image.base.digest label ≠ the current bluefin-dx:stable digest, and failed on a failed/timed-out build or smoke.
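Both rules fit in a few lines. This is a sketch of the logic only, not the script's code: the function names and argument order are hypothetical, the "ok" status name is an assumption (the text names only unknown, behind, and failed), and so is the choice to report failed ahead of behind when both apply.

```shell
# Sketch: pick the latest meaningful conclusion, then map it to the
# page vocabulary.
latest_meaningful() {
  # stdin: run conclusions, newest first, one per line
  grep -m 1 -xE 'success|failure|timed_out'
}

health() {
  # $1 = run conclusion, $2 = image's base.digest label, $3 = upstream digest
  # Anything that could not be inspected is unknown, never green.
  if [ -z "$2" ] || [ -z "$3" ]; then echo unknown; return; fi
  case "$1" in
    failure|timed_out) echo failed; return ;;
  esac
  if [ "$2" != "$3" ]; then echo behind; return; fi
  case "$1" in
    success) echo ok ;;
    *) echo unknown ;;
  esac
}
```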

A guard aborts the producer if both skopeo inspects come back empty, so a transient registry outage can't overwrite the last-good document with an all-unknown one. publish-status-json.sh then pushes the JSON straight to the website repo's main (see §9.12 for why no PR), rebasing on a push race, preserving the curated kernel value already published, and skipping the commit when only the timestamp would change (no churn). status-json.yml runs the pair after every build/smoke/ISO (workflow_run), daily, and on demand.

To make the behind check possible, build.yml stamps the image with the Bluefin digest it was built from — best-effort, so a lookup failure never fails a build:

# margine-image/.github/workflows/build.yml (base-digest label step)
- name: Resolve base image digest (best-effort)
  continue-on-error: true
  run: |
    DIGEST="$(skopeo inspect --no-tags --format '{{.Digest}}' \
      docker://ghcr.io/ublue-os/bluefin-dx:stable)"
    echo "digest=$DIGEST" >> "$GITHUB_OUTPUT"
# → label org.opencontainers.image.base.digest=<digest>

9.10 GHCR retention: pruning the tag-move orphans

Every build and promotion moves :stable/:candidate (and their dated siblings) to a fresh digest. The old digest doesn't vanish: it becomes an untagged orphan version that GHCR keeps forever. ghcr-cleanup.yml (the SHA-pinned dataaxiom/ghcr-cleanup-action) prunes them:

# margine-image/.github/workflows/ghcr-cleanup.yml (trimmed)
with:
  keep-n-untagged: 3
  exclude-tags: "stable,latest,candidate,stable.*,candidate.*,pr-*,2*"
  validate: true
  dry-run: ${{ github.event_name == 'workflow_dispatch' && inputs.dry_run || false }}

exclude-tags covers the named, dated (2*), and pr-* tags so only genuine orphans are eligible; validate: true re-checks the manifest list before deletion. The daily cron does the real prune; manual workflow_dispatch defaults to dry-run so you can read the kill list before arming it. The first real run reaped ~315 orphaned versions.

The gotcha that bit us: in that action delete-untagged and keep-n-untagged are mutually exclusive — set both and it errors out before doing anything. Use keep-n-untagged (which retains a small rollback window of recent orphans) and drop delete-untagged.
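Before arming a prune, it can be worth sanity-checking which tag names the exclude list is meant to protect. The sketch below models the patterns as shell globs, which is an assumption about the action's matching rules rather than a statement of them; the function name is hypothetical.

```shell
# Sketch: would this tag be protected by the exclude list? Patterns are
# the ones from exclude-tags, interpreted here as shell globs.
is_protected_tag() {
  # $1 = tag name; returns 0 if any exclude pattern matches
  case "$1" in
    stable|latest|candidate|stable.*|candidate.*|pr-*|2*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: is_protected_tag "stable.20260613" && echo "kept"
```

Note how broad `2*` is under this reading: it protects every tag that starts with the digit 2, not just dated ones.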

9.11 Pin + ref automation

The supply-chain pins (§8) are kept honest by CI, not by human memory (o-tiling once sat at 2.8.8 right through the 2.8.11 GNOME-50 fix because nothing watched it):

  • o-tiling release pin. Renovate tracks the GitHub-release version through a customManager matching the OTILING_VERSION constant. Hosted Renovate can't hash a release zip, so a companion otiling-pin-sha.yml recomputes the sha256 on Renovate's own branch and commits it back — the bot opens the version bump, the workflow fills in the hash.
  • EGO + fork pins. check-upstream-pins.yml watches the EGO-hosted extension version_tag pins (hide-cursor, smile) and the Titanoboa fork, opening an issue when upstream moves.

Separately, validate-flatpak-refs.yml runs validate-flatpak-refs.sh, the pure-Flatpak analog to gaming-native's rpm depsolve dry-run:

# margine-image/.github/scripts/validate-flatpak-refs.sh (shape)
# parse every app ID out of the recipes' `flatpak install` lines …
for id in "${IDS[@]}"; do
  curl -fsS "https://flathub.org/api/v2/appstream/$id" >/dev/null \
    || fail "$id no longer on Flathub (renamed/delisted?)"
done

It checks every Flatpak the recipes install — the AI layer's com.jeffser.Alpaca (+ its Plugins.AMD ROCm extension) and the gaming set — against the Flathub API on recipe PRs and weekly, so a renamed or delisted app is caught in CI instead of at the user's ujust margine-ai / margine-gaming, where it would fail at install time.
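The parsing half, which the excerpt above elides, could look like the following sketch. It assumes a simple recipe format (each `flatpak install` command on one line, options starting with "-", an optional remote name with no dots) and would miss IDs on continuation lines; flatpak_ids is a hypothetical name.

```shell
# Sketch: pull application IDs out of `flatpak install` lines.
flatpak_ids() {
  # stdin: recipe text; stdout: unique reverse-DNS IDs, sorted
  grep -E 'flatpak[[:space:]]+install' \
    | tr -s '[:space:]' '\n' \
    | grep -E '^[A-Za-z][A-Za-z0-9_-]*(\.[A-Za-z0-9_-]+){2,}$' \
    | sort -u
}

# Usage: IDS=$(flatpak_ids < path/to/recipe.just)
```

Requiring at least three dot-separated components is what separates application IDs from remote names and flags; it is a heuristic, so an unusual argument with two dots in it would also be picked up.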

9.12 Cross-repo bumps that actually land

The website repo is private on a free plan: no branch protection, and "Allow auto-merge" is OFF. That collided with the original bump-site-iso-date.sh, which after each IA ISO publish opened a PR and ran gh pr merge --auto. With auto-merge disabled that command errors — so the one-line date bump sat as an open PR every release while the live site kept advertising the previous ISO. The failure surfaced only as a ::warning:: on an otherwise-green job, so it went unnoticed for several releases.

The fix: stop round-tripping a PR nothing can merge. Commit the one-line bump and push straight to main with a rebase-retry, and exit 1 (red job) on real failure so it can't fail silently again.

# margine-image/.github/scripts/bump-site-iso-date.sh (shape)
sed -i "s/LATEST_ISO_DATE = .*/LATEST_ISO_DATE = \"$NEW_DATE\";/" "$SITE_INDEX"
git commit -aqm "chore(release): bump LATEST_ISO_DATE to $NEW_DATE"
for attempt in 1 2 3; do
  git push origin HEAD:main && exit 0
  git fetch origin main && git rebase origin/main || git rebase --abort
done
echo "::error::could not push the site bump"; exit 1

publish-status-json.sh (§9.9) reuses the same direct-push pattern.

Lesson — match the merge mechanism to the repo. For a private/free repo with no branch protection and no auto-merge, a deterministic bot bump should push to main, not open a PR that nothing on the plan can merge. A PR is for review you'll actually do; a date bump is neither reviewed nor mergeable here, so the PR is pure latency that silently rots — and a ::warning:: on a green job is invisible. Make the genuine failure path red.