Handbook · Chapter 11 of 12 · 14 min read
Shipping and day-2 operations
A bootc distro has two delivery products: the OCI image (the thing installed systems track daily) and the install media (the thing new users download once). They have different bandwidth profiles, different trust models, and different failure modes — Margine ships them through two different channels: GHCR for the image, Internet Archive for the ISO. Day-2 is everything after: upgrade orchestration, staged deployments, rollback, /etc drift.
11.1 GHCR tag strategy
Margine publishes exactly one image name with a small, rigid tag grammar:
| Tag | Written by | Meaning |
|---|---|---|
| :candidate | build.yml on every main push / weekly cron | Built, statically validated, not yet boot-tested |
| :candidate.YYYYMMDD | build.yml | Dated candidate, for forensics |
| :pr-N | build.yml on PRs labeled vm-test | Transient; rebase a lab VM onto it, GC'd later |
| :stable | smoke-boot.yml promotion step | The only tag clients track |
| :stable.YYYYMMDD, :YYYYMMDD | smoke-boot.yml | Dated stable aliases for pinning/rollback by date |
There is deliberately no :latest. :latest conflates "most recently built" with "recommended"; on an OS image those must differ, because the recommendation gate (a real boot) runs after the build. The tags are emitted by docker/metadata-action:
# margine-image/.github/workflows/build.yml
tags: |
  type=raw,value=${{ env.CANDIDATE_TAG }},enable=${{ github.event_name != 'pull_request' }}
  type=raw,value=${{ env.CANDIDATE_TAG }}.{{date 'YYYYMMDD'}},enable=${{ github.event_name != 'pull_request' }}
  type=ref,event=pr,prefix=pr-,enable=${{ github.event_name == 'pull_request' }}
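The tag grammar above is small enough to check mechanically. The following is a minimal sketch, not part of the Margine repositories: the kind names and the `classify_tag` helper are hypothetical, and the patterns are transcribed from the table on the assumption that dates are always eight digits.

```python
import re
from typing import Optional

# Patterns transcribed from the tag table. The kind names on the left are
# labels invented for this sketch, not identifiers from the workflows.
TAG_GRAMMAR = [
    ("candidate", re.compile(r"candidate")),
    ("candidate-dated", re.compile(r"candidate\.\d{8}")),
    ("pr", re.compile(r"pr-\d+")),
    ("stable", re.compile(r"stable")),
    ("stable-dated", re.compile(r"(?:stable\.)?\d{8}")),
]


def classify_tag(tag: str) -> Optional[str]:
    """Return the grammar kind for a tag, or None if the tag is outside the grammar."""
    for kind, pattern in TAG_GRAMMAR:
        if pattern.fullmatch(tag):
            return kind
    return None
```

A check like this could run as a registry audit: any tag for which `classify_tag` returns None (for example `latest`) would be a grammar violation.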
Pushes capture the manifest digest once and reuse it — all tags point at the same manifest, and cosign signs by digest, never by tag (tag-based signing is racy: the tag can move between sign and verify):
# margine-image/.github/workflows/build.yml — "Push rechunked image to GHCR"
for tag in ${{ steps.metadata.outputs.tags }}; do
  sudo skopeo copy --retry-times 3 \
    --dest-creds="${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" \
    --digestfile=/tmp/digest.txt \
    "${{ steps.rechunk.outputs.ref }}" \
    "docker://${IMG_FULL}:${tag}"
  if [[ -z "$DIGEST" ]]; then
    DIGEST="$(cat /tmp/digest.txt)"
  fi
done
echo "image_ref=${IMG_FULL}@${DIGEST}" >> "$GITHUB_OUTPUT"
--digestfile is the load-bearing flag: the sign job receives ${IMG_FULL}@sha256:... and never resolves a tag.
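One way to make the "never resolve a tag" rule enforceable is a guard that rejects anything that is not a digest reference before signing. This is a hedged sketch: `require_digest_ref` is a hypothetical helper, and it assumes sha256 digests written as 64 lowercase hex characters.

```python
import re

# name@sha256:<64 hex chars>; anything else (including name:tag) is rejected.
_DIGEST_REF = re.compile(r"(?P<name>[^@\s]+)@sha256:(?P<hex>[0-9a-f]{64})")


def require_digest_ref(image_ref: str) -> str:
    """Return the digest part of image_ref, or raise if the ref is tag-based."""
    match = _DIGEST_REF.fullmatch(image_ref)
    if match is None:
        raise ValueError(f"refusing to sign a mutable reference: {image_ref!r}")
    return "sha256:" + match.group("hex")
```

A tag such as `:stable` fails the guard by construction, so the sign-then-move race described above cannot be introduced by accident.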
Promotion: digest-preserving copy, gated on a real boot
smoke-boot.yml resolves :candidate to an immutable digest once (the "Resolve image ref to digest" step → ${PINNED}), boots that qcow2 in QEMU (chapter 9), and only then promotes the exact digest it booted:
# margine-image/.github/workflows/smoke-boot.yml — "Promote candidate → stable"
- name: Promote candidate → stable (only if boot passed)
  if: success() && steps.boot.outputs.passed == 'true'
  run: |
    DATE_TAG="$(date -u +%Y%m%d)"
    for promo_tag in stable "stable.${DATE_TAG}" "${DATE_TAG}"; do
      sudo skopeo copy --retry-times 3 --preserve-digests \
        "docker://${PINNED}" \
        "docker://${REGISTRY_IMAGE}:${promo_tag}"
    done
--preserve-digests guarantees :stable is bit-identical to the manifest that actually booted: no rebuild, no re-rechunk between test and release. The promotion is a registry-side pointer move. Pinning to ${PINNED} (rather than re-resolving the moving :candidate tag) closed a window in which the gate was void: a build finishing mid-smoke could move :candidate, and the new, never-booted digest would have been promoted. A concurrency: group: smoke-boot (queue, don't cancel) keeps two promotions from splitting the stable tags across digests.
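The race that pinning closes can be modelled with a dictionary standing in for the registry. This is an illustrative in-memory sketch only: the digests are placeholders and `promote` is a hypothetical stand-in for the skopeo copy loop.

```python
from typing import Dict, List


def promote(registry: Dict[str, str], pinned_digest: str, promo_tags: List[str]) -> None:
    """Point every promotion tag at the digest that was resolved before the boot test."""
    for tag in promo_tags:
        registry[tag] = pinned_digest


registry = {"candidate": "sha256:aaa"}   # placeholder digests
pinned = registry["candidate"]           # resolve once, before the boot test

# The boot test runs against `pinned`. Meanwhile a new build lands:
registry["candidate"] = "sha256:bbb"

# Promotion uses the pinned digest, so the never-booted sha256:bbb is not released.
promote(registry, pinned, ["stable", "stable.20260608", "20260608"])
```

Had the promotion step re-resolved `registry["candidate"]` at this point, all three stable tags would point at a digest that never booted.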
Digest pins on the client
A client can freeze on a known-good build with either form:
sudo bootc switch ghcr.io/daniel-g-carrasco/margine:stable.20260608 # dated alias
sudo bootc switch ghcr.io/daniel-g-carrasco/margine@sha256:<digest> # hard pin
A hard digest pin disables bootc upgrade progress by definition (the ref never changes); dated aliases are the practical middle ground. Margine also ships a client-side watchdog so a silent pipeline failure doesn't leave users unknowingly frozen:
# margine-image/build_files/system_files/usr/libexec/margine-staleness-check
r = run(["skopeo", "inspect", "--no-tags", f"docker://{image_ref}"])
created = json.loads(r.stdout)["Created"] # ISO 8601
# (excerpt: the step that parses `created` into the epoch value created_ts is omitted here)
age_days = (time.time() - created_ts) / 86400
if age_days < WARN_AGE_DAYS: # 7 days, critical at 14
    sys.exit(0)
A user systemd timer (every 12 h, installed via /etc/skel) runs skopeo inspect against the booted image ref and raises a desktop notification when :stable is older than 7 days — "either the build pipeline is broken, or upstream has genuinely paused; either way the user should know."
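The age computation behind the watchdog can be sketched without touching the network. This is not the shipped script: `parse_created` and `staleness` are hypothetical names, the three-level result is an assumption built on the "7 days, critical at 14" comment, and the parser assumes an RFC 3339 style timestamp whose fractional seconds may be longer than microseconds.

```python
import re
from datetime import datetime, timedelta, timezone

WARN_AGE_DAYS = 7    # thresholds taken from the excerpt's comment
CRIT_AGE_DAYS = 14

# Date-time, optional fractional seconds of any length, then Z or a +HH:MM offset.
_TS = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})")


def parse_created(created: str) -> datetime:
    """Parse a timestamp into an aware datetime, dropping sub-second precision."""
    match = _TS.fullmatch(created)
    if match is None:
        raise ValueError(f"unrecognised timestamp: {created!r}")
    base, zone = match.groups()
    naive = datetime.strptime(base, "%Y-%m-%dT%H:%M:%S")
    if zone == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if zone[0] == "+" else -1
        offset = sign * timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
    return naive.replace(tzinfo=timezone(offset))


def staleness(created: str, now: datetime) -> str:
    """Classify image age as 'ok', 'warn' or 'critical'."""
    age_days = (now - parse_created(created)).total_seconds() / 86400
    if age_days < WARN_AGE_DAYS:
        return "ok"
    if age_days < CRIT_AGE_DAYS:
        return "warn"
    return "critical"
```

Passing `now` explicitly keeps the classification testable; a real timer-driven script would pass `datetime.now(timezone.utc)`.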
11.2 ISO distribution: torrent-first via Internet Archive
The ISO is ~5-9 GB and the origin server is a home box behind Cloudflare Free. The distribution model (from margine-fedora-atomic/docs/19-iso-distribution.md):
build-disk.yml
├──▶ Internet Archive (`ia upload`)
│ ↓ IA derives torrent + 3 HTTP mirrors, seeds forever
└──▶ rsync to edge VM (files.the-empty.place)
↓ index.html (magnet + IA mirror links), SHA256SUMS, 7-day .iso fallback
Rationale, condensed: Cloudflare Free TOS discourages serving large binaries from the proxy; an ADSL-class uplink dies under one concurrent ISO download; and the home server should not be a single point of failure for past releases. IA solves all three: it auto-derives a .torrent + magnet + HTTP mirrors for every upload and hosts them indefinitely, while the origin serves only HTML and checksums.
The upload runs in a separate job (publish_ia) downstream of build_disk, connected by a GHA artifact. This split exists purely for rerun isolation: when the IA upload fails (it does — S3 ingest 503s under load), gh run rerun --failed <run-id> redoes only the upload, not the 15-17 min bootc-image-builder run.
# margine-image/.github/workflows/build-disk.yml — publish_ia
ia --config-file "$IA_CONFIG_FILE" --debug upload "$IDENTIFIER" \
"$ARTIFACT" "$OUTDIR/SHA256SUMS" \
--retries 5 \
--sleep 60 \
--metadata="mediatype:software" \
--metadata="collection:opensource" \
--metadata="title:${TITLE}" \
...
Hard-won ia 5.x CLI facts encoded in the workflow comments: ia upload --verbose does not exist (run #26787945599 failed on it); top-level -l is a flag, not -l info (run #26789968571 — argparse ate info as the positional command); --debug is the actual progress knob. --retries 5 --sleep 60 because IA's S3 endpoint 503s routinely on multi-GB multipart uploads. After upload, a 25-minute poll loop waits for *_archive.torrent to appear (ia list "$IDENTIFIER" | grep '_archive\.torrent$') before generating the index page, degrading to HTTP-only links with a warning if derive is slow.
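The "poll, then degrade" behaviour can be expressed as a bounded wait with an injectable clock. This is a sketch under assumptions: `wait_for` is a hypothetical helper, the predicate would wrap whatever check detects the torrent file, and the 60-second interval in the usage note is invented for illustration.

```python
import time
from typing import Callable


def wait_for(
    predicate: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or the deadline would be exceeded."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() + interval_s > deadline:
            return False   # caller degrades (e.g. HTTP-only links plus a warning)
        sleep(interval_s)
```

With a 25-minute budget the call would look like `wait_for(torrent_present, 25 * 60, 60)`, where `torrent_present` is a caller-supplied function; a False result selects the degraded index page instead of failing the job.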
Lesson: SHA256SUMS paths must match the published layout. Symptom: run #26789024483's publish_ia failed sha256sum -c SHA256SUMS on the downloaded artifact. Root cause: the build side wrote SHA256SUMS with build-side paths (bootiso/install.iso); after artifact transit and on IA, where the file is served at the item root, those paths resolve nowhere. Two layouts, one checksum file. Fix: generate relative to the artifact dir at build time, then regenerate for the published layout before upload:
# build-disk.yml: "Locate ISO + verify integrity"
(cd "$OUTDIR" && sha256sum -c SHA256SUMS) # verify artifact transit
( cd "$(dirname "$ARTIFACT")" && sha256sum "$BASE" ) > "$OUTDIR/SHA256SUMS" # rewrite: basename only
End-user UX contract: download install.iso + SHA256SUMS as siblings from IA, run sha256sum -c SHA256SUMS, done.
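The two-layout problem is easy to reproduce in a scratch directory. The following sketch uses hypothetical helpers (`sha256_of`, `write_sums`, `verify_sums`) and a fake ISO payload; it mimics the text-mode line format that sha256sum -c reads (digest, two spaces, path) rather than calling the real tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so multi-GB ISOs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_sums(artifact: Path, sums_path: Path) -> None:
    """Write a checksum line that names the artifact by basename only."""
    sums_path.write_text(f"{sha256_of(artifact)}  {artifact.name}\n")


def verify_sums(sums_path: Path) -> bool:
    """Resolve each listed path relative to the checksum file, like a sibling download."""
    for line in sums_path.read_text().splitlines():
        expected, name = line.split("  ", 1)
        target = sums_path.parent / name
        if not target.is_file() or sha256_of(target) != expected:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    build_iso = Path(tmp) / "bootiso" / "install.iso"
    build_iso.parent.mkdir()
    build_iso.write_bytes(b"not a real ISO")

    published = Path(tmp) / "published"          # flat layout, as served at the item root
    published.mkdir()
    (published / "install.iso").write_bytes(build_iso.read_bytes())

    write_sums(build_iso, published / "SHA256SUMS")
    ok_basename = verify_sums(published / "SHA256SUMS")

    # The failing variant: same digest, but a build-side path in the checksum file.
    (published / "SHA256SUMS").write_text(f"{sha256_of(build_iso)}  bootiso/install.iso\n")
    ok_build_path = verify_sums(published / "SHA256SUMS")
```

The digest is identical in both variants; only the recorded path decides whether verification succeeds in the published layout.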
Lesson: size your timeouts to the slow third party, not your build. Symptom: run #27166954601's publish_ia was cancelled at the 180-minute job cap mid-upload. Root cause: IA's S3 ingest for a ~9 GB ISO ran past 3 h during a degraded window. Fix: timeout-minutes: 350 (GHA hard cap is 6 h) and rely on the job split for retries. The comment in the file documents the incident inline; workflows are the changelog.
11.3 The website pipeline is part of the product
The download page is not hand-maintained. The site (margine-os-1084ca72, served at margine.the-empty.place) hardcodes four release URLs (IA details page, .torrent, direct HTTP, SHA256SUMS) derived from a single constant LATEST_ISO_DATE in src/routes/index.tsx. After publish_ia succeeds, a bump_site job in the same workflow opens — and auto-merges — a PR against the site repo:
# margine-image/.github/workflows/build-disk.yml — bump_site
sed -i "s|LATEST_ISO_DATE = \"$OLD_DATE\"|LATEST_ISO_DATE = \"$NEW_DATE\"|" src/routes/index.tsx
...
gh pr create --repo daniel-g-carrasco/margine-os-1084ca72 \
--base main --head "$BRANCH" \
--title "chore(release): bump LATEST_ISO_DATE to ${NEW_DATE}" ...
gh pr merge ... --squash --auto --delete-branch \
|| echo "::warning::bump PR auto-merge failed — falls back to manual squash-merge"
A webhook deploy picks the merge up in ~2-3 minutes; the maintainer does nothing per-release. The cross-repo write uses a fine-grained PAT (SITE_BUMP_TOKEN, Contents+PR write scoped to the site repo only) and the job no-ops with a warning when the secret is absent instead of failing the release. Idempotency guards: skip if the constant already equals today's date; skip if a PR for the same date branch is already open.
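The "skip if the constant already equals today's date" guard can be sketched as a pure function. This is an illustration, not the workflow's code (which uses sed): `bump_date` is a hypothetical name, and the sketch assumes the constant is written as a double-quoted string literal exactly as in the sed pattern.

```python
import re
from typing import Tuple

# Matches the assignment as it appears in the sed expression; group 1 is the old value.
_CONST = re.compile(r'LATEST_ISO_DATE = "([^"]*)"')


def bump_date(source: str, new_date: str) -> Tuple[str, bool]:
    """Return (new_source, changed). A no-op when the constant is already current."""
    match = _CONST.search(source)
    if match is None:
        raise ValueError("LATEST_ISO_DATE constant not found")
    if match.group(1) == new_date:
        return source, False
    return source[: match.start(1)] + new_date + source[match.end(1):], True
```

Returning a `changed` flag lets the caller skip branch creation and the PR entirely on a re-run, and raising on a missing constant avoids opening an empty PR if the site source is refactored.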
One UX detail preserved in the comments: the Hero used to expose a magnet:? button composed from the torrent's btih, retired 2026-06-07 because Fragments (the preinstalled torrent client) rejected valid magnets with arbitrary tracker lists — the button now links to the .torrent file, which LATEST_ISO_TORRENT derives from the same date constant. Release automation shrank to a single-variable bump.
11.4 Client side: bootc upgrade + uupd orchestration
Margine maintains no update orchestrator of its own. The declaration is explicit:
# margine-fedora-atomic/declarations/margine-atomic.yaml
updates:
orchestrator: bluefin-uupd
system:
engine: bootc
transport: ostree-image-signed
image_ref: ghcr.io/daniel-g-carrasco/margine:stable
require_reboot_judgment: true
Bluefin DX ships uupd (Universal Updater) with uupd.timer enabled; Margine inherits the unit unchanged. Per docs/01-architecture.md, one daily pass orders:
- bootc upgrade (or rpm-ostree upgrade on layered installs);
- flatpak update (system + user);
- brew update && brew upgrade if Homebrew is present;
- distrobox upgrade --all;
- reboot indication via notify-send when a new deployment is staged.
Host image, Flatpaks, brew, and distrobox containers move in one pass with one failure surface — the practical reason to prefer an orchestrator over N independent timers. The history matters here: Margine's earlier scripts/update-all (an rpm-ostree-first orchestrator with pre/post validators) and its Topgrade accessory profile were retired when the project moved onto Bluefin (ADR 0004 superseded by 0005). The Topgrade config survives as documentation of the boundary it enforced:
# margine-fedora-atomic/config/topgrade.toml
[misc]
disable = [ "system", "firmware" ]
[linux]
rpm_ostree = false
bootc = false
Even when a generic updater can drive the base OS, don't let it: a base update stages a kernel, interacts with Secure Boot and rollback, and deserves a tool that understands deployments. The validators (validate-atomic-layout, validate-cachyos-kernel, ...) are deliberately on-demand health checks, not update hooks — they never block or gate uupd.
Context-awareness corollary: environments where updating is wrong must opt out. The live ISO disables the whole update surface at build time:
# margine-image/live-env/src/build.sh — units that must not run in a live session
for unit in \
    rpm-ostree-countme.service rpm-ostreed-automatic.timer bootloader-update.service \
    flatpak-preinstall.service brew-setup.service brew-upgrade.timer brew-update.timer \
    uupd.timer ublue-system-setup.service tailscaled.service; do
  if systemctl list-unit-files "$unit" >/dev/null 2>&1; then
    systemctl disable "$unit"
  fi
done
Defensive list-unit-files guard (Bazzite pattern): a renamed upstream unit never fails the ISO build.
Staged deployment and reboot
bootc upgrade pulls the new manifest, checks out a new deployment under /ostree/deploy/, and stages it. Nothing visible changes until reboot; BLS bootloader entries are written by ostree-finalize-staged.service during shutdown, not at stage time. This is observable and worth teaching, because it confuses everyone once — validate-staged-deployment distinguishes the two states explicitly:
# margine-fedora-atomic/scripts/validate-staged-deployment
if [[ "$IS_STAGED" == "true" ]]; then
info "Deployment is STAGED (bootc switch flow). BLS entries are not"
info "rewritten now — ostree-finalize-staged.service does that at the"
info "next shutdown, so GRUB sees the new entry on the boot AFTER."
ok "BLS entry update is correctly deferred (this is normal)"
else
# rpm-ostree rebase / non-staged path: entry must already exist.
BLS_ENTRY=$("${SUDO[@]}" grep -lr "${STAGED_HASH:0:32}" /boot/loader/entries/ ...)
rpm-ostree rebase writes BLS entries immediately ("pending"); bootc switch/upgrade defers them ("staged"). A validator that asserts "entry must exist" fails spuriously on the bootc path unless it knows the difference.
After the reboot, a login-time oneshot compares the booted digest against the last recorded one and tells the user what just happened (/usr/libexec/margine-upgrade-notify, wired via /etc/skel/.config/systemd/user/): "Now running: stable.20260608 / Digest: sha256:ab12...". Reboots that silently apply OS updates erode trust; a one-line toast fixes that.
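The compare-and-record step behind that toast might look like the sketch below. It is not the shipped margine-upgrade-notify: the function name, the state-file location, and the choice to stay silent on the very first run are all assumptions.

```python
from pathlib import Path
from typing import Optional


def upgrade_message(state_file: Path, booted_digest: str, booted_tag: str) -> Optional[str]:
    """Return toast text if the booted digest changed since the last login, else None."""
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(booted_digest + "\n")   # record for the next login
    if previous is None or previous == booted_digest:
        return None                               # first run or unchanged: stay quiet
    # "sha256:" plus the first four hex characters, as in the example toast.
    return f"Now running: {booted_tag} / Digest: {booted_digest[:11]}..."
```

Writing the state before returning keeps the oneshot idempotent: a second login on the same deployment produces no notification.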
11.5 Rollback, pinning, /etc merge and drift
Rollback. bootc keeps the previous deployment on disk. sudo bootc rollback flips the boot order so the previous deployment boots next; the GRUB menu offers the same choice interactively. Because /etc is per-deployment and /var is shared, rolling back reverts OS content and restores the previous deployment's copy of /etc, but leaves user data in /var untouched; /etc edits made since the upgrade stay with the newer deployment.
Pinning. Deployments are garbage-collected as new ones land (only current + previous are kept). Before a risky change, pin:
# margine-fedora-atomic/docs/02-install-lab.md — before the CachyOS kernel experiment
rpm-ostree status
sudo ostree admin pin 0
rpm-ostree status
A pinned deployment survives any number of upgrades as a boot-menu fallback — this is the documented prerequisite in Margine's lab runbook before kernel swaps. Unpin with ostree admin pin --unpin <index>.
/etc merge rules. On every deployment, ostree performs a 3-way merge of /etc: the factory defaults shipped in the image (/usr/etc), the previous defaults, and your current /etc. Files you never touched track the image; files you modified keep your version, even when the image's default changes underneath. Audit the drift with:
sudo ostree admin config-diff # M = locally modified vs factory, A = locally added
That "local wins forever" rule is the main day-2 footgun: a stale local edit can mask an upstream fix indefinitely. Margine's mitigation is structural — ship configuration in /usr (gschema overrides, dconf db, systemd units in /usr/lib) and keep /etc for machine-local state, so the merge has nothing contentious to do.
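The merge rule reduces to one comparison per file. The sketch below models it for a single path, with None meaning "file absent"; `merge_etc_file` is a hypothetical name and this is a simplification of what ostree does (it ignores permissions, ownership and directories).

```python
from typing import Optional


def merge_etc_file(
    old_default: Optional[str],
    new_default: Optional[str],
    local: Optional[str],
) -> Optional[str]:
    """Content of one /etc file in the new deployment (None means absent)."""
    if local == old_default:
        return new_default   # never touched locally: track the image
    return local             # modified, added or deleted locally: local wins
```

The second branch is the footgun in miniature: once `local` differs from `old_default`, no future `new_default` is ever consulted for that file.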
Lesson: your packaging pipeline can eat /etc. Symptom: fresh-VM rebase boots with a 1-entry /etc/passwd; Layer A in CI had verified 65 entries at the end of buildah (Bug 6 v2, 2026-05-31). Root cause: rechunk re-commits the image as an ostree-canonical tree and strips /etc/passwd + /etc/group from /usr/etc, so the factory side of the 3-way merge is empty on first deploy. The image you tested in CI is not byte-for-byte the tree the client checks out. Fix: a boot-time idempotent seed that merges /usr/lib/passwd (the systemd factory copy, which rechunk does preserve) into /etc, gated on the stripped state:
# margine-image/build_files/system_files/usr/libexec/margine-seed-etc-passwd
# Runs only if /etc/passwd has fewer than 20 entries (the post-rebase stripped state).
factory = by_name(load(f"/usr/lib/{kind}"))
merged = dict(factory); merged.update(local) # local entries win
os.replace(tmp, f"/etc/{kind}")
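The merge at the heart of that seed can be exercised on in-memory lines. This is a hedged reimplementation for illustration: `by_name` here is a guess at what the script's helper of the same name does (key each entry by its first colon-separated field), and `seed` is a hypothetical wrapper.

```python
from typing import Dict, Iterable, List


def by_name(lines: Iterable[str]) -> Dict[str, str]:
    """Index passwd/group style lines by their first field, skipping blanks and comments."""
    entries: Dict[str, str] = {}
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            entries[line.split(":", 1)[0]] = line
    return entries


def seed(factory_lines: Iterable[str], local_lines: Iterable[str]) -> List[str]:
    """Factory entries fill the gaps; an entry that exists locally always wins."""
    merged = dict(by_name(factory_lines))
    merged.update(by_name(local_lines))
    return list(merged.values())
```

Because the merge is keyed by name and local entries overwrite factory ones, running it a second time on its own output changes nothing, which is what makes the seed safe to gate loosely.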
Lesson: early-boot units and ordering cycles. Symptom: boot hangs and times out into emergency.target (incident 2026-06-01). Root cause: the seed unit declared After=local-fs.target, creating a cycle through systemd-tmpfiles-setup-dev.service; systemd broke the cycle by dropping tmpfiles-setup-dev, so /dev/disk/by-uuid/* never populated and mounts timed out. Fix: order against the minimum you need; /usr is part of the immutable commit and available from the start:
# /usr/lib/systemd/system/margine-seed-etc-passwd.service
DefaultDependencies=no
After=local-fs-pre.target
Before=systemd-sysusers.service systemd-tmpfiles-setup.service sysinit.target
ConditionFileNotEmpty=/usr/lib/passwd
11.6 The rebase path from Bluefin DX
Margine's recommended install today is not the ISO — it is a rebase from a vanilla Bluefin DX install:
# margine-image/README.md — Option A
rpm-ostree rebase ostree-image-signed:docker://ghcr.io/daniel-g-carrasco/margine:stable
systemctl reboot
The ostree-image-signed: transport (vs plain ostree-image: / ostree-unverified-registry:) makes rpm-ostree verify the container signature against the policy in /etc/containers/policy.json before checkout — this is where the cosign key published at margine-image/cosign.pub plugs into the client trust chain (the 2026-06-05 audit lists all three verification paths: signed-transport rebase, direct cosign verify --key cosign.pub, and sha256sum -c on the ISO). After the first Margine boot: one MOK-enrollment reboot (mok-enroll.service submits the import; MokManager confirms it — chapter on Secure Boot), then ujust margine-bootstrap for user-state.
The rebase is also where validate-staged-deployment earns its keep: run after the rebase, before the reboot, it inspects the staged tree from the still-working current OS — OS identity actually says Margine, initramfs exists at the bootc-canonical path and is >50 MB (not host-only), and:
# margine-fedora-atomic/scripts/validate-staged-deployment
# THE check that motivated Bug 5: ostree-prepare-root must be inside
# the initramfs, otherwise switch-root cannot pivot /sysroot ...
if grep -q 'usr/lib/ostree/ostree-prepare-root' "$LS_OUT"; then
ok "initramfs contains ostree-prepare-root (--add ostree fix applied)"
else
bad "initramfs MISSING ostree-prepare-root — this WILL panic at switch-root"
fi
Every check in that script encodes a defect that previously landed a VM in a dracut emergency shell, where copy-pasting diagnostics doesn't work. Catching them pre-reboot — in a terminal that has a clipboard — is the entire design. On failure: rpm-ostree rollback abandons the staged deployment without ever booting it.
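Two of the pre-reboot checks described above (initramfs size and the presence of ostree-prepare-root) can be sketched as a pure function over data the caller has already collected. The real validator is a shell script; `check_initramfs` is a hypothetical name, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from typing import List

MIN_INITRAMFS_BYTES = 50 * 1024 * 1024   # assumption: "50 MB" read as 50 MiB
PREPARE_ROOT = "usr/lib/ostree/ostree-prepare-root"


def check_initramfs(listing: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems: List[str] = []
    if size_bytes <= MIN_INITRAMFS_BYTES:
        problems.append("initramfs is not larger than 50 MB (looks host-only)")
    if PREPARE_ROOT not in listing:
        problems.append("initramfs is missing ostree-prepare-root")
    return problems
```

Collecting every problem instead of stopping at the first mirrors the point of the validator: one pass in a working terminal should surface everything that would otherwise be found one emergency shell at a time.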
11.7 Alternatives & other distros
Image tags / channels
- Universal Blue (Bluefin/Bazzite/Aurora): :stable + :latest + :stable-daily + versioned :gts / :41 tags, multi-arch manifests; richer grammar, more surface to test. Margine: candidate→stable promotion only.
- Fedora bootc / CoreOS: stream refs (stable, testing, next) with automated promotion windows; same idea as candidate/stable, calendar-driven instead of boot-test-driven.
- SteamOS: OTA channels (stable / beta / preview) over an A/B partition scheme, not OCI; channel switch in the UI; rollback = boot the other slot. ChimeraOS does the same with frzr deploying read-only btrfs subvolume images.
- Hard digest pinning in fleets: bootc + a GitOps repo that bumps @sha256: refs (the bootc-fleet pattern; also what Kubernetes folks do with policy-controller); maximal reproducibility, you own the cadence.
Install media distribution
- Universal Blue: ISOs on a CDN bucket (R2/S3) with SHA256 checksums on the download page — simpler, costs money at scale.
- Fedora: mirror network + torrents via fedoraproject mirrormanager — heavyweight, needs an org. Margine's IA approach is the zero-infra approximation (IA seeds the torrent, keeps every release forever).
- Vanilla OS, NixOS: ISO on GitHub Releases — free, capped at 2 GiB per file, which a Flatpak-baked ISO blows through.
Update orchestration
- uupd (Bluefin/Bazzite/Aurora, inherited by Margine): Go rewrite superseding ublue-update, which was a Topgrade wrapper; the history Margine recapitulated in miniature (custom update-all + topgrade.toml → deleted in favor of uupd).
- Plain rpm-ostree upgrade + rpm-ostreed-automatic.timer (stock Silverblue/Kinoite): base OS only; Flatpaks update via GNOME Software; two cadences, no distrobox/brew coverage.
- bootc-fetch-apply-updates.timer (Fedora bootc minimal): fetch, apply, auto-reboot; right for servers/appliances, wrong for desktops.
- openSUSE MicroOS/Aeon: transactional-update.timer + health-checker, btrfs snapshot per update, rollback via snapper; equivalent guarantees, filesystem-level instead of image-level.
- Vanilla OS ABRoot: A/B root partitions, update applied to the inactive slot; simple mental model, 2× root disk cost.
- NixOS: nixos-rebuild switch --upgrade + generations in the bootloader; config-driven rather than image-driven; rollback selects a generation.
- Fleet management: Fleek (Nix-based home/host config sync) or plain Ansible/FluxCD bumping bootc refs; at enterprise scale, RH's image mode + Insights. Margine's fleet is one person, so the "fleet tooling" is ntfy pushes + the staleness watchdog.
Rollback / pinning
- bootc/ostree (Margine): previous deployment + ostree admin pin; O(1) disk via hardlinks.
- MicroOS: snapper rollback across N snapshots; finer-grained history, btrfs-only.
- SteamOS/ChimeraOS: A/B slots — exactly one fallback, zero knobs.
- NixOS: arbitrary generations until GC'd — best history, biggest disk bill.