Handbook · Chapter 1 of 12

The atomic, image-based OS model

Margine is not "a Fedora with packages preinstalled". It is an OCI container image that boots. The running system is a read-only checkout of that image; updating means downloading the next image and rebooting into it; a broken update means rebooting into the previous one. This chapter explains the machinery underneath — ostree, deployments, the bootc transport, the three-zone filesystem contract — and why this model was chosen over the half-dozen other ways to build an atomic distro.

The whole product fits in one sentence from the top of the image repo:

# ----- Base: Bluefin DX (Fedora 44 track, "stable" tag) -----
FROM ghcr.io/ublue-os/bluefin-dx:stable

/var/home/daniel/dev/margine-image/Containerfile (line 32)

A distro is a FROM line plus deltas. Everything else in this handbook is about making those deltas correct, signed, and bootable.

1.1 Mutable vs image-based

A traditional package-managed system (dnf, pacman, apt) mutates the live root filesystem in place. Consequences:

  • every machine is a unique snowflake: install order, partial upgrades, leftover config;
  • a failed mid-transaction upgrade leaves the system in an undefined state;
  • "rollback" means restoring from backup or downgrade gymnastics;
  • you cannot test "the OS" in CI, because there is no single artifact that is the OS.

The image-based model inverts this. The OS is built once, centrally, as an immutable artifact. Machines deploy that artifact and never modify it. State that must vary per machine is confined to explicitly writable zones. The practical payoffs:

  • Atomicity: an update either fully applies or doesn't exist. There is no half-upgraded state — the new deployment is assembled completely on disk before the bootloader ever points at it.
  • Rollback: the previous deployment is kept; one boot-menu entry (or bootc rollback) returns to it byte-for-byte.
  • Testability: Margine's CI boots the exact artifact in QEMU before tagging it :stable (chapter on CI). The bytes a user pulls are the bytes that passed the boot test.
  • Fleet identity: every machine on the same digest runs the same /usr. Bug reports become reproducible.

1.2 ostree: a content-addressed object store for filesystems

ostree is "git for operating system binaries". The on-disk layout under /ostree:

  • Objects (/ostree/repo/objects/): a content-addressed store in which every file is stored once under its checksum, like git blobs.
  • Commits: a commit is a complete filesystem tree (metadata + dirtree objects pointing into the object store), identified by a checksum.
  • Deployments (/ostree/deploy/<stateroot>/deploy/<commit>.<serial>/): hardlink checkouts of a commit. Files are hardlinks into the object store, so ten deployments of nearly-identical trees cost roughly one tree of disk.
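The dedup claim can be made concrete with a toy model. The sketch below is not ostree's real object format; the two-character fan-out directory, the function names, and the sample trees are assumptions made for illustration. It only shows the principle: content is stored once per checksum, and each deployment is a set of hardlinks into that store.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def store_object(repo: Path, data: bytes) -> Path:
    """Store data once under its SHA-256 checksum and return the object path."""
    digest = hashlib.sha256(data).hexdigest()
    obj = repo / digest[:2] / digest[2:]
    if not obj.exists():
        obj.parent.mkdir(parents=True, exist_ok=True)
        obj.write_bytes(data)
    return obj


def checkout(repo: Path, tree: dict[str, bytes], dest: Path) -> None:
    """Materialize a tree as hardlinks into the object store."""
    for rel_path, data in tree.items():
        target = dest / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        os.link(store_object(repo, data), target)


def demo() -> tuple[int, int]:
    """Deploy two nearly identical trees.

    Returns (number of stored objects, link count of the shared file).
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        repo = root / "repo" / "objects"
        # Hypothetical trees: only usr/bin/tool differs between versions.
        v1 = {"usr/bin/tool": b"binary-v1", "usr/share/doc": b"docs"}
        v2 = {"usr/bin/tool": b"binary-v2", "usr/share/doc": b"docs"}
        checkout(repo, v1, root / "deploy" / "v1.0")
        checkout(repo, v2, root / "deploy" / "v2.0")
        objects = [p for p in repo.rglob("*") if p.is_file()]
        shared = root / "deploy" / "v1.0" / "usr/share/doc"
        return len(objects), os.stat(shared).st_nlink


if __name__ == "__main__":
    print(demo())
```

Four deployed files end up as three stored objects, and the unchanged file is a single inode referenced from the store and from both deployments.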

At boot, the initramfs ostree module (more on why that matters in the kernel chapter) reads the ostree= karg, mounts the chosen deployment as /, exposes the real disk root at /sysroot, and mounts the OS content read-only. On Fedora 39+ this is fronted by composefs: instead of trusting the hardlink farm directly, an erofs+overlay view is constructed over the object store, which makes the root tamper-evident and removes the "someone ran chattr -i and edited a hardlinked object" hole. A side effect that trips up validators: /usr no longer has its own mountpoint. Margine's layout validator handles exactly this:

# On Silverblue with composefs (Fedora 39+), /usr is embedded in the root
# overlay and has no separate mountpoint. This is expected and correct.
if findmnt /usr >/dev/null 2>&1; then
  ...
else
  root_fstype_inner=$(mount_field FSTYPE /)
  if [[ "$root_fstype_inner" == "overlay" ]]; then
    ok "/usr is embedded in the composefs root overlay (expected on Silverblue)"

/home/daniel/dev/margine-fedora-atomic/scripts/validate-atomic-layout (lines 113-123)

Practical effect: do not write health checks that assert findmnt /usr — on a composefs system / is an overlay and /usr is inside it.
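A health check that tolerates both layouts could look like the sketch below. It is an assumption-laden illustration, not Margine's validator: the function names are hypothetical, the sample mount table is made up, and the acceptance rule (a separate /usr mount, or an overlay root) simply mirrors the shell excerpt above. It parses the documented /proc/self/mountinfo format, where the filesystem type follows the "-" separator field.

```python
def parse_mountinfo(text: str) -> dict[str, str]:
    """Map mount point -> filesystem type from /proc/self/mountinfo content."""
    mounts: dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue  # not a well-formed mountinfo line
        sep = fields.index("-")
        mount_point = fields[4]
        fstype = fields[sep + 1]
        mounts[mount_point] = fstype  # later mounts shadow earlier ones
    return mounts


def usr_layout_ok(mounts: dict[str, str]) -> bool:
    """Accept a separate /usr mount, or /usr embedded in an overlay root."""
    if "/usr" in mounts:
        return True
    return mounts.get("/") == "overlay"


# Made-up sample in mountinfo format, for illustration only.
SAMPLE = """\
30 1 0:27 / / ro,relatime shared:1 - overlay overlay ro,seclabel
31 30 0:5 / /proc rw,nosuid - proc proc rw
32 30 259:3 / /sysroot ro,relatime - btrfs /dev/nvme0n1p3 rw
"""

if __name__ == "__main__":
    with open("/proc/self/mountinfo") as fh:
        print("layout ok:", usr_layout_ok(parse_mountinfo(fh.read())))
```

The point of the design is that the absence of a /usr mount is only a failure when the root is not an overlay.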

1.3 Deployments, staged updates, rollback

A machine keeps multiple deployments (booted, rollback, optionally pinned via ostree admin pin). The update lifecycle:

  1. Fetch: bootc upgrade (or rpm-ostree upgrade) pulls the new image/commit. The live system is untouched.
  2. Stage: the new deployment is checked out under /ostree/deploy/..., its /etc is produced by the 3-way merge (§1.4), and it is marked staged. ostree-finalize-staged.service writes the bootloader entry at clean shutdown — the very last moment, so a crash mid-update leaves the old bootloader config intact.
  3. Reboot: the bootloader's default entry is the new deployment. The old one remains as the second menu entry.
  4. Rollback: bootc rollback swaps the boot order back; or pick the older entry in GRUB by hand. Nothing is rebuilt — the old tree never left the object store.
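The four steps above can be modelled as a small state machine. This is a toy sketch, not how ostree stores its state: the Host class, its fields, and the digests are hypothetical, and the handling of a crash (the staged deployment is simply dropped) is an assumption of the model. What it does capture is that staging never touches the boot order and that only a clean shutdown finalizes it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    """Toy model of a machine: boot_order[0] is the bootloader default."""

    boot_order: list[str]
    booted: str
    staged: Optional[str] = None

    def stage(self, digest: str) -> None:
        # Fetch + stage: the deployment exists on disk, boot order untouched.
        self.staged = digest

    def shutdown(self, clean: bool) -> None:
        # Finalization happens only on a clean shutdown.
        if clean and self.staged is not None:
            self.boot_order = [self.staged, self.booted]
        # Assumption of this model: an unfinalized staged deployment is
        # dropped after a crash; the old boot order is left intact.
        self.staged = None

    def boot(self) -> None:
        self.booted = self.boot_order[0]

    def rollback(self) -> None:
        # Swap the entries; nothing is rebuilt.
        self.boot_order.reverse()


def demo() -> list[str]:
    """Crash mid-update, then a clean update, then a rollback."""
    host = Host(boot_order=["sha256:old"], booted="sha256:old")
    seen = []
    host.stage("sha256:new")
    host.shutdown(clean=False)
    host.boot()
    seen.append(host.booted)
    host.stage("sha256:new")
    host.shutdown(clean=True)
    host.boot()
    seen.append(host.booted)
    host.rollback()
    host.boot()
    seen.append(host.booted)
    return seen


if __name__ == "__main__":
    print(demo())
```

The crash case boots the old deployment, the clean case boots the new one, and the rollback returns to the old one without any rebuild step.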

Two asymmetries to internalize: /etc rolls back with the deployment (each deployment carries its own merged /etc), but /var never rolls back — treat /var schema changes like a database during a blue/green deploy, compatible in both directions. And the staged vs pending distinction looks like a bug the first time you meet it: after bootc switch, ls /boot/loader/entries/ shows nothing new. Margine's pre-reboot validator documents why:

# Distinguish "staged" (bootc switch — finalized by
# ostree-finalize-staged.service at shutdown, BLS entries appear THEN)
# from "pending" (rpm-ostree rebase — BLS entries written immediately).
...
if [[ "$IS_STAGED" == "true" ]]; then
  info "Deployment is STAGED (bootc switch flow). BLS entries are not"
  info "rewritten now — ostree-finalize-staged.service does that at the"
  info "next shutdown, so GRUB sees the new entry on the boot AFTER."
  ok "BLS entry update is correctly deferred (this is normal)"

/home/daniel/dev/margine-fedora-atomic/scripts/validate-staged-deployment (lines 80-83, 237-241)

Because a staged deployment is inert until reboot, it can be audited from the running system: that same validator locates the checkout under /ostree/deploy/*/deploy/<hash>.* and inspects its os-release identity, initramfs contents, kernel signature, and bootloader wiring — every defect that would otherwise greet you in a dracut emergency shell is caught while you still have a working terminal to debug from.

The deployment a machine is running is fully described by bootc status --json. Margine uses this to tell the user what their reboot actually did:

import json
import subprocess
import sys

# Ask bootc for the host status; exit quietly if it is unavailable.
r = subprocess.run(["bootc", "status", "--json"], capture_output=True, text=True, timeout=15)
if r.returncode != 0:
    sys.exit(0)
booted = json.loads(r.stdout)["status"]["booted"]
digest = booted["image"].get("imageDigest", "?")
version = booted["image"].get("version", "?")

/var/home/daniel/dev/margine-image/build_files/system_files/usr/libexec/margine-upgrade-notify

The booted OS is identified by an OCI digest — the same identifier CI signed and smoke-booted. That one-to-one mapping between "what runs on the laptop" and "what passed the pipeline" is the core operational win of the model.
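The same JSON can describe every slot, not just the booted one. The sketch below is an illustration rather than Margine's notifier: it assumes the status object carries staged and rollback entries next to booted, with the same image.version and image.imageDigest fields used above, and that an empty slot is null. The helper names and the sample values are hypothetical.

```python
import json
from typing import Any, Optional


def describe(slot: Optional[dict[str, Any]]) -> str:
    """Render one deployment slot as 'version @ digest', tolerating gaps."""
    if not slot or not slot.get("image"):
        return "none"
    image = slot["image"]
    return f'{image.get("version", "?")} @ {image.get("imageDigest", "?")}'


def summarize(status_json: str) -> dict[str, str]:
    """Summarize the booted, staged and rollback slots of a status document."""
    status = json.loads(status_json)["status"]
    return {name: describe(status.get(name)) for name in ("booted", "staged", "rollback")}


# Made-up status document, trimmed to the fields this sketch reads.
SAMPLE = json.dumps({
    "status": {
        "booted": {"image": {"version": "44.20260601", "imageDigest": "sha256:aaaa"}},
        "staged": None,
        "rollback": {"image": {"version": "44.20260528", "imageDigest": "sha256:bbbb"}},
    }
})

if __name__ == "__main__":
    for name, text in summarize(SAMPLE).items():
        print(f"{name:9} {text}")
```

Feeding it the stdout of bootc status --json instead of SAMPLE should give a three-line summary of what the next reboot and the next rollback would boot, provided the assumed keys are present.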

Rollback is the user-side safety net; Margine adds a distro-side one: builds publish to :candidate, and a build is promoted to :stable via skopeo copy --preserve-digests only after its QEMU boot reaches multi-user (details in the CI chapter). Per the 2026-06-01 lessons-learned: ":stable no longer means 'the last build that compiled'; it means 'the last build that booted to a usable state inside QEMU'".

1.4 The three-zone filesystem contract

The whole model rests on a strict split of the filesystem, documented in Margine's architecture doc:

Path       | Role
/          | deployment root
/usr       | operating system content, read-only in normal operation
/etc       | writable host configuration with ostree merge behavior
/var       | writable persistent local state
/home      | symlink to /var/home
/opt       | symlink to /var/opt
/usr/local | symlink to /var/usrlocal

(table from /home/daniel/dev/margine-fedora-atomic/docs/01-architecture.md)
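The table above can be read as a lookup: resolve the symlinked paths first, then match the zone. The sketch below encodes that reading; the zone labels ("image", "config", "state") and the function names are assumptions for illustration, not terms from the architecture doc.

```python
from pathlib import PurePosixPath

# Symlinked paths and where they really live (from the table).
ALIASES = {"/home": "/var/home", "/opt": "/var/opt", "/usr/local": "/var/usrlocal"}

# Hypothetical short labels for the three zones.
ZONES = {"/usr": "image", "/etc": "config", "/var": "state"}


def _under(path: PurePosixPath, root: str) -> bool:
    """True if path is root itself or lives below it."""
    base = PurePosixPath(root)
    return path == base or base in path.parents


def zone_for(path: str) -> str:
    """Classify an absolute path into one of the three zones."""
    p = PurePosixPath(path)
    # Resolve aliases first, so /usr/local lands in /var rather than /usr.
    for alias, real in ALIASES.items():
        if _under(p, alias):
            p = PurePosixPath(real) / p.relative_to(alias)
            break
    for root, zone in ZONES.items():
        if _under(p, root):
            return zone
    return "deployment root"


if __name__ == "__main__":
    for sample in ("/usr/bin/bash", "/usr/local/bin/tool", "/home/user/.bashrc", "/etc/fstab"):
        print(sample, "->", zone_for(sample))
```

The alias step is the part that is easy to get wrong by hand: /usr/local and /opt look like image content but are machine state.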

/usr — image-owned, read-only

Everything the distro ships lives in /usr and is immutable at runtime. The build-time corollary: all customization in this handbook — systemd units, GNOME extensions, branding, tuned profiles — is written into /usr during the container build, never at runtime on the machine.

Lesson: legacy units assume a remountable root (Bug 8)

Symptom: every boot, on Margine and stock Bluefin DX, systemctl --failed shows systemd-remount-fs.service failed: mount: /: fsconfig() failed: overlay: No changes allowed in reconfigure.

Root cause: the unit is a pre-atomic relic (remount / rw per fstab after fsck). On a composefs root, / is an overlay the kernel refuses to reconfigure, and it is already rw via the upper layer; the unit is useless noise here.

Fix: mask it at build time so a clean boot has zero failed units, turning any future systemctl --failed output into a real signal:

ln -sf /dev/null /etc/systemd/system/systemd-remount-fs.service

/home/daniel/dev/margine-fedora-atomic/docs/lessons-learned/2026-05-28-initramfs-and-bootc-labels.md (Bug 8; applied in build_files/60-ujust-services/install.sh)

/etc — the 3-way merge

/etc is writable, but it is not simply "persisted". Each image ships a factory copy at /usr/etc. On every deployment, ostree computes the new /etc as a 3-way merge:

  • new factory defaults (/usr/etc of the new image), plus
  • the local diff (current /etc minus the previous image's /usr/etc).

Files the admin never touched track new image defaults; files the admin modified keep the local version (file granularity — no intra-file merging). ostree admin config-diff lists the local delta. Design consequence for image builders: defaults you want to be upgradeable belong in /usr (e.g. /usr/lib/systemd/system, dconf db under /etc/dconf/db compiled from /usr-shipped keyfiles), and /etc content baked into the image should be minimal, because it becomes "factory" state subject to merge semantics.
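The merge rule is easier to see on a toy example. The sketch below models each /etc as a dictionary of path to content and applies the file-granular rule described above. It is an illustration of the semantics, not ostree's implementation; the function name and the sample files are hypothetical.

```python
from typing import Optional

Tree = dict[str, bytes]  # path -> file content


def merge_etc(old_factory: Tree, new_factory: Tree, current: Tree) -> Tree:
    """File-granular 3-way merge: new factory defaults plus the local diff."""
    merged = dict(new_factory)
    for path in set(old_factory) | set(current):
        before: Optional[bytes] = old_factory.get(path)
        now: Optional[bytes] = current.get(path)
        if before == now:
            continue  # untouched locally: track the new image default
        if now is None:
            merged.pop(path, None)  # deleted locally: stays deleted
        else:
            merged[path] = now  # added or modified locally: local copy wins
    return merged


if __name__ == "__main__":
    old = {"chrony.conf": b"v1", "motd": b"hello", "issue": b"a"}
    new = {"chrony.conf": b"v2", "motd": b"hello2", "issue": b"b"}
    # The admin edited motd, deleted issue, and added hostname.
    cur = {"chrony.conf": b"v1", "motd": b"custom", "hostname": b"laptop"}
    print(merge_etc(old, new, cur))
```

In the example, chrony.conf follows the image to v2 because nobody touched it, while motd keeps the local edit and never sees the new default. That second case is why /etc content baked into the image should stay minimal.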

Lesson: /etc/passwd vanished after rebase (Bug 6)

Symptom: CI validation confirmed 65 entries in the image's /etc/passwd; a fresh VM rebased to the image had 1. System users (gdm, polkitd, ...) gone, services failing.

Root cause: the rechunk step (§1.5) re-commits the image into ostree-canonical form and strips /etc/passwd and /etc/group from /usr/etc, so the factory side of the 3-way merge has nothing to merge.

Fix: a boot-time idempotent seed from the /usr/lib factory copies, shipped as a sysinit.target oneshot:

# Workaround: ship a systemd oneshot that re-applies the seed at
# every boot, before sysinit. Idempotent (only seeds if /etc/passwd
# is below the entry threshold). Doesn't depend on rechunk preserving
# /etc — it doesn't need to.

/var/home/daniel/dev/margine-image/build_files/system_files/usr/lib/systemd/system/margine-seed-etc-passwd.service; merge logic in /usr/libexec/margine-seed-etc-passwd

Lesson: early-boot unit ordering deadlocked the boot (incident 2026-06-01)

Symptom: fresh VM stalled into emergency.target; journal showed local-fs-pre.target: Found ordering cycle and every /dev/disk/by-uuid/* device timing out.

Root cause: the passwd-seed unit declared After=local-fs.target and Before=systemd-sysusers.service. local-fs.target transitively depends on systemd-tmpfiles-setup-dev.service, which sits in the same chain: a closed loop. systemd broke the cycle by disabling tmpfiles-setup-dev, so /dev/disk/by-uuid symlinks never appeared.

Fix: in an ostree system /etc and /usr are part of the deployment and exist before any local-fs unit, so local-fs-pre.target is sufficient:

 DefaultDependencies=no
-Before=sysinit.target systemd-sysusers.service systemd-tmpfiles-setup.service
-After=local-fs.target
+Before=systemd-sysusers.service systemd-tmpfiles-setup.service sysinit.target
+After=local-fs-pre.target

/home/daniel/dev/margine-fedora-atomic/docs/lessons-learned/2026-06-01-systemd-ordering-cycle-and-rechunk-storage.md

Follow-up hardening: CI now runs SYSTEMD_OFFLINE=1 systemd-analyze verify default.target inside every image before push — this bug class is statically detectable.
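"Statically detectable" means the ordering graph can be searched for cycles before anything boots. The sketch below runs a depth-first search over a hand-written edge list. The edges are a simplified assumption taken only from the chain described in this lesson; real unit files carry many more dependencies, so this illustrates the bug class and does not replace systemd-analyze verify.

```python
def find_cycle(after: dict[str, set[str]]) -> list[str]:
    """Return one ordering cycle as a list of units, or [] if there is none.

    after[u] is the set of units that u is ordered after.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color: dict[str, int] = {}
    stack: list[str] = []

    def visit(unit: str) -> list[str]:
        color[unit] = GREY
        stack.append(unit)
        for dep in sorted(after.get(unit, ())):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is on the current path: the path from dep closes a loop.
                return stack[stack.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[unit] = BLACK
        return []

    for unit in sorted(after):
        if color.get(unit, WHITE) == WHITE:
            cycle = visit(unit)
            if cycle:
                return cycle
    return []


SEED = "margine-seed-etc-passwd.service"

# Simplified edges, assumed from the lesson text (not read from unit files).
BROKEN = {
    SEED: {"local-fs.target"},
    "systemd-sysusers.service": {SEED},  # from the seed's Before=
    "systemd-tmpfiles-setup-dev.service": {"systemd-sysusers.service"},
    "local-fs.target": {"systemd-tmpfiles-setup-dev.service"},
}
# Same model with the seed ordered after local-fs-pre.target instead.
FIXED = {**BROKEN, SEED: {"local-fs-pre.target"}}

if __name__ == "__main__":
    print("broken:", " -> ".join(find_cycle(BROKEN)) or "no cycle")
    print("fixed: ", " -> ".join(find_cycle(FIXED)) or "no cycle")
```

In this model the original ordering closes a loop through local-fs.target and the corrected one does not. Whether that holds on a real image depends on the full set of unit files, which is what the CI check reads.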

/var — machine-local, never shipped

/var belongs to the machine, not the image. ostree/bootc populate it once (from systemd-tmpfiles factories) and never touch it again — and conversely, anything an installer environment puts in its own /var does not survive into the deployed system. Margine hits this head-on with its preinstalled Flatpaks (which live in /var/lib/flatpak):

# This kickstart's only job is to rsync the populated
# /var/lib/flatpak from the installer rootfs to the target's
# /var/lib/flatpak. ostree+bootc reset /var per-deployment when
# they install, so without this rsync the Flatpaks would be lost
# at first reboot.
...
rsync -aAXUHKP --filter='-x security.selinux' /var/lib/flatpak "$DEPLOY_DIR/var/lib/"

/var/home/daniel/dev/margine-image/live-env/src/anaconda/post-scripts/install-flatpaks.ks

Rule of thumb when designing a feature: if it must survive updates and differ per machine → /var; if it is host configuration → /etc; everything else → /usr at build time.

1.5 bootc: the OCI image as the OS transport

Classic rpm-ostree distros (Silverblue circa Fedora 33) pulled commits from a dedicated ostree remote — distro-hosted infrastructure speaking the ostree wire format, with static deltas generated server-side. bootc replaces the transport: the ostree commit is encapsulated in a standard OCI container image, pushed to any container registry, and the client (bootc upgrade / bootc switch) pulls it like any container. Internally it is still ostree — layers unpack into the same object store, deployments work identically — but the distribution problem is outsourced to registries.

This makes "building a distro" literally a container build. Margine's entire image is a four-RUN Containerfile ending with a structural lint:

RUN --mount=type=bind,from=ctx,source=/,target=/ctx \
    --mount=type=cache,dst=/var/cache \
    --mount=type=tmpfs,dst=/tmp \
    /ctx/build.sh

# ----- Lint: verify final image is a valid bootc container -----
RUN bootc container lint

/var/home/daniel/dev/margine-image/Containerfile (lines 49-53 trimmed, 69-70)

bootc container lint fails the build if the image violates bootc invariants (content in /var, missing kernel layout, bad /usr structure) — the cheapest possible guardrail, run before any artifact leaves the builder.

Switching a machine onto (or between) images is one command. Margine's installer wires the freshly installed system to the registry so future bootc upgrade calls track the published tag:

%post --erroronfail
# Point the freshly installed system at our public registry so
# subsequent `bootc upgrade` calls follow margine:stable.
bootc switch --mutate-in-place --transport registry ghcr.io/daniel-g-carrasco/margine:stable
%end

/var/home/daniel/dev/margine-image/live-env/src/anaconda/post-scripts/bootc-switch.ks

And the documented adoption path for an existing Fedora Atomic / Bluefin machine, from the Containerfile header:

rpm-ostree rebase ostree-image-signed:docker://ghcr.io/daniel-g-carrasco/margine:stable

/var/home/daniel/dev/margine-image/Containerfile (line 17)

rpm-ostree rebase and bootc switch are two clients of the same mechanism: repoint the origin, stage a deployment, reboot. The ostree-image-signed: prefix enforces signature policy from /etc/containers/policy.json (signing chapter).

Rechunking: making OCI layers behave like ostree deltas

Naive Containerfile layering is hostile to updates: any change in an early RUN invalidates every later layer, so users re-download gigabytes for a one-package bump. Margine repacks the final image with hhd-dev/rechunk, which splits content into stable, content-defined chunks (kernel, big packages, shared data each in their own layer) so unchanged chunks dedupe across releases:

- name: ReChunk image
  id: rechunk
  uses: hhd-dev/rechunk@5fbe1d3a639615d2548d83bc888360de6267b1a2 # v1.2.4
  with:
    ref: ${{ env.IMAGE_NAME }}:${{ steps.metadata.outputs.version }}
    version: ${{ env.CANDIDATE_TAG }}.${{ steps.date.outputs.ymd }}
    labels: |
      ...
      containers.bootc=1
    revision: ${{ github.sha }}

/var/home/daniel/dev/margine-image/.github/workflows/build.yml (lines 448-464, trimmed)

Practical effect: day-to-day bootc upgrade downloads shrink from "most of the image" to "the layers that actually changed", approximating ostree static deltas on plain registry infrastructure.
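The arithmetic behind that claim is a set difference over layer digests: a client fetches only the layers it does not already hold. The sketch below compares a naively layered image with a rechunked one. The digests and sizes are invented for illustration and do not describe Margine's real layers.

```python
def download_size(local: dict[str, int], remote: dict[str, int]) -> int:
    """Size to fetch: remote layers whose digest is not stored locally."""
    return sum(size for digest, size in remote.items() if digest not in local)


# Hypothetical layer digests and sizes in MiB, for illustration only.
# Naive layering: a change in an early RUN gives every later layer a new digest.
NAIVE_OLD = {"base@1": 4000, "run1@1": 1500, "run2@1": 800}
NAIVE_NEW = {"base@1": 4000, "run1@2": 1500, "run2@2": 800}

# Rechunked: content is grouped into stable chunks, so one package bump
# changes a single small layer.
CHUNKED_OLD = {"kernel@1": 300, "mesa@1": 200, "gnome@1": 900, "rest@1": 4900}
CHUNKED_NEW = {"kernel@1": 300, "mesa@2": 200, "gnome@1": 900, "rest@1": 4900}

if __name__ == "__main__":
    print("naive update:   ", download_size(NAIVE_OLD, NAIVE_NEW), "MiB")
    print("rechunked update:", download_size(CHUNKED_OLD, CHUNKED_NEW), "MiB")
```

Both images have the same total size; only the stability of the layer boundaries differs, and that alone decides how much an update costs.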

Lesson: os-release symlink vs composefs timing (Fix A wind-down)

Symptom: early Margine builds failed boot with os-release file is missing: /etc/os-release → ../usr/lib/os-release could not resolve because composefs was not fully assembled when switch-root read it; the image's commit metadata was also inherited from Bluefin rather than regenerated.

Root cause: a buildah-produced image is not in ostree-canonical form; ordering assumptions that hold on Fedora/Bluefin images broke.

Fix (initial): write os-release as a regular file ("Fix A").

Fix (final): rechunk re-commits the image into ostree-canonical state, composefs is fully set up before switch-root, and the canonical symlink was restored, deleting workaround surface instead of accumulating it.

/home/daniel/dev/margine-fedora-atomic/docs/lessons-learned/2026-06-03-rechunk-and-fixb.md

Where this is heading: sealed images

ADR 0007 tracks the next step of the model: Sealed Bootable Container Images (systemd-boot + UKI + composefs with fs-verity, every /usr page-read verified against a vendor-signed Merkle root). It changes the signing story substantially — UKI signing replaces per-module sign-file, the MOK enrollment dance disappears, GRUB goes away — and Margine deliberately waits for upstream (trigger: Bluefin/Bazzite shipping sealed :stable). See /home/daniel/dev/margine-fedora-atomic/docs/adr/0007-sealed-bootable-images-tracker.md.

1.6 Comparing the atomic models

rpm-ostree-native vs bootc

Aspect                       | rpm-ostree-native (ostree remote)           | bootc (OCI)
Transport                    | distro-hosted ostree repo + static deltas   | any OCI registry
Build tooling                | rpm-ostree compose (treefile), distro infra | Containerfile + buildah/podman, any CI
Signing                      | GPG on commits                              | sigstore/cosign on image digests
Derivation                   | hard (re-compose)                           | trivial (FROM + RUN)
Client-side package layering | yes (rpm-ostree install)                    | discouraged; bake into image instead
Hosting cost                 | you run the repo                            | GitHub/quay run the registry

Fedora Atomic today is a hybrid: bootc transport, rpm-ostree still present for layering. Margine's stance (ADR 0005, docs/01-architecture.md): no runtime layering as policy — "repeated host helpers should later move into a native image or bootc build" — because every layered package re-applies on each upgrade and reintroduces per-machine drift.

The other atomic architectures

  • ABRoot (Vanilla OS 2): two root partitions; transactions are applied from an OCI image to the inactive root, bootloader flips on reboot. OCI-based like bootc but partition-granular: 2× root disk cost, no content dedup between roots, simpler mental model.
  • transactional-update + btrfs/snapper (openSUSE MicroOS/Aeon/Kalpa): zypper runs inside a new btrfs snapshot which becomes the default subvolume on reboot; rollback = boot an older snapshot. Atomic updates but not image-based: each machine still runs a package manager, so fleets drift; there is no single testable artifact. Contrast with Margine's explicit stance: "System rollback comes from ostree/rpm-ostree deployments, not from a custom Btrfs snapshot scheme" (docs/01-architecture.md).
  • NixOS generations: declarative config evaluated into immutable /nix/store closures; every rebuild is a bootloader generation, rollback is free. The most expressive model, and the system is its config, at the cost of an entirely parallel packaging ecosystem (no FHS, patchelf/wrappers for foreign binaries) and a language with a steep learning curve. ostree tracks trees; Nix tracks build graphs.
  • A/B partition slots (ChromeOS, Android, SteamOS 3, Flatcar): full image written to the inactive slot, bootloader flips, failed boots auto-revert (boot counters). Maximally robust and verifiable (dm-verity per slot), but 2× space, fixed OS size, and OS customization is essentially unsupported: SteamOS makes / writable only via a "developer mode" whose changes the next update wipes.
  • frzr (ChimeraOS): image tarballs deployed into btrfs subvolumes, bootloader points at the active one. A/B semantics with snapshot-level dedup; niche tooling.

1.7 Why Universal Blue (and Margine) picked OCI

uBlue's bet — inherited wholesale by everything FROM ghcr.io/ublue-os/* — comes down to using infrastructure that already exists at planet scale:

  1. Registry infrastructure is free and ubiquitous. GHCR/quay host the artifacts, handle bandwidth, auth, and tag immutability. An ostree remote with static deltas is bespoke infrastructure a hobby distro cannot realistically operate; Margine ships from a personal GitHub account.
  2. Layer dedup ≈ delta updates. OCI layers (especially after rechunking, §1.5) give incremental downloads without server-side delta generation. Bonus: FROM bluefin-dx means Margine users share base layers with every other uBlue derivative on their disk and on the registry.
  3. The signing ecosystem already exists. cosign signs by digest, policy.json enforces at pull, SBOMs attach as OCI referrers. Margine's pipeline (build → syft SBOM → rechunk → push → cosign sign-by-digest) is standard container supply-chain tooling, not distro-specific machinery (CI chapter).
  4. The toolchain is the container toolchain. Containerfiles, buildah, BuildKit secrets (Margine's MOK keys enter the build as --mount=type=secret and never persist in a layer — see the Containerfile lines 39-46), GitHub Actions, skopeo, podman run for inspection. Every contributor who has built a container can derive a distro. This is the whole "custom image" community model: Bazzite, Bluefin, Aurora, and hundreds of personal images are Containerfiles in public repos.

The trade-off accepted: OCI was not designed to carry bootable filesystems — hence rechunk, bootc container lint, and the canonical-form lessons of §1.5. The friction is real but front-loaded onto the image builder; the user-facing mechanics (staged deployments, 3-way merge, rollback) remain pure ostree.

Alternatives & other distros

Approach | Used by | One-line trade-off
bootc / OCI image on ostree | Margine, Bluefin, Bazzite, Aurora, uCore | registry-native, derivable via FROM, cosign signing; needs rechunk for delta-efficient updates
rpm-ostree-native (ostree remote) | Fedora Silverblue/Kinoite (classic path), Endless OS, Fedora CoreOS | proven deltas + GPG, but distro-hosted infra and hard derivation
ABRoot (OCI → A/B root partitions) | Vanilla OS 2 | atomic + OCI-sourced, partition granularity: 2× root space, no object-store dedup
transactional-update + btrfs/snapper | openSUSE MicroOS, Aeon, Kalpa | atomic updates with a real package manager kept; per-machine drift, no single testable artifact
Nix generations | NixOS | fully declarative system-as-config, free rollback; parallel ecosystem, steep learning curve
A/B slots + dm-verity | ChromeOS, Android, SteamOS 3, Flatcar | auto-revert on boot failure, strongest integrity; 2× space, OS effectively closed to customization
btrfs deployment images (frzr) | ChimeraOS | simple A/B-on-btrfs with dedup; small ecosystem, image-tarball transport
systemd-sysupdate + UKI partition images | GNOME OS, ParticleOS | systemd-native A/B with measured boot; young tooling, no derivation story comparable to FROM
swupd manifest/bundle deltas | Clear Linux (discontinued 2025) | fine-grained per-file deltas without reboot atomicity; bespoke infra died with the distro
Sealed bootable containers (UKI + composefs/fs-verity) | Fedora/bootc test images, future Bluefin (tracked in Margine ADR 0007) | fully verified boot chain and sane TPM2 defaults; immature, breaks current MOK/GRUB pipelines

Margine sits in row one deliberately: it inherits Bluefin DX's maintenance (codecs, Mesa, virt stack — ADR 0005's "stop hand-rolling 70% of Bluefin") and spends its own effort only on the deltas the next chapters cover: a signed CachyOS kernel, GNOME defaults, branding, an installer, and a CI gate that refuses to ship an image that didn't boot.