Handbook · Chapter 7 of 12 · 10 min read

Rechunking: shipping a 14 GB OS as reusable chunks

The build so far produces a working bootc image. This chapter is about making it cheap to ship: how OCI layering interacts with ostree on the client, why the layers buildah emits are hostile to incremental updates, and how Margine re-layers the image with hhd-dev/rechunk before pushing.

7.1 Why naive podman layers churn

Margine's Containerfile (/var/home/daniel/dev/margine-image/Containerfile) has four RUN stages on top of Bluefin DX: the CachyOS kernel swap, the build.sh orchestrator, the extensions bake, and bootc container lint. That yields four Margine-owned layers stacked on Bluefin's own layer set, and the result is pathological for updates:

  • Layer identity is the digest of the layer tarball, not of the files. A RUN that re-executes produces a new tar (new mtimes, possibly a different entry order), so the layer digest changes even when zero bytes of content changed. Every weekly rebuild (the Sunday cron exists precisely to pick up upstream Bluefin changes) re-runs all four stages.
  • Layers group files by when they were written, not by how often they change. The kernel stage layer contains vmlinuz + all modules + a ~300 MB initramfs; the build.sh layer contains everything from branding PNGs to the offline docs mirror. One changed wallpaper invalidates the whole multi-GB blob.
  • The base is no better. FROM ghcr.io/ublue-os/bluefin-dx:stable means a base rebuild upstream shifts every parent layer digest; the client re-pulls them all even though 95% of the file content is identical.
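
The first bullet can be reproduced in a few lines. This is a minimal sketch using only the Python standard library, not anything from Margine's pipeline: it packs the same bytes into two in-memory tarballs that differ only in mtime and hashes them. The file name, contents, and timestamps are made up for illustration, and the sketch hashes the uncompressed tar, whereas a registry blob is usually compressed first.

```python
import hashlib
import io
import tarfile


def layer_digest(files: dict[str, bytes], mtime: int) -> str:
    """Pack files into an in-memory tar and return the sha256 of the tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name=name)
            info.size = len(files[name])
            info.mtime = mtime  # the only thing that varies between builds here
            tar.addfile(info, io.BytesIO(files[name]))
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()


# Hypothetical file: identical content in both "weekly builds".
files = {"usr/share/backgrounds/wallpaper.png": b"same bytes every week"}

week1 = layer_digest(files, mtime=1_700_000_000)
week2 = layer_digest(files, mtime=1_700_604_800)  # one week later
pinned_a = layer_digest(files, mtime=0)
pinned_b = layer_digest(files, mtime=0)

print("rebuild keeps digest:", week1 == week2)
print("normalized metadata keeps digest:", pinned_a == pinned_b)
```

The second comparison is the property the rest of the chapter relies on: once metadata is normalized, identical content yields an identical digest.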

For a ~14 GB image (Margine bakes ~29 Flatpaks into /var/lib/flatpak, plus a second kernel's worth of modules), "every update is a near-full download" is not acceptable. The fix is to throw away the build-time layer boundaries entirely and re-cut them along content lines.

7.2 What the client does with layers

bootc/ostree clients don't run the container — they import it. Each OCI layer is unpacked into the ostree object store, where files are content-addressed by checksum. Two consequences:

  1. Disk dedup is automatic and file-granular — identical files across deployments are stored once, regardless of layer layout.
  2. Network cost is layer-granular — the client skips any layer blob whose digest it already has, and downloads the rest whole.

So layer layout doesn't affect disk usage, only download size. The goal of rechunking is purely: make layer digests stable across releases so the skip path triggers as often as possible. The same property helps the registry — skopeo copy won't re-upload blobs GHCR already has, so weekly pushes are mostly no-ops too.
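
The layer-granular skip rule is easy to model. The sketch below is an illustration under stated assumptions, not bootc's actual pull code: the Layer type, the digests, and every size are invented, chosen only so the totals resemble a ~14 GB image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    digest: str
    size_mb: int


def download_cost_mb(manifest: list[Layer], local_digests: set[str]) -> int:
    """Sum the sizes of layers whose digest is not already present locally."""
    return sum(layer.size_mb for layer in manifest if layer.digest not in local_digests)


# Naive build: two fat RUN layers, both re-digested by the weekly rebuild.
naive_old = [Layer("sha256:run-a-week1", 6000), Layer("sha256:run-b-week1", 8000)]
naive_new = [Layer("sha256:run-a-week2", 6000), Layer("sha256:run-b-week2", 8000)]

# Re-layered build: 70 chunks of 200 MB, only the last one changed this week.
chunked_old = [Layer(f"sha256:chunk-{i}", 200) for i in range(70)]
chunked_new = chunked_old[:-1] + [Layer("sha256:chunk-69-week2", 200)]

naive_cost = download_cost_mb(naive_new, {layer.digest for layer in naive_old})
chunked_cost = download_cost_mb(chunked_new, {layer.digest for layer in chunked_old})

print(naive_cost, chunked_cost)
```

Both images hold the same 14,000 MB; only the number of digests that survive the rebuild differs.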

7.3 hhd-dev/rechunk: ostree-aware re-layering

hhd-dev/rechunk (built by antheas for Bazzite, now used across Universal Blue) takes the final filesystem of the built image and repacks it:

  1. Flattens the image and commits it into an ostree repo. The commit canonicalizes the tree (zeroed mtimes, normalized ownership, ostree's /usr/etc factory view) — this is what makes output deterministic: same file content in, same chunk digests out.
  2. Re-splits the commit into a few dozen layers ("chunks") grouped by RPM package ownership and update frequency, instead of by RUN boundary. The kernel and its modules land in their own chunks; GNOME lands in others; rarely-changing Flatpak runtimes in others still.
  3. Emits a fresh OCI image with regenerated ostree metadata (ostree.commit, ostree.linux) and whatever labels/version you declare.

Result: a kernel bump changes the kernel chunks and the commit metadata; everything else keeps its digest from last week, and clients download tens of MB instead of GB. The chunking is content-addressed, not diff-based — there is no "previous image" dependency at pull time, just blob digests that happen to repeat.
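
The idea of cutting chunks along package ownership can be shown with a toy model. This is emphatically not rechunk's algorithm (which also weighs update frequency and caps the layer count); it is a sketch assuming a simple path-to-package map, with hypothetical paths and contents, to show why a change in one package leaves the other chunk digests alone.

```python
import hashlib
from collections import defaultdict


def chunk_digests(files: dict[str, bytes], owner: dict[str, str]) -> dict[str, str]:
    """Group files by owning package and hash each group in a canonical order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for path in files:
        groups[owner.get(path, "unowned")].append(path)

    digests = {}
    for package, paths in groups.items():
        h = hashlib.sha256()
        for path in sorted(paths):  # canonical order; no mtimes are involved
            h.update(path.encode() + b"\0")
            h.update(hashlib.sha256(files[path]).digest())
        digests[package] = h.hexdigest()
    return digests


# Hypothetical ownership map and file contents.
owner = {
    "usr/lib/modules/6.x/vmlinuz": "kernel",
    "usr/bin/gnome-shell": "gnome-shell",
    "usr/lib64/libc.so.6": "glibc",
}
last_week = {
    "usr/lib/modules/6.x/vmlinuz": b"kernel build A",
    "usr/bin/gnome-shell": b"shell binary",
    "usr/lib64/libc.so.6": b"libc",
}
this_week = {**last_week, "usr/lib/modules/6.x/vmlinuz": b"kernel build B"}

old = chunk_digests(last_week, owner)
new = chunk_digests(this_week, owner)
changed = sorted(pkg for pkg in new if new[pkg] != old.get(pkg))
print(changed)
```

Only the chunk owning the changed file gets a new digest, and neither run needs to know about the other: the stability comes from hashing content in a fixed order.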

The actual invocation

# /var/home/daniel/dev/margine-image/.github/workflows/build.yml (lines 448-464)
- name: ReChunk image
  id: rechunk
  uses: hhd-dev/rechunk@5fbe1d3a639615d2548d83bc888360de6267b1a2 # v1.2.4
  with:
    ref: ${{ env.IMAGE_NAME }}:${{ steps.metadata.outputs.version }}
    version: ${{ env.CANDIDATE_TAG }}.${{ steps.date.outputs.ymd }}
    labels: |
      org.opencontainers.image.title=${{ env.IMAGE_NAME }}
      org.opencontainers.image.description=${{ env.IMAGE_DESC }}
      org.opencontainers.image.source=https://github.com/${{ github.repository }}
      org.opencontainers.image.url=https://github.com/${{ github.repository }}
      org.opencontainers.image.vendor=${{ github.repository_owner }}
      org.opencontainers.image.licenses=Apache-2.0
      io.artifacthub.package.keywords=${{ env.IMAGE_KEYWORDS }}
      io.artifacthub.package.license=Apache-2.0
      containers.bootc=1
    revision: ${{ github.sha }}

Notes on each input:

  • ref — the locally-built localhost/margine:candidate.<...> image from the buildah step. Rechunk reads it out of root containers-storage (sudo podman create), which is why the build step runs sudo buildah build directly instead of a rootless action wrapper:
# build.yml (lines 254-256)
# NOTE: no "Move to root storage" step here — `sudo buildah build`
# above already writes to /var/lib/containers (root storage),
# which is exactly where rechunk's `sudo podman create` looks.

(An earlier iteration built rootless and round-tripped through an oci-archive to move the image; going direct removed that bounce.)

  • version: becomes org.opencontainers.image.version and the version string bootc status shows users, e.g. candidate.20260610. Date-stamped so every build is distinguishable even when content barely changed.
  • labels: re-declared in full. Rechunk writes a fresh manifest; labels applied by buildah at build time do not carry over, so anything you want on the published image must be listed here. containers.bootc=1 marks the image as a bootable container for tooling (Anaconda, bootc itself).
  • revision: sets org.opencontainers.image.revision=<git sha>, the exact margine-image commit that produced the artifact.
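
Because labels silently vanish unless they are re-declared, a post-rechunk check is cheap insurance. The following is a hypothetical gate, not a step that exists in build.yml: it assumes the labels have already been loaded into a dict (for example from the Labels field that skopeo inspect prints), and the required list and sample values are illustrative choices.

```python
# Hypothetical minimum set; extend it to whatever the published image must carry.
REQUIRED_LABELS = (
    "org.opencontainers.image.title",
    "org.opencontainers.image.version",
    "org.opencontainers.image.revision",
    "containers.bootc",
)


def missing_labels(labels: dict[str, str], required=REQUIRED_LABELS) -> list[str]:
    """Return required label keys that are absent or empty, in declaration order."""
    return [key for key in required if not labels.get(key)]


# Illustrative label sets: one fully declared, one where only the version survived.
declared = {
    "org.opencontainers.image.title": "margine",
    "org.opencontainers.image.version": "candidate.20260610",
    "org.opencontainers.image.revision": "0123abc",
    "containers.bootc": "1",
}
forgotten = {"org.opencontainers.image.version": "candidate.20260610"}

print(missing_labels(declared))
print(missing_labels(forgotten))
```

A non-empty result would be a reason to fail the job before the push rather than discover the gap on a client.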

Pipeline placement

Order in build_push matters and is disk-driven (GitHub's ubuntu-24.04 runners have ~14 GiB free by default; the job starts by freeing ~30 GiB with ublue-os/remove-unwanted-software):

  1. buildah build → local root storage.
  2. First-boot asset validation (blocks rechunk on regression — fail at minute 22, not after a push).
  3. SBOM via podman export + syft dir: (generated pre-rechunk, which is safe):
# build.yml (lines 270-272)
# Pre-rechunk is fine: rechunk doesn't add or remove packages,
# it only repacks the layer boundaries for delta efficiency. The
# SBOM describes the same package inventory either way.
  4. Reclaim the ~14 GB expanded SBOM rootfs, because "rechunk needs disk for its own staging" (build.yml lines 436-438).
  5. Rechunk.
  6. Push.

Push and digest capture

The push step copies the rechunked output ref to every tag and captures the manifest digest for the downstream cosign job (signing by digest, never by tag):

# build.yml (lines 483-492, trimmed)
for tag in ${{ steps.metadata.outputs.tags }}; do
  sudo skopeo copy --retry-times 3 \
    --dest-creds="${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" \
    --digestfile=/tmp/digest.txt \
    "${{ steps.rechunk.outputs.ref }}" \
    "docker://${IMG_FULL}:${tag}"
  if [[ -z "$DIGEST" ]]; then
    DIGEST="$(cat /tmp/digest.txt)"
  fi
done

All tags point at the same manifest; the digest from the first copy is reused. Later, promotion to :stable is skopeo copy --preserve-digests from the candidate digest (chapter 8) — no rebuild, no re-rechunk, the bytes users pull are the bytes that smoke-booted.
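
Signing by digest only helps if the captured string really is a digest. As a sketch of that hand-off (the function name and the image name are hypothetical, and this is not how Margine's cosign job is written), the contents of the digest file can be validated and turned into an image@digest reference:

```python
import re

# A sha256 manifest digest: the algorithm prefix plus 64 lowercase hex characters.
DIGEST_RE = re.compile(r"sha256:[0-9a-f]{64}")


def pin_by_digest(image: str, digest_file_text: str) -> str:
    """Validate the captured digest and build an image@digest reference."""
    digest = digest_file_text.strip()  # tolerate a trailing newline
    if not DIGEST_RE.fullmatch(digest):
        raise ValueError(f"not a sha256 manifest digest: {digest!r}")
    return f"{image}@{digest}"


# Made-up digest standing in for the contents of /tmp/digest.txt.
example_digest = "sha256:" + "ab" * 32 + "\n"
pinned = pin_by_digest("ghcr.io/example/margine", example_digest)
print(pinned)
```

Rejecting anything that is not a full digest keeps a tag or an empty file from ever reaching the signing step.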

7.4 Rechunk is not just an optimization: composefs canonicalization

Margine learned that rechunk's ostree re-commit changes boot semantics, not just download size. Three real incidents, all from /var/home/daniel/dev/margine-fedora-atomic/docs/lessons-learned/.

Lesson: without rechunk, switch-root cannot find os-release

  • Symptom: first boots died in the initramfs with Failed to switch root: os-release file is missing, even with the file present in the image.
  • Root cause: without rechunk, the published image was not in ostree-canonical form and composefs was not fully set up by the time switch-root read /etc/os-release — the canonical /etc/os-release → ../usr/lib/os-release symlink couldn't be followed. The interim "Fix A" shipped both paths as regular files, which routed around exactly one symptom while anything else depending on early /usr would still fail quietly.
  • Fix: wiring rechunk into build.yml (2026-06-01) re-commits the image into ostree-canonical state, so composefs is up before switch-root — same as upstream Fedora/Bluefin. The workaround was then deleted and the canonical layout restored:
# /var/home/daniel/dev/margine-image/build_files/10-os-identity/install.sh (lines 80-87)
# /usr/lib/os-release — the canonical location written as a regular file.
printf '%s\n' "$OS_RELEASE_CONTENT" > /usr/lib/os-release
chmod 0644 /usr/lib/os-release

# /etc/os-release — relative symlink to the canonical location.
ln -sf ../usr/lib/os-release /etc/os-release

See 2026-06-03-rechunk-and-fixb.md for the wind-down validation (manual build → QEMU smoke-boot → merge).

Lesson: rechunk strips the /etc/passwd seed (Bug 6 v2)

  • Symptom: Layer A validation confirms 65 entries in /etc/passwd at the end of buildah; a fresh VM rebased to the published image has 1. Boot journal fills with Failed to resolve group 'audio'/'kvm'/'tty'; TPM and audio permissions silently break.
  • Root cause: rechunk re-commits the image as an ostree-canonical tree and in doing so strips the build-time-seeded /etc/passwd and /etc/group from the /usr/etc factory view (verified 2026-05-31). ostree's 3-way /etc merge on rebase then drops every system user except root and the human account.
  • Fix: stop depending on rechunk preserving /etc at all — ship an idempotent boot-time oneshot that reseeds from the /usr/lib factory copies when /etc/passwd looks stripped:
# build_files/system_files/usr/lib/systemd/system/margine-seed-etc-passwd.service
# Workaround: ship a systemd oneshot that re-applies the seed at
# every boot, before sysinit. Idempotent (only seeds if /etc/passwd
# is below the entry threshold). Doesn't depend on rechunk preserving
# /etc — it doesn't need to.

The unit ordering itself caused a follow-up incident (an After=local-fs.target + Before=systemd-sysusers cycle that systemd broke by disabling systemd-tmpfiles-setup-dev, timing out every .device unit into emergency.target). The corrected ordering is baked into the unit with the rationale inline:

# system_files/.../margine-seed-etc-passwd.service (unit body, comment trimmed)
DefaultDependencies=no
# ... DO NOT add After=local-fs.target: it creates an ordering cycle
# through systemd-tmpfiles-setup-dev.service ... (incident 2026-06-01)
After=local-fs-pre.target
Before=systemd-sysusers.service systemd-tmpfiles-setup.service sysinit.target
ConditionFileNotEmpty=/usr/lib/passwd
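
The seeding logic the oneshot is described as performing ("only seeds if /etc/passwd is below the entry threshold") can be sketched as follows. This is an assumption-laden illustration, not the script Margine ships: the function name, the threshold value, and the merge rule (append factory entries whose user name is missing) are all hypothetical, and the demo runs against temporary files rather than the real /etc.

```python
import tempfile
from pathlib import Path


def seed_passwd(etc_passwd: Path, factory: Path, threshold: int) -> bool:
    """Append factory entries that are missing when etc_passwd looks stripped.

    Returns True if the file was rewritten, False if it was left alone.
    """
    current = [line for line in etc_passwd.read_text().splitlines() if line.strip()]
    if len(current) >= threshold:
        return False  # looks healthy: doing nothing keeps reruns idempotent

    known = {line.split(":", 1)[0] for line in current}
    missing = [
        line
        for line in factory.read_text().splitlines()
        if line.strip() and line.split(":", 1)[0] not in known
    ]
    if not missing:
        return False
    etc_passwd.write_text("\n".join(current + missing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    etc = Path(tmp, "passwd")
    factory = Path(tmp, "factory-passwd")
    etc.write_text("root:x:0:0:root:/root:/bin/bash\n")  # the stripped state
    factory.write_text(
        "root:x:0:0:root:/root:/bin/bash\n"
        "bin:x:1:1:bin:/bin:/usr/sbin/nologin\n"
        "daemon:x:2:2:daemon:/sbin:/usr/sbin/nologin\n"
    )
    first_run = seed_passwd(etc, factory, threshold=3)
    second_run = seed_passwd(etc, factory, threshold=3)
    entries = len(etc.read_text().splitlines())

print(first_run, second_run, entries)
```

Existing entries are kept as they are and only absent names are added, so running the unit on every boot is harmless once the file is populated.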

Lesson: inherited OCI labels describe the parent, not you

  • Symptom (pre-rechunk era): boot drops to dracut emergency shell at initrd-switch-root.service; bootloader entries point at deployment hashes that don't exist on disk.
  • Root cause: FROM bluefin-dx inherits all of Bluefin's OCI labels, including ostree.linux=<bluefin-kernel-version>. bootc/rpm-ostree consult that label at deploy time to wire the bootloader entry and locate /usr/lib/modules/<label>/ — which no longer existed after the CachyOS kernel swap.
  • Fix: a workflow step rewrote the label from the actual installed kernel (buildah config --label ostree.linux=<kver>); with rechunk in place the ostree metadata labels (ostree.commit, ostree.linux) are regenerated from the re-committed tree, collapsing that whole workaround class. Rule: any derived image that materially changes what an inherited label describes must overwrite it — rechunk does this for the ostree ones by construction.
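
The rule in the last bullet lends itself to a mechanical check: the ostree.linux label must name a directory that actually exists under /usr/lib/modules. The sketch below is a hypothetical validation helper, not part of bootc or rechunk, and the kernel version strings are invented.

```python
def kernel_label_ok(labels: dict[str, str], module_dirs: list[str]) -> bool:
    """True when the ostree.linux label names an installed modules directory."""
    kver = labels.get("ostree.linux")
    return kver is not None and kver in module_dirs


# Invented versions: the label inherited from the base vs. the swapped-in kernel.
installed = ["6.9.1-cachyos"]
inherited = {"ostree.linux": "6.9.0-base"}
rewritten = {"ostree.linux": "6.9.1-cachyos"}

print(kernel_label_ok(inherited, installed))
print(kernel_label_ok(rewritten, installed))
```

In a real pipeline the second argument would come from listing /usr/lib/modules inside the built image; here it is passed in directly to keep the example self-contained.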

7.5 zstd:chunked and partial pulls

Layer reuse is coarse: a chunk either matches or is re-downloaded whole. zstd:chunked is the finer-grained complement: a compression format that embeds a table of contents (per-file offsets + digests) in zstd skippable frames. A containers/storage client with partial pulls enabled can fetch only the file ranges it lacks and dedup the rest against local storage; it also plugs directly into composefs. It stays valid zstd, so unaware clients just decompress normally. eStargz plays the same trick inside gzip, but for containerd's lazy-pull snapshotter: a Kubernetes fast-start tool, not a bootc one.

Margine's skopeo copy push does not currently set --dest-compress-format zstd:chunked; the delta efficiency comes from rechunk's stable layer digests alone. The two compose — rechunk decides what the blobs are, zstd:chunked makes each blob partially fetchable — and zstd:chunked is the obvious next increment, since Fedora's own bootc base images and the bootc client stack are converging on it.
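
To make "partially fetchable" concrete, here is a toy model of a per-file table of contents. It is a sketch of the idea only: the TocEntry fields, offsets, and digests are invented and do not mirror the real zstd:chunked metadata layout or the containers/storage implementation.

```python
from typing import NamedTuple


class TocEntry(NamedTuple):
    path: str
    digest: str  # digest of the file content
    offset: int  # where the file's compressed bytes start inside the blob
    length: int  # how many compressed bytes belong to it


def ranges_to_fetch(toc: list[TocEntry], local_digests: set[str]) -> list[tuple[int, int]]:
    """Return (offset, length) ranges for files whose content is not held locally."""
    return [(e.offset, e.length) for e in toc if e.digest not in local_digests]


# Invented TOC for one changed chunk: only the kernel image is new content.
toc = [
    TocEntry("usr/lib/modules/6.x/vmlinuz", "sha256:new-kernel", 0, 12_000_000),
    TocEntry("usr/lib/modules/6.x/modules.dep", "sha256:same-dep", 12_000_000, 400_000),
    TocEntry("usr/lib/modules/6.x/modules.alias", "sha256:same-alias", 12_400_000, 600_000),
]
local = {"sha256:same-dep", "sha256:same-alias"}

needed = ranges_to_fetch(toc, local)
print(needed, sum(length for _, length in needed))
```

Rechunk decides which blob a file lives in; a TOC like this decides how much of a changed blob has to cross the network.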

7.6 Alternatives & other distros

  • hhd-dev/rechunk (Margine, Bazzite, Bluefin, Aurora): ostree-aware re-layering, stable chunks, deterministic output. Cost: a CI step that adds minutes to every build, root storage + disk staging, and a rewritten manifest (labels must be re-declared, and /etc factory handling can surprise you; see §7.4).
  • Plain Containerfile layers (early uBlue images, most homelab bootc derivatives): zero extra tooling, buildah cache works during builds — but every release re-downloads the fat RUN layers; fine for small images, painful past a few GB.
  • rpm-ostree compose image / ostree container encapsulate (stock Fedora Silverblue, Kinoite, IoT, CoreOS): composes from a treefile and emits an OCI image with built-in package-aware chunking (capped layer count, files grouped by change frequency) — the same idea as rechunk, but it requires owning the compose; it doesn't apply to a FROM-based derived build.
  • Flatten to one layer (podman build --squash): simplest possible artifact, kills all reuse; every update is a full-image download. Only defensible for tiny images or air-gapped one-shot delivery.
  • eStargz + stargz-snapshotter (containerd/k8s world): lazy pulls, i.e. start before the image finishes downloading. Solves container startup latency, not OS update deltas; no bootc integration.
  • zstd:chunked (Fedora bootc base images, podman ecosystem direction): per-file TOC, partial pulls, composefs-friendly local dedup; complementary to rechunk rather than a replacement.
  • ostree static deltas over plain HTTP (Endless OS, pre-OCI Fedora Atomic): server-precomputed binary deltas between commits — excellent download efficiency, but you run an ostree repo server instead of reusing registry infrastructure.
  • openSUSE MicroOS/Aeon: no image artifact at all — transactional-update installs RPMs into a new btrfs snapshot; deltas are RPM-granular, but the result is assembled per-machine rather than tested-as-built.
  • Vanilla OS (ABRoot v2): OCI images applied to A/B root partitions; registry-based like bootc but partition-image semantics, without ostree's file-level store dedup.
  • ChimeraOS (frzr): full root images as btrfs-subvolume tarballs from GitHub releases; dead simple, every update is a full download.
  • NixOS: sidesteps the problem — there is no monolithic image; the store path is the dedup unit and nix copy substitutes only missing derivations. Finest granularity of the lot, at the price of an entirely different model.

7.7 Takeaways

  • OCI layer digests are tar digests; rebuild churn is structural, not a buildah bug. Re-layer by content, not by RUN order.
  • Rechunk earns its place twice in Margine: small weekly downloads, and ostree-canonical commits that made composefs boot timing match upstream (retiring two boot workarounds).
  • It is also a manifest rewrite: re-declare labels, re-verify /etc factory behavior, and keep the smoke-boot gate (chapter 8) downstream of it — the artifact you test must be the post-rechunk one, and --preserve-digests promotion guarantees it's also the one users get.