Handbook · Chapter 7 of 12 · 10 min read
Rechunking: shipping a 14 GB OS as reusable chunks
The build so far produces a working bootc image. This chapter is about making it
cheap to ship: how OCI layering interacts with ostree on the client, why the
layers buildah emits are hostile to incremental updates, and how Margine
re-layers the image with hhd-dev/rechunk before pushing.
7.1 Why naive podman layers churn
Margine's Containerfile has four RUN stages on top of Bluefin DX
(/var/home/daniel/dev/margine-image/Containerfile): the CachyOS kernel swap,
the build.sh orchestrator, the extensions bake, and bootc container lint.
That yields four Margine-owned layers stacked on Bluefin's own layer set, and
the result is pathological for updates:
- Layer identity is the digest of the layer tarball, not of the files. A RUN that re-executes produces a new tar (new mtimes, new inode order), so the layer digest changes even when zero bytes of content changed. Every weekly rebuild (the Sunday cron exists precisely to pick up upstream Bluefin changes) re-runs all four stages.
- Layers group files by when they were written, not by how often they change. The kernel stage layer contains vmlinuz + all modules + a ~300 MB initramfs; the build.sh layer contains everything from branding PNGs to the offline docs mirror. One changed wallpaper invalidates the whole multi-GB blob.
- The base is no better. FROM ghcr.io/ublue-os/bluefin-dx:stable means a base rebuild upstream shifts every parent layer digest; the client re-pulls them all even though 95% of the file content is identical.
For a ~14 GB image (Margine bakes ~29 Flatpaks into /var/lib/flatpak, plus a
second kernel's worth of modules), "every update is a near-full download" is
not acceptable. The fix is to throw away the build-time layer boundaries
entirely and re-cut them along content lines.
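The tar-digest point can be reproduced in a few lines. The following is a minimal sketch, not what buildah does internally: it packs one file into an uncompressed tar twice with different mtimes and compares the sha256 digests. The file name, contents, and timestamps are made up for illustration.

```python
import hashlib
import io
import tarfile


def layer_digest(files: dict[str, bytes], mtime: int) -> str:
    """Pack files into an uncompressed tar and return its sha256 digest."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime  # the only thing that varies between "builds"
            tar.addfile(info, io.BytesIO(data))
    return "sha256:" + hashlib.sha256(buf.getvalue()).hexdigest()


# Hypothetical one-file layer; content is identical in every build.
files = {"usr/bin/hello": b"#!/bin/sh\necho hello\n"}

build_a = layer_digest(files, mtime=1_700_000_000)
build_b = layer_digest(files, mtime=1_700_600_000)  # same bytes, a week later
build_c = layer_digest(files, mtime=1_700_000_000)  # same bytes, same mtime

print(build_a == build_b)  # False: the tar header changed, so the digest did
print(build_a == build_c)  # True: identical metadata gives an identical digest
```

The second comparison is the property rechunking relies on: once metadata is pinned, identical content yields an identical blob digest.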
7.2 What the client does with layers
bootc/ostree clients don't run the container — they import it. Each OCI layer is unpacked into the ostree object store, where files are content-addressed by checksum. Two consequences:
- Disk dedup is automatic and file-granular — identical files across deployments are stored once, regardless of layer layout.
- Network cost is layer-granular — the client skips any layer blob whose digest it already has, and downloads the rest whole.
So layer layout doesn't affect disk usage, only download size. The goal of
rechunking is purely: make layer digests stable across releases so the skip
path triggers as often as possible. The same property helps the registry —
skopeo copy won't re-upload blobs GHCR already has, so weekly pushes are
mostly no-ops too.
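The layer-granular network cost can be modelled as a set difference over digests. This is an illustrative sketch only; the Layer type, digests, and sizes (in MB) are hypothetical and do not come from Margine's real manifests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    digest: str
    size_mb: int  # compressed blob size, hypothetical numbers


def mb_to_download(manifest: list[Layer], local_digests: set[str]) -> int:
    """Sum the sizes of the layers whose digest is not already held locally."""
    return sum(l.size_mb for l in manifest if l.digest not in local_digests)


# What the client already has from last week's release.
local = {"sha256:base1", "sha256:kernel1", "sha256:apps1"}

# Naive rebuild: every digest shifted although most content is unchanged.
naive = [
    Layer("sha256:base2", 9000),
    Layer("sha256:kernel2", 600),
    Layer("sha256:apps2", 4400),
]

# Stable layering: only the layer holding the changed files gets a new digest.
stable = [
    Layer("sha256:base1", 9000),
    Layer("sha256:kernel2", 600),
    Layer("sha256:apps1", 4400),
]

print(mb_to_download(naive, local))   # 14000
print(mb_to_download(stable, local))  # 600
```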
7.3 hhd-dev/rechunk: ostree-aware re-layering
hhd-dev/rechunk (built by antheas for
Bazzite, now used across Universal Blue) takes the final filesystem of the
built image and repacks it:
- Flattens the image and commits it into an ostree repo. The commit canonicalizes the tree (zeroed mtimes, normalized ownership, ostree's /usr/etc factory view); this is what makes output deterministic: same file content in, same chunk digests out.
- Re-splits the commit into ~dozens of layers ("chunks") grouped by RPM package ownership and update frequency, instead of by RUN boundary. The kernel and its modules land in their own chunks; GNOME lands in others; rarely-changing Flatpak runtimes in others still.
- Emits a fresh OCI image with regenerated ostree metadata (ostree.commit, ostree.linux) and whatever labels/version you declare.
Result: a kernel bump changes the kernel chunks and the commit metadata; everything else keeps its digest from last week, and clients download tens of MB instead of GB. The chunking is content-addressed, not diff-based — there is no "previous image" dependency at pull time, just blob digests that happen to repeat.
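A toy model makes the "only the kernel chunks change" claim concrete. This is a sketch of the idea, not rechunk's actual algorithm: it groups files by a hypothetical path-to-package map, packs each group into a tar with zeroed mtimes and sorted entries, and compares chunk digests across two releases. Paths, package names, and contents are invented.

```python
import hashlib
import io
import tarfile
from collections import defaultdict


def chunk_digests(tree: dict[str, bytes], owner: dict[str, str]) -> dict[str, str]:
    """Group files by owning package and digest one canonical tar per group."""
    groups: dict[str, dict[str, bytes]] = defaultdict(dict)
    for path, data in tree.items():
        groups[owner.get(path, "unowned")][path] = data

    digests = {}
    for chunk, files in groups.items():
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for path in sorted(files):  # fixed entry order
                info = tarfile.TarInfo(name=path)
                info.size = len(files[path])
                info.mtime = 0  # canonical timestamp
                tar.addfile(info, io.BytesIO(files[path]))
        digests[chunk] = hashlib.sha256(buf.getvalue()).hexdigest()
    return digests


# Hypothetical package ownership, as an RPM database lookup might report it.
owner = {
    "usr/lib/modules/6.1/vmlinuz": "kernel",
    "usr/lib/modules/6.2/vmlinuz": "kernel",
    "usr/bin/gnome-shell": "gnome-shell",
    "usr/share/backgrounds/wall.png": "branding",
}

week1 = {
    "usr/lib/modules/6.1/vmlinuz": b"kernel 6.1",
    "usr/bin/gnome-shell": b"shell binary",
    "usr/share/backgrounds/wall.png": b"png bytes",
}
# Week 2: kernel bump only.
week2 = {
    "usr/lib/modules/6.2/vmlinuz": b"kernel 6.2",
    "usr/bin/gnome-shell": b"shell binary",
    "usr/share/backgrounds/wall.png": b"png bytes",
}

d1 = chunk_digests(week1, owner)
d2 = chunk_digests(week2, owner)
changed = {name for name in d2 if d1.get(name) != d2[name]}
print(changed)  # {'kernel'}
```

Each chunk digest depends only on the files in that chunk, so there is no reference to the previous release anywhere in the computation.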
The actual invocation
# /var/home/daniel/dev/margine-image/.github/workflows/build.yml (lines 448-464)
- name: ReChunk image
  id: rechunk
  uses: hhd-dev/rechunk@5fbe1d3a639615d2548d83bc888360de6267b1a2 # v1.2.4
  with:
    ref: ${{ env.IMAGE_NAME }}:${{ steps.metadata.outputs.version }}
    version: ${{ env.CANDIDATE_TAG }}.${{ steps.date.outputs.ymd }}
    labels: |
      org.opencontainers.image.title=${{ env.IMAGE_NAME }}
      org.opencontainers.image.description=${{ env.IMAGE_DESC }}
      org.opencontainers.image.source=https://github.com/${{ github.repository }}
      org.opencontainers.image.url=https://github.com/${{ github.repository }}
      org.opencontainers.image.vendor=${{ github.repository_owner }}
      org.opencontainers.image.licenses=Apache-2.0
      io.artifacthub.package.keywords=${{ env.IMAGE_KEYWORDS }}
      io.artifacthub.package.license=Apache-2.0
      containers.bootc=1
    revision: ${{ github.sha }}
Notes on each input:
- ref: the locally built localhost/margine:candidate.<...> image from the buildah step. Rechunk reads it out of root containers-storage (sudo podman create), which is why the build step runs sudo buildah build directly instead of a rootless action wrapper:
# build.yml (lines 254-256)
# NOTE: no "Move to root storage" step here — `sudo buildah build`
# above already writes to /var/lib/containers (root storage),
# which is exactly where rechunk's `sudo podman create` looks.
(An earlier iteration built rootless and round-tripped through an oci-archive to move the image; going direct removed that bounce.)
- version: becomes org.opencontainers.image.version and the version string bootc status shows users, e.g. candidate.20260610. Date-stamped so every build is distinguishable even when content barely changed.
- labels: re-declared in full. Rechunk writes a fresh manifest; labels applied by buildah at build time do not carry over, so anything you want on the published image must be listed here. containers.bootc=1 marks the image as bootable-container for tooling (Anaconda, bootc itself).
- revision: org.opencontainers.image.revision=<git sha>, the exact margine-image commit that produced the artifact.
Pipeline placement
Order in build_push matters and is disk-driven (GitHub's ubuntu-24.04
runners have ~14 GiB free by default; the job starts by freeing ~30 GiB with
ublue-os/remove-unwanted-software):
- buildah build → local root storage.
- First-boot asset validation (blocks rechunk on regression: fail at minute 22, not after a push).
- SBOM via podman export + syft dir: (taken pre-rechunk, which is safe):
# build.yml (lines 270-272)
# Pre-rechunk is fine: rechunk doesn't add or remove packages,
# it only repacks the layer boundaries for delta efficiency. The
# SBOM describes the same package inventory either way.
- Reclaim the ~14 GB expanded SBOM rootfs: "rechunk needs disk for its own staging" (build.yml lines 436-438).
- Rechunk.
- Push.
Push and digest capture
The push step copies the rechunked output ref to every tag and captures the manifest digest for the downstream cosign job (signing by digest, never by tag):
# build.yml (lines 483-492, trimmed)
for tag in ${{ steps.metadata.outputs.tags }}; do
  sudo skopeo copy --retry-times 3 \
    --dest-creds="${{ github.actor }}:${{ secrets.GITHUB_TOKEN }}" \
    --digestfile=/tmp/digest.txt \
    "${{ steps.rechunk.outputs.ref }}" \
    "docker://${IMG_FULL}:${tag}"
  if [[ -z "$DIGEST" ]]; then
    DIGEST="$(cat /tmp/digest.txt)"
  fi
done
All tags point at the same manifest; the digest from the first copy is reused.
Later, promotion to :stable is skopeo copy --preserve-digests from the
candidate digest (chapter 8) — no rebuild, no re-rechunk, the bytes users pull
are the bytes that smoke-booted.
7.4 Rechunk is not just an optimization: composefs canonicalization
Margine learned that rechunk's ostree re-commit changes boot semantics, not
just download size. Three real incidents, all from
/var/home/daniel/dev/margine-fedora-atomic/docs/lessons-learned/.
Lesson: os-release symlink unreadable at switch-root (Fix A wind-down)
- Symptom: first boots died in the initramfs with "Failed to switch root: os-release file is missing", even with the file present in the image.
- Root cause: without rechunk, the published image was not in ostree-canonical form and composefs was not fully set up by the time switch-root read /etc/os-release, so the canonical /etc/os-release → ../usr/lib/os-release symlink couldn't be followed. The interim "Fix A" shipped both paths as regular files, which routed around exactly one symptom while anything else depending on early /usr would still fail quietly.
- Fix: wiring rechunk into build.yml (2026-06-01) re-commits the image into ostree-canonical state, so composefs is up before switch-root, same as upstream Fedora/Bluefin. The workaround was then deleted and the canonical layout restored:
# /var/home/daniel/dev/margine-image/build_files/10-os-identity/install.sh (lines 80-87)
# /usr/lib/os-release — the canonical location written as a regular file.
printf '%s\n' "$OS_RELEASE_CONTENT" > /usr/lib/os-release
chmod 0644 /usr/lib/os-release
# /etc/os-release — relative symlink to the canonical location.
ln -sf ../usr/lib/os-release /etc/os-release
See 2026-06-03-rechunk-and-fixb.md for the wind-down validation (manual
build → QEMU smoke-boot → merge).
Lesson: rechunk strips the /etc/passwd seed (Bug 6 v2)
- Symptom: Layer A validation confirms 65 entries in /etc/passwd at the end of buildah; a fresh VM rebased to the published image has 1. Boot journal fills with "Failed to resolve group 'audio'/'kvm'/'tty'"; TPM and audio permissions silently break.
- Root cause: rechunk re-commits the image as an ostree-canonical tree and in doing so strips the build-time-seeded /etc/passwd and /etc/group from the /usr/etc factory view (verified 2026-05-31). ostree's 3-way /etc merge on rebase then drops every system user except root and the human account.
- Fix: stop depending on rechunk preserving /etc at all; ship an idempotent boot-time oneshot that reseeds from the /usr/lib factory copies when /etc/passwd looks stripped:
# build_files/system_files/usr/lib/systemd/system/margine-seed-etc-passwd.service
# Workaround: ship a systemd oneshot that re-applies the seed at
# every boot, before sysinit. Idempotent (only seeds if /etc/passwd
# is below the entry threshold). Doesn't depend on rechunk preserving
# /etc — it doesn't need to.
The unit ordering itself caused a follow-up incident (an
After=local-fs.target + Before=systemd-sysusers cycle that systemd broke
by disabling systemd-tmpfiles-setup-dev, timing out every .device unit
into emergency.target). The corrected ordering is baked into the unit with
the rationale inline:
# system_files/.../margine-seed-etc-passwd.service (unit body, comment trimmed)
DefaultDependencies=no
# ... DO NOT add After=local-fs.target: it creates an ordering cycle
# through systemd-tmpfiles-setup-dev.service ... (incident 2026-06-01)
After=local-fs-pre.target
Before=systemd-sysusers.service systemd-tmpfiles-setup.service sysinit.target
ConditionFileNotEmpty=/usr/lib/passwd
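The idempotent "seed only if below the entry threshold" behaviour described for the oneshot can be sketched as follows. This is an assumption-laden illustration, not the script the unit actually runs: the function names, the merge-by-username strategy, and the threshold value are all hypothetical, and the demo works on temporary files rather than /etc.

```python
import tempfile
from pathlib import Path


def entry_names(text: str) -> list[str]:
    """Return the first field (user name) of every non-empty passwd line."""
    return [line.split(":", 1)[0] for line in text.splitlines() if line.strip()]


def reseed_if_stripped(etc_file: Path, factory_file: Path, threshold: int) -> bool:
    """Append missing factory entries when etc_file has fewer than threshold entries.

    Returns True when the file was modified. Existing entries are kept, so a
    locally created account survives the reseed (assumed behaviour).
    """
    current = etc_file.read_text() if etc_file.exists() else ""
    present = set(entry_names(current))
    if len(present) >= threshold:
        return False  # looks healthy: do nothing, which makes reruns no-ops

    missing = [
        line
        for line in factory_file.read_text().splitlines()
        if line.strip() and line.split(":", 1)[0] not in present
    ]
    if not missing:
        return False
    if current and not current.endswith("\n"):
        current += "\n"
    etc_file.write_text(current + "\n".join(missing) + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    etc = Path(tmp) / "passwd"
    factory = Path(tmp) / "factory-passwd"
    etc.write_text("root:x:0:0:root:/root:/bin/bash\n")  # stripped state
    factory.write_text(
        "root:x:0:0:root:/root:/bin/bash\n"
        "bin:x:1:1:bin:/bin:/sbin/nologin\n"
        "daemon:x:2:2:daemon:/sbin:/sbin/nologin\n"
    )
    first_run = reseed_if_stripped(etc, factory, threshold=3)
    second_run = reseed_if_stripped(etc, factory, threshold=3)
    names = entry_names(etc.read_text())

print(first_run, second_run, names)
```

The first call repairs the file and the second is a no-op, which is what allows the unit to run at every boot without side effects.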
Lesson: inherited OCI labels describe the parent, not you
- Symptom (pre-rechunk era): boot drops to dracut emergency shell at initrd-switch-root.service; bootloader entries point at deployment hashes that don't exist on disk.
- Root cause: FROM bluefin-dx inherits all of Bluefin's OCI labels, including ostree.linux=<bluefin-kernel-version>. bootc/rpm-ostree consult that label at deploy time to wire the bootloader entry and locate /usr/lib/modules/<label>/, which no longer existed after the CachyOS kernel swap.
- Fix: a workflow step rewrote the label from the actual installed kernel (buildah config --label ostree.linux=<kver>); with rechunk in place the ostree metadata labels (ostree.commit, ostree.linux) are regenerated from the re-committed tree, collapsing that whole workaround class.
Rule: any derived image that materially changes what an inherited label describes must overwrite it; rechunk does this for the ostree ones by construction.
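The failure mode (a label naming a kernel whose modules directory no longer exists) is easy to guard against in CI. The check below is a hypothetical sketch rather than a step from Margine's workflow; the function name and the kernel version strings are invented, and a temporary directory stands in for /usr/lib/modules.

```python
import tempfile
from pathlib import Path


def stale_kernel_label(labels: dict[str, str], modules_root: Path) -> bool:
    """True when ostree.linux names a kernel that has no modules directory."""
    kver = labels.get("ostree.linux")
    return kver is not None and not (modules_root / kver).is_dir()


with tempfile.TemporaryDirectory() as tmp:
    modules_root = Path(tmp)  # stand-in for /usr/lib/modules
    (modules_root / "6.15.0-cachyos").mkdir()  # the only kernel left after the swap

    # Labels inherited from the parent still describe the parent's kernel.
    inherited = {"ostree.linux": "6.14.0-200.fc42.x86_64", "containers.bootc": "1"}
    # The derived image overwrites the label it invalidated.
    overridden = {**inherited, "ostree.linux": "6.15.0-cachyos"}

    inherited_is_stale = stale_kernel_label(inherited, modules_root)
    overridden_is_stale = stale_kernel_label(overridden, modules_root)

print(inherited_is_stale)   # True
print(overridden_is_stale)  # False
```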
7.5 zstd:chunked and partial pulls
Layer reuse is coarse: a chunk either matches or is re-downloaded whole.
zstd:chunked is the finer-grained complement — a compression format that
embeds a table of contents (per-file offsets + digests) in zstd skippable
frames. A containers/storage client with partial pulls enabled can fetch
only the file ranges it lacks and dedup the rest against local storage; it
also plugs directly into composefs. It stays valid zstd, so unaware clients
just decompress normally (unlike eStargz, which plays the same trick inside
gzip for containerd's lazy-pull snapshotter — a Kubernetes fast-start tool,
not a bootc one).
Margine's skopeo copy push does not currently set
--dest-compress-format zstd:chunked; the delta efficiency comes from
rechunk's stable layer digests alone. The two compose — rechunk decides what
the blobs are, zstd:chunked makes each blob partially fetchable — and
zstd:chunked is the obvious next increment, since Fedora's own bootc base
images and the bootc client stack are converging on it.
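The partial-pull idea can be illustrated with a simplified table of contents. This sketch does not parse the real zstd:chunked format; TocEntry, its fields, and the sample offsets are assumptions chosen to show how a per-file TOC turns "download the whole blob" into "download these byte ranges".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TocEntry:
    path: str
    digest: str  # per-file content digest recorded in the TOC
    offset: int  # where the file's compressed bytes start in the blob
    length: int  # how many compressed bytes it occupies


def ranges_to_fetch(toc: list[TocEntry], local: set[str]) -> list[tuple[int, int]]:
    """Return merged [start, end) byte ranges for files not present locally."""
    ranges: list[tuple[int, int]] = []
    for entry in sorted(toc, key=lambda e: e.offset):
        if entry.digest in local:
            continue  # already in local storage: reuse it, skip the bytes
        start, end = entry.offset, entry.offset + entry.length
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # coalesce adjacent ranges
        else:
            ranges.append((start, end))
    return ranges


toc = [
    TocEntry("usr/bin/a", "d-a", 0, 100),
    TocEntry("usr/bin/b", "d-b", 100, 50),
    TocEntry("usr/lib/c", "d-c", 150, 200),
    TocEntry("usr/lib/d", "d-d", 350, 10),
]
local = {"d-b"}  # the client already stores a file with this digest

needed = ranges_to_fetch(toc, local)
fetched = sum(end - start for start, end in needed)
print(needed)   # [(0, 100), (150, 360)]
print(fetched)  # 310 of 360 bytes
```

Whole-layer reuse is the same calculation with a single entry per blob, which is why the two mechanisms compose.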
7.6 Alternatives & other distros
- hhd-dev/rechunk (Margine, Bazzite, Bluefin, Aurora): ostree-aware re-layering, stable chunks, deterministic output. Cost: an extra CI step on the order of minutes, root storage + disk staging, and it rewrites your manifest (labels must be re-declared, and /etc factory handling can surprise you; see §7.4).
- Plain Containerfile layers (early uBlue images, most homelab bootc derivatives): zero extra tooling, and the buildah cache works during builds, but every release re-downloads the fat RUN layers; fine for small images, painful past a few GB.
- rpm-ostree compose image / ostree container encapsulate (stock Fedora Silverblue, Kinoite, IoT, CoreOS): composes from a treefile and emits an OCI image with built-in package-aware chunking (capped layer count, files grouped by change frequency). It is the same idea as rechunk, but it requires owning the compose; it doesn't apply to a FROM-based derived build.
- Flatten to one layer (podman build --squash): simplest possible artifact, kills all reuse; every update is a full-image download. Only defensible for tiny images or air-gapped one-shot delivery.
- eStargz + stargz-snapshotter (containerd/k8s world): lazy pulls, so a container can start before the image finishes downloading. Solves container startup latency, not OS update deltas; no bootc integration.
- zstd:chunked (Fedora bootc base images, podman ecosystem direction): per-file TOC, partial pulls, composefs-friendly local dedup; complementary to rechunk rather than a replacement.
- ostree static deltas over plain HTTP (Endless OS, pre-OCI Fedora Atomic): server-precomputed binary deltas between commits. Excellent download efficiency, but you run an ostree repo server instead of reusing registry infrastructure.
- openSUSE MicroOS/Aeon: no image artifact at all; transactional-update installs RPMs into a new btrfs snapshot. Deltas are RPM-granular, but the result is assembled per-machine rather than tested-as-built.
- Vanilla OS (ABRoot v2): OCI images applied to A/B root partitions; registry-based like bootc but partition-image semantics, without ostree's file-level store dedup.
- ChimeraOS (frzr): full root images as btrfs-subvolume tarballs from GitHub releases; dead simple, every update is a full download.
- NixOS: sidesteps the problem. There is no monolithic image; the store path is the dedup unit and nix copy transfers only the store paths the target is missing. Finest granularity of the lot, at the price of an entirely different model.
7.7 Takeaways
- OCI layer digests are tar digests; rebuild churn is structural, not a buildah bug. Re-layer by content, not by RUN order.
- Rechunk earns its place twice in Margine: small weekly downloads, and ostree-canonical commits that made composefs boot timing match upstream (retiring two boot workarounds).
- It is also a manifest rewrite: re-declare labels, re-verify /etc factory behavior, and keep the smoke-boot gate (chapter 8) downstream of it. The artifact you test must be the post-rechunk one, and --preserve-digests promotion guarantees it's also the one users get.