@chaserhkj
Contributor

This PR fixes the broken garbage collection implementation provided by cfsctl. To my knowledge this GC implementation is not used anywhere else, but a fixed implementation could also be helpful for downstream projects like bootc; see bootc-dev/bootc#1808.

Previously, the GC implementation was completely broken: it never added any object IDs to the live set, so it always marked every object for deletion. The code in gc_category seems to assume an old composefs layout in which non-first-level entries in the streams/ and images/ directories (e.g. streams/refs/some_distro/some_version) link directly to the object store. Currently, though, these links always point to first-level entries, and the first-level entries in turn link to objects in the store.

This PR fixes this by doing a proper walk that adds all objects referenced by named references to the live set, so that all unlinked objects are marked for deletion.

Furthermore, the old implementation did a naive shallow walk of the streams. This is a problem for pulled OCI images in a composefs repo, since they have two levels of links: a config split stream links to all layers, and each layer links to its layer contents. This PR adds a full walk that descends the entire stream tree, so that unlinked objects are marked for deletion.
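To illustrate the kind of walk described above, here is a minimal sketch over an in-memory model. The `Stream` type, its fields, and the function names are hypothetical stand-ins rather than the composefs API, and the model ignores the on-disk layout entirely; it only shows a full reachability walk from the GC roots followed by a set difference.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory stand-in for a stream: the objects it references
/// directly, plus the names of other streams it links to.
#[derive(Default)]
pub struct Stream {
    pub objects: Vec<String>,
    pub stream_refs: Vec<String>,
}

/// Walk every stream reachable from `roots` and collect the live object IDs.
pub fn live_objects(streams: &HashMap<String, Stream>, roots: &[String]) -> HashSet<String> {
    let mut live = HashSet::new();
    let mut visited = HashSet::new();
    let mut pending: Vec<&str> = roots.iter().map(String::as_str).collect();

    while let Some(name) = pending.pop() {
        // Skip streams that were already walked (shared layers, cycles).
        if !visited.insert(name) {
            continue;
        }
        // A dangling reference is silently skipped in this sketch.
        let Some(stream) = streams.get(name) else {
            continue;
        };
        live.extend(stream.objects.iter().cloned());
        // Descend into linked streams instead of stopping at the first level.
        pending.extend(stream.stream_refs.iter().map(String::as_str));
    }
    live
}

/// Everything in `all_objects` that the walk did not reach is garbage.
pub fn garbage(all_objects: &HashSet<String>, live: &HashSet<String>) -> Vec<String> {
    let mut dead: Vec<String> = all_objects.difference(live).cloned().collect();
    dead.sort();
    dead
}
```

A shallow walk would stop after the root stream's own objects; the `pending` stack is what lets the walk reach layer contents behind a config stream.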

Currently this is still dry-run only, but I have changed the output format to add a "#" before all non-delete lines, so the output can now be piped to a shell to perform the deletion.

Note that bootc has its own GC implementation here. But bootc uses bootloader entries as part of the GC roots, which I believe should be considered out of scope for composefs. The bootc GC also does not prune streams. Ideally, I think we should have a complete GC implementation in composefs, and bootc should just forward the call.

@Johan-Liebert1
Collaborator

I think a lot of this will change once #185 lands. We'd probably want to wait until then.

@chaserhkj
Contributor Author

I briefly skimmed that PR. I think that as long as the split stream changes expose the same get_object_refs interface with sane semantics, the changes here will be fine. The stream-walking logic also remains valid as long as that interface returns the objects referenced by the new named references as well.

That being said, it's totally fair to delay this PR until a potentially dependent change as large as that one has landed. I can work on a rebase once it lands.

Collaborator

@cgwalters left a comment

We're not adding any tests here, but we definitely want them. Basic ones would be: ensuring that adding content and then removing all the streams also GCs all objects; and that with two streams sharing objects, removing just one keeps the right objects, etc.
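The two scenarios above could be sketched as unit tests along these lines. `ToyRepo` and its methods are hypothetical and exist only for this example; real tests would exercise the actual repository type against a temporary directory.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical toy repository: named streams that reference object IDs.
#[derive(Default)]
pub struct ToyRepo {
    streams: BTreeMap<String, BTreeSet<u32>>,
    objects: BTreeSet<u32>,
}

impl ToyRepo {
    /// Store the objects and register a stream that references them.
    pub fn add_stream(&mut self, name: &str, objects: &[u32]) {
        self.objects.extend(objects.iter().copied());
        self.streams
            .insert(name.to_string(), objects.iter().copied().collect());
    }

    /// Drop the stream entry; its objects stay until the next `gc`.
    pub fn remove_stream(&mut self, name: &str) {
        self.streams.remove(name);
    }

    /// Delete every object that no remaining stream references.
    pub fn gc(&mut self) {
        let live: BTreeSet<u32> = self.streams.values().flatten().copied().collect();
        self.objects.retain(|id| live.contains(id));
    }

    /// Remaining object IDs in ascending order.
    pub fn objects(&self) -> Vec<u32> {
        self.objects.iter().copied().collect()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn removing_all_streams_collects_all_objects() {
        let mut repo = ToyRepo::default();
        repo.add_stream("a", &[1, 2]);
        repo.remove_stream("a");
        repo.gc();
        assert!(repo.objects().is_empty());
    }

    #[test]
    fn shared_objects_survive_while_one_stream_remains() {
        let mut repo = ToyRepo::default();
        repo.add_stream("a", &[1, 2]);
        repo.add_stream("b", &[2, 3]);
        repo.remove_stream("a");
        repo.gc();
        // Object 1 was only referenced by "a"; 2 is shared, 3 belongs to "b".
        assert_eq!(repo.objects(), vec![2, 3]);
    }
}
```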

@cgwalters
Collaborator

Also, sorry, to be clear: the above review was done using the newly landed https://github.com/bootc-dev/agent-skills/blob/main/perform-forge-review/SKILL.md flow, and it somehow seems to have lost my header comment? I'll look into that.

@chaserhkj
Contributor Author

I rebased the branch onto main after the splitstream changes. It seems that in the new format, manifests refer to other streams through a separate table, stream_refs. This slightly changes how things should be handled in my GC implementation. I'll work those changes in first, then address all the review comments and add unit tests. I also have a few more GC improvements on another branch that I feel fall within the same scope, so I might pull those in as well.

@chaserhkj force-pushed the gc-fix branch 2 times, most recently from a6cefcf to 56c2597 (January 13, 2026 04:34)
Signed-off-by: Chaser Huang <huangkangjing@gmail.com>
@chaserhkj force-pushed the gc-fix branch 2 times, most recently from ec9f7e5 to 604907a (January 13, 2026 06:05)
Add support for user-specified GC Roots
Add cleanup of broken links after GC
Use log crate to properly log in GC

Signed-off-by: Chaser Huang <huangkangjing@gmail.com>
@chaserhkj force-pushed the gc-fix branch 4 times, most recently from 07e0787 to 87a7a1a (January 13, 2026 18:47)
Fixes not marking named reference stream objects themselves as live

Signed-off-by: Chaser Huang <huangkangjing@gmail.com>
@chaserhkj
Contributor Author

Done. I have included some improvements to the GC code as well: the GC process can now take caller-specified GC roots and removes broken links in the repo. cfsctl interfaces have been added, and --force now actually performs the removal. A full test suite is also included, covering behavior for streams, images, and streams using named references. I feel the code at this stage should be robust enough.

PTAL @cgwalters

@chaserhkj requested a review from cgwalters (January 13, 2026 18:57)
@chaserhkj
Contributor Author

chaserhkj commented Jan 13, 2026

Hold on, I found another bug while testing against an actual bootc repository.
I assumed the file names in streams/ would be the same as the name field in the named reference table, but apparently they are not. So the stream name map might still be needed.

Also includes regression test for these cases

Signed-off-by: Chaser Huang <huangkangjing@gmail.com>
@chaserhkj
Contributor Author

Should be good now. I also included a regression test for the bug mentioned above.

Collaborator

@cgwalters left a comment

Thanks for this!! Sorry about the delay. I only have a relatively superficial review so far

GC {
// digest of root images for gc operations
#[clap(long, short = 'i')]
root_images: Vec<String>,
Collaborator

Wait, why would these be provided externally? Are they in addition to the images present?

Contributor Author

Yes, these are added to the list of GC roots used for the GC analysis. The existing internal references (in streams/refs/ and images/refs/) are always considered and added to the GC root list.

The main reason for allowing additional external roots is that we don't have a concrete standard for partitioning the */refs directories into namespaces for different users of composefs. Even bootc does not call composefs in a way that would create these references. So it is better to assume that the */refs references are not enough and leave it to the user to supply all sensible roots. Such a standard/scheme is particularly relevant to the unified containers storage approach, and we will probably need to settle on one later.

Besides, it's really handy to be able to specify additional roots for testing, management, and diagnosis.

}

-fn walk_symlinkdir(fd: OwnedFd, objects: &mut HashSet<ObjectID>) -> Result<()> {
+fn walk_symlinkdir(fd: OwnedFd, entry_digests: &mut HashSet<CString>) -> Result<()> {
Collaborator

I think we should prefer Rust-native representations in memory.

Contributor Author

Now uses OsString

Contributor Author

I am not sure whether we want String or OsString here. Since the hash set stores digest strings, they should always be valid UTF-8, so the conversion should never actually fail. But the UTF-8 conversion feels redundant anyway. Let me know your preference.
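For reference, the conversion in question can be done with the standard library's `OsString::into_string`, which returns the original `OsString` as the error when the bytes are not valid UTF-8. The sketch below is only an illustration of that trade-off; `split_utf8` is a hypothetical helper, not code from this PR.

```rust
use std::collections::HashSet;
use std::ffi::OsString;

/// Convert directory entry names to `String`, setting aside any name that is
/// not valid UTF-8 instead of failing the whole walk.
pub fn split_utf8(names: Vec<OsString>) -> (HashSet<String>, Vec<OsString>) {
    let mut digests = HashSet::new();
    let mut rejected = Vec::new();
    for name in names {
        match name.into_string() {
            Ok(digest) => {
                digests.insert(digest);
            }
            // `into_string` hands back the untouched `OsString` on failure.
            Err(raw) => rejected.push(raw),
        }
    }
    (digests, rejected)
}
```

Keeping `OsString` in the set avoids this step entirely; converting once at the boundary makes later comparisons against digest strings simpler.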

)?;
} else {
// stream is in table but not in repo, the repo is potentially broken, issue a warning
warn!("broken repo: named reference stream {stream_name_in_table} not found as stream in repo");
Collaborator

I'm OK with this, but in general I would say we should stick to debug! and trace!; warn! is likely to be routed to stderr, which might not be expected in all places.

It's complicated but we could have APIs like this return something like a Vec<Warning> or so. There may be prior art for this.
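A rough sketch of that idea, with a hypothetical `GcWarning` type and helper function that are not part of composefs: the function collects problems and returns them, and the caller decides whether to log, print, or ignore them.

```rust
use std::collections::HashSet;

/// Hypothetical warning type, used only for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum GcWarning {
    /// A stream named in a reference table has no entry in the repository.
    MissingStream(String),
}

/// Split the names from a reference table into streams that exist and
/// warnings for those that do not. Nothing is logged here.
pub fn resolve_named_refs(
    table: &[String],
    streams_in_repo: &HashSet<String>,
) -> (Vec<String>, Vec<GcWarning>) {
    let mut found = Vec::new();
    let mut warnings = Vec::new();
    for name in table {
        if streams_in_repo.contains(name) {
            found.push(name.clone());
        } else {
            warnings.push(GcWarning::MissingStream(name.clone()));
        }
    }
    (found, warnings)
}
```

A CLI caller could print each warning to stderr, while a library consumer could surface them through its own reporting channel.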

&self,
root_image_names: Vec<String>,
root_stream_names: Vec<String>,
force: bool,
Collaborator

I think force could use a comment

-    pub fn gc(&self) -> Result<()> {
+    pub fn gc(
+        &self,
+        root_image_names: Vec<String>,
Collaborator

Pass slices not Vec to functions
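As a sketch of what that could look like, the function below borrows the root names as slices, so callers keep ownership of their vectors. `GcRequest` and `plan_gc` are hypothetical names for illustration, and the doc comment on `force` reflects the dry-run behavior described earlier in this thread rather than the actual method's documentation.

```rust
/// Summary of a GC request; a hypothetical type used only for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct GcRequest {
    pub roots: Vec<String>,
    pub dry_run: bool,
}

/// Build a GC request from borrowed root names.
///
/// `force`: when `false`, the run is a dry run that only reports what would
/// be removed; when `true`, unreferenced objects are actually deleted.
pub fn plan_gc(
    root_image_names: &[String],
    root_stream_names: &[String],
    force: bool,
) -> GcRequest {
    // Clone only here, where owned names are needed for the result.
    let roots = root_image_names
        .iter()
        .chain(root_stream_names)
        .cloned()
        .collect();
    GcRequest {
        roots,
        dry_run: !force,
    }
}
```

A caller holding a `Vec<String>` can pass `&images` directly, since `&Vec<String>` coerces to `&[String]`.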


3 participants