Skip to content

Add new abstract methods for reporting source provenance#1997

Merged
gtristan merged 11 commits intomasterfrom
tristan/sboms
May 6, 2025
Merged

Add new abstract methods for reporting source provenance#1997
gtristan merged 11 commits intomasterfrom
tristan/sboms

Conversation

@gtristan
Copy link
Contributor

@gtristan gtristan commented Mar 4, 2025

This is an initial implementation for the proposal up at: https://lists.apache.org/thread/q6gxjpld2vb1c9rqlsv24m12c087snc4

Some thoughts about this approach:

  • This prioritizes machine readability and standardization of source provenance and version information

    Sources have a lot of freedom in how they implement things, and so we may very well need to expand on the types and constants added here, such as SourceInfoMedium, SourceVersionType, etc.

    The idea here is to have greater certainty about how sources are obtained, even if this cannot be covered by all currently existing source implementations (e.g. I didn't initially add a bzr medium for which we have a plugin, or a cvs medium for which we do not yet have a plugin).

    Aspirationally, forcing this data to be precise can allow adjacent tooling to do useful things.

  • This drops the freeform "public data" mentioned in the proposal discussion

    My rationale for this choice in this branch, is that ultimately we want a data with a constant shape, and if for examlpe, we want the user to be able to override or assist a source with determining the reported "version", then the Source implementation already has everything it needs to do so:

    • it can add additional configuration keys for the user to configure
    • it has the power to implement collect_source_info() however it wants.
  • This does not yet attempt to cover the concept of tracking information

    I would like to consider this, but we should think carefully about how this can be useful. For instance, some git plugins have different interpretations of what their "tracking" strings mean, sometimes following a branch head, sometimes looking for the latest tag in history which matches a given regular expression.

    If we export this tracking information, it should probably be useful for external tooling to figure out how to do the tracking and come to the same conclusion, otherwise it is unclear what this is useful for.

  • This does not cover the CVE information

    While the SourceInfo objects representing a source's provenance is a list, I believe that the CVE information continues to be a per-element concept.

    For example, when we have applied security patches to a module, those security patches are, themselves, sources, with provenance of being revisioned in the local project

@nanonyme
Copy link
Contributor

We have been having the challenge with bst-to-lorry that we're missing a way for source to explicitly state relevant refspecs for git. Heuristically computed once are commonly incomplete or even wrong. So if there's API, it really should take this into account. The problem is that commonly in git repositories there are branches that need to be ignored (like pull request branches) since they are routinely force pushed to. Also with git_module it's common that upstream doesn't declare at all what the branch is from which commit originated.

@shymega
Copy link

shymega commented Mar 27, 2025

In terms of implementing this into plugins that currently use the internal API, would someone be able to write up an example plugin that uses the new API - I'm a little lost on the implementation semantics...

@gtristan gtristan force-pushed the tristan/sboms branch 3 times, most recently from e551c42 to ea856b1 Compare March 30, 2025 09:00
@gtristan gtristan changed the title Add new abstract methods for reporting source provenance WIP: Add new abstract methods for reporting source provenance Mar 30, 2025
@gtristan
Copy link
Contributor Author

gtristan commented Mar 30, 2025

I've been testing this locally with freedesktop-sdk master and the sister merge requests applied:

And running:

bst show --format $'%{name}\n%{source-info}'  platform.bst

There are still more plugins to implement, notably we can have version guessing on the git_tag or similar community plugins which encode more context than just the commit sha into the ref.

The cargo2 plugin works, but only if you correct gstreamer-plugins-rs.bst to set the git-mirrors parameter (that whole thing looks like it needs support from buildstream to work with the project aliases automatically... there is probably an issue open about that).

@gtristan
Copy link
Contributor Author

gtristan commented Mar 31, 2025

In terms of implementing this into plugins that currently use the internal API, would someone be able to write up an example plugin that uses the new API - I'm a little lost on the implementation semantics...

See the core plugins in this PR (separate commits), or the changes in the sister PRs

Copy link
Contributor

@abderrahim abderrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm starting to like this, I left a few comments on how I think we can improve it.

@shymega
Copy link

shymega commented Apr 10, 2025

apache/buildstream-plugins/pull/870

@gtristan This should be apache/buildstream-plugins#87, not https://github.com/apache/buildstream-plugins/pull/870.

@gtristan
Copy link
Contributor Author

gtristan commented Apr 19, 2025

@abderrahim, @juergbi

Some additional thoughts now that I get down to business...

Abolish the whole freeform string entirely

In our meeting this week, we discussed that we would prefer to keep the flexibility and not require that SourceInfoMedium and SourceVersionType enumeration values be strictly followed (resulting in the requirement of many distinct values being introduced into the core)...

And we discussed that we would instead prefer freeform strings for these attributes of the SourceInfo object, and that we should:

  • Automatically encode the plugin "kind" name into the SourceInfo reported
  • Provide a machine readable way (e.g. via bst show) to provide the plugin provenance information for given plugin kinds
  • Have the deciding information about the meaning of the SourceInfo.version string (e.g. is it a git sha or is it a sha256sum of a downloaded file or what), documented in the governing source plugin documentation individually.

Ok, this all makes sense, however now that I look at it again, I no longer see any need whatsoever of having the SourceInfoMedium or SourceVersionType information present at all in the SourceInfo object.

Instead we only need the following members:

  • kind: Plugin "kind" string
  • url: Unaliased upstream full URL (not mirrored URL)
  • version: The machine readable precise version (e.g. git sha, CAS/docker digest, sha256sum, etc)
  • version_guess: The plugin assisted guess at what the human readable version is

Given that we can obtain the precise plugin and version for a given kind, and that that governing plugin documentation must already document precisely what it means in the version member, the whole concept of the medium and version_type are obsolete and essentially meaningless.

Automate the plugin specific documentation

As a separate thought, I feel like the need to associate the governing documentation of what a version means with a given plugin kind is overly onerous.

I think that while this can continue to be true, we could make this easier in the following way:

  • Source plugins must document the meaning of the version in their Source.collect_source_info() implementation's docstring
  • BuildStream can then inspect that docstring at runtime and provide a way to also report that docstring in some machine readable way.

This approach would allow, from both a collect_manifest plugin perspective, and from a CLI wrapper script perspective, to generate reports which both:

  • Have a section including the governing plugin documentation about the version for each plugin "kind" in the context of that pipeline
  • Have the SourceInfo members for each Source instance, including the plugin "kind" name in the context of that pipeline

@abderrahim
Copy link
Contributor

Ok, this all makes sense, however now that I look at it again, I no longer see any need whatsoever of having the SourceInfoMedium or SourceVersionType information present at all in the SourceInfo object.

Instead we only need the following members:

* `kind`: Plugin "kind" string

* `url`: Unaliased upstream full URL (not mirrored URL)

* `version`: The machine readable _precise_ version (e.g. git sha, CAS/docker digest, sha256sum, etc)

* `version_guess`: The plugin assisted _guess_ at what the human readable version is

I feel this pushes it too far in the opposite direction. This assumes a single mapping between a plugin kind and a (medium, version_type) pair, which isn't always the case. For instance the cargo2 plugin can use both tarballs and git repositories. Also plugins like git, git_repo and git_module are very similar in what they provide and having a user of the API need to know all the different plugins seems unweildy.

The main goal behind having an API like this is to reduce the amount of knowledge that a user of the API needs about specific plugins.

Given that we can obtain the precise plugin and version for a given kind, and that that governing plugin documentation must already document precisely what it means in the version member, the whole concept of the medium and version_type are obsolete and essentially meaningless.

"We" as humans, sure. But as a script using the API, having a well defined meaning for the version field and a standardized way to define things is very important. Even if a new value comes up later, then depending on the use case the script could output a warning or outright fail. But having it fail on every new plugin isn't scalable.

@gtristan
Copy link
Contributor Author

gtristan commented Apr 19, 2025

[...]

I feel this pushes it too far in the opposite direction. This assumes a single mapping between a plugin kind and a (medium, version_type) pair, which isn't always the case. For instance the cargo2 plugin can use both tarballs and git repositories. Also plugins like git, git_repo and git_module are very similar in what they provide and having a user of the API need to know all the different plugins seems unweildy.

Yeah, I was just now thinking about this one source with multiple source infos... its unfortunate :-/

For the different flavors of git, I'm less concerned, they could all be made to behave the same way, and in any case, they would probably anyway need to document what they mean (even if they share a freeform string) - if not, they are just taking advantage of a documented convention in upstream BuildStream (although that is still nasty because it makes BuildStream know what git is).

Given that we can obtain the precise plugin and version for a given kind, and that that governing plugin documentation must already document precisely what it means in the version member, the whole concept of the medium and version_type are obsolete and essentially meaningless.

"We" as humans, sure. But as a script using the API, having a well defined meaning for the version field and a standardized way to define things is very important. Even if a new value comes up later, then depending on the use case the script could output a warning or outright fail. But having it fail on every new plugin isn't scalable.

While this point is now moot due to the preceding point, I would have agrued that it is in fact humans which need to know "What to do with this specific plugin kind" when encountering one programatically.

This is the same kind of documentation that will be needed for representing what the meaning of these free form strings are.

Do you think there is any reason to have 2 different strings (one for the medium and one for the version) ? Or would it be sufficient to have a single string to define how a given SourceInfo instance behaves ?

@gtristan
Copy link
Contributor Author

[...]

Do you think there is any reason to have 2 different strings (one for the medium and one for the version) ? Or would it be sufficient to have a single string to define how a given SourceInfo instance behaves ?

I came to the following conclusions:

  • I liked that we had strong types, it helps how things are documented
  • I liked that we had this separation of "where it comes from" and "what the version means", as stated before on this issue, this reduces the need for a lot of redundant enum values
  • We do still have a need for third party plugin developers to use freeform strings, as discussed last week

I've updated the branch to use Union[SourceVersionType, str] and Union[SourceInfoMedium, str] types and reverted to the original implementation with this change, this seems to be a happy compromise.

@gtristan gtristan force-pushed the tristan/sboms branch 5 times, most recently from 14f8dfc to ca338d5 Compare April 23, 2025 06:37
@gtristan
Copy link
Contributor Author

gtristan commented Apr 25, 2025

This is now ready, documentation and testing and all, along with apache/buildstream-plugins#87 which is also ready to the same degree.

And so is: https://gitlab.com/BuildStream/buildstream-plugins-community/-/merge_requests/364

@gtristan gtristan force-pushed the tristan/sboms branch 6 times, most recently from 09dedc2 to 5a95a84 Compare April 28, 2025 11:53
Copy link
Contributor

@abderrahim abderrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the doucmentation, but otherwise looks good.

@gtristan gtristan force-pushed the tristan/sboms branch 5 times, most recently from 454610e to fc11127 Compare May 3, 2025 07:13
Copy link
Contributor

@abderrahim abderrahim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's land this.

gtristan and others added 11 commits May 5, 2025 14:57
This is an opaque public data structure passed through SourceFetcher.fetch()
which, if provided, must be used when invoking Source.translate_url()
This comes with some data classes to describe source provenance
and versioning information.
This shows a list of all source provenance information for a given
element, formatted in YAML.
The new man pages now include the bst version in them, and
the documentation about generating man pages is also updated
to specify that one should specify `tox -e man -- <version>`.
@gtristan gtristan merged commit 47923ca into master May 6, 2025
17 checks passed
@gtristan gtristan deleted the tristan/sboms branch May 6, 2025 07:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants