
buildelement: Add the digest-environment config property #1991

Closed

Kekun wants to merge 4 commits into apache:master from Kekun:add-digest-environment

Conversation

@Kekun Kekun commented Feb 5, 2025

This allows setting an environment variable inside the sandbox to the CAS
digest of one or more dependencies.

Fixes #1984

@Kekun Kekun changed the title from "buildelement: Add the digest-environment config property" to "Draft: buildelement: Add the digest-environment config property" on Feb 5, 2025
@Kekun Kekun marked this pull request as draft February 10, 2025 09:06
@Kekun Kekun changed the title from "Draft: buildelement: Add the digest-environment config property" to "buildelement: Add the digest-environment config property" on Feb 10, 2025
@Kekun Kekun force-pushed the add-digest-environment branch from b5aff04 to c55e109 on February 10, 2025 09:28
@Kekun Kekun force-pushed the add-digest-environment branch 2 times, most recently from ba8d175 to f822e69 on February 18, 2025 10:35
@Kekun Kekun force-pushed the add-digest-environment branch from f822e69 to 9ccd8e5 on June 2, 2025 09:30
@abderrahim abderrahim force-pushed the add-digest-environment branch 2 times, most recently from 938e18c to e81dc2f on June 17, 2025 16:06
        sandbox.set_work_directory(command_dir)

    def stage(self, sandbox):
        # Setup environment
Contributor

I had to move this from configure_sandbox() to stage() because BuildStream wouldn't let me stage dependency artifacts outside of stage().
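For reference, a rough sketch of the shape this takes once it lives in stage(), pieced together from the fragments quoted in this review and simplified. element_list, digest_variable and env are taken to be state the plugin derives from its dependency configuration, and the usual BuildElement staging is omitted; this is not the PR's actual diff.

    def stage(self, sandbox):
        # ... normal BuildElement staging of sources and dependencies omitted ...

        # Stage the configured dependencies into a throwaway sandbox purely to
        # learn the CAS digest of the resulting tree, then expose that digest
        # through the configured environment variable.
        dummy_sandbox = SandboxDummy(
            self._get_context(), self._get_project(), plugin=self, stdout=None, stderr=None, config={}
        )
        self.stage_dependency_artifacts(dummy_sandbox, element_list)
        digest = dummy_sandbox.get_virtual_directory()._get_digest()
        env[digest_variable] = "{}/{}".format(digest.hash, digest.size_bytes)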

Comment on lines 305 to 306
dummy_sandbox = SandboxDummy(
self._get_context(), self._get_project(), plugin=self, stdout=None, stderr=None, config={}
)
Contributor

I wonder if we should try to have a nicer way to create a dummy sandbox

self._get_context(), self._get_project(), plugin=self, stdout=None, stderr=None, config={}
)
self.stage_dependency_artifacts(dummy_sandbox, element_list)
digest = dummy_sandbox.get_virtual_directory()._get_digest()
Contributor

The linter complains here because _get_digest() is only defined for CasBasedDirectory (not the base Directory class) and get_virtual_directory() is advertised to return a Directory.

What would be the best way forward?

Contributor

Would cast work if you know here that you're actually dealing with the specific subtype?
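Something like this, perhaps (a sketch only; CasBasedDirectory lives in a private module, so the import path is an assumption and itself sidesteps the public API):

    from typing import cast

    from buildstream.storage._casbaseddirectory import CasBasedDirectory  # private module, path assumed

    # Purely a static-typing hint for the checker; no runtime check is performed.
    vdir = cast(CasBasedDirectory, dummy_sandbox.get_virtual_directory())
    digest = vdir._get_digest()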

dummy_sandbox = SandboxDummy(
self._get_context(), self._get_project(), plugin=self, stdout=None, stderr=None, config={}
)
self.stage_dependency_artifacts(dummy_sandbox, element_list)
Contributor

In the case that we are staging elements in element_list which don't depend on each other, I don't believe that Element.stage_dependency_artifacts() will do any sorting.

In the case that there are overlaps in the resulting staged trees, I am worried that the resulting CAS digest may differ depending on the order in which the dependencies are declared in the BuildStream .bst element.

Also, there is a concern that the overlaps may not turn out to be the same as when staging the element's overall dependencies.

I think abderrahim also noted the sandbox staging seems a bit weird, or could be done nicer.

I don't believe it should be algorithmically necessary to do the actual staging in order to calculate what the resulting CAS digest would be; there should probably be a way to support this better in the CasBasedDirectory code without doing an actual staging process.

Contributor

In the case that we are staging elements in element_list which don't depend on each other, I don't believe that Element.stage_dependency_artifacts() will do any sorting.

I find that statement weird, given that stage_dependency_artifacts() is advertised to do just that.

I would also argue that if this is true, then it is orthogonal to the changes in this pull request. There is no difference between

filename:
- a.bst
- b.bst
- c.bst
config:
  location: /sysroot

and

filename:
- a.bst
- b.bst
- c.bst
config:
  digest-environment: DEPENDENCIES_CAS_DIGEST

in the current code, except for the fact that the former stages in a subdirectory of the build sandbox and the latter stages in a dummy sandbox.

I don't believe it should be algorithmically necessary to do the actual staging in order to calculate what the resulting CAS digest would be; there should probably be a way to support this better in the CasBasedDirectory code without doing an actual staging process.

Well, it depends. Currently an element plugin has no way to create a CasBasedDirectory using public API. The only way to get a CasBasedDirectory is through Sandbox.get_virtual_directory() (and even that advertises it returns a Directory base class). It also can't get the artifact contents of an element as a Directory: it only has access to Element.stage_artifact(). While this BuildElement is in core and can technically use private APIs, I tried to avoid going that route.

I think the difference we need to consider between the two approaches is whether or not to consider the dependencies? i.e. if I depend on gcc, is your expectation that this only gets the digest of the gcc artifact or should it be gcc and its runtime dependencies?

Contributor

@gtristan gtristan Jun 20, 2025

@abderrahim writes:

In the case that we are staging elements in element_list which don't depend on each other, I don't believe that Element.stage_dependency_artifacts() will do any sorting.

I find that statement weird, given that stage_dependency_artifacts() is advertised to do just that.

Right, this text could be clarified further; the function's documentation says:

"This is primarily a convenience wrapper around Element.stage_artifact() which takes care of staging all the dependencies in staging order and issueing the appropriate warnings."

Which is not false: it does stage all of the specified elements, and it does so in staging order. However, if you look at what the code does, it does not do any toplevel sorting of the input element sequence.

I would also argue that if this is true, then it is orthogonal to the changes in this pull request. There is no difference between:

depends:
- a.bst
- b.bst
- c.bst
  config:
    location: /sysroot

and

depends:
- a.bst
- b.bst
- c.bst
  config:
    digest-environment: DEPENDENCIES_CAS_DIGEST

in the current code, except for the fact that the former stages in a subdirectory of the build sandbox and the latter stages in a dummy sandbox.

I think you are missing my meaning with your example.

So here is my hypothesis: given your a, b, c example, let us assume that the toplevel element e in question depends on b and c, which may also depend on a, and that b and c install the same file.

e.g.:

            e
          /   \
         b     c  <--- both "b" and "c" install /etc/abc.conf
          \   / 
            a

What I am saying is that, with this implementation, the value of DEPENDENCIES_CAS_DIGEST may differ in the following two configurations.

Configuration 1

Here we stage element e deterministically into the sandbox, and stage c's version of /etc/abc.conf over b's version of /etc/abc.conf in an orthogonal, extra stage operation which we use to determine the CAS digest DEPENDENCIES_CAS_DIGEST.

depends:
- a.bst
- b.bst
  config:
    digest-environment: DEPENDENCIES_CAS_DIGEST
- c.bst
  config:
    digest-environment: DEPENDENCIES_CAS_DIGEST

Configuration 2

Here we do the same as configuration 1, except that b's version of /etc/abc.conf is staged on top of c's version of the same file, resulting in a different digest.

depends:
- a.bst
- c.bst
  config:
    digest-environment: DEPENDENCIES_CAS_DIGEST
- b.bst
  config:
    digest-environment: DEPENDENCIES_CAS_DIGEST

A closer look

The body of Element.stage_dependency_artifacts() is doing the following:

with self._overlap_collector.session(action, path):
    for dep in self.dependencies(selection):
        dep._stage_artifact(sandbox, path=path, include=include, exclude=exclude, orphans=orphans, owner=self)

And, looking at Element.dependencies(), we can quickly see that when the selection argument is given to select "Subsets of the dependency graph", the self element is not considered at all in the algorithm.

This means that the overall deterministic staging order of element e is not considered when looping over the dependencies of the selection argument given to Element.stage_dependency_artifacts().

Arguably this algorithm in Element.stage_dependency_artifacts() needs to be made more deterministic in this regard, although I wouldn't necessarily say that the docs are lying here: the dependencies are yielded in staging order relative to each other, but they are not yielded in the staging order of the given element.
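As an illustration only (a hypothetical sketch, not the actual fix), "more deterministic" could mean ordering the selected subset by the element's overall staging order rather than iterating the selection as given:

    # Hypothetical sketch: stage a selected subset in the staging order of the
    # whole element, rather than in the order the selection was declared.
    selected = set(selection)
    for dep in self.dependencies():
        if dep in selected:
            dep._stage_artifact(sandbox, path=path, include=include, exclude=exclude, orphans=orphans, owner=self)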

And so I come back to your original argument:

I would also argue that if this is true, then it is orthogonal to the changes in this pull request.

I have a point of contention here.

I have not yet looked at whether the location dependency configuration suffers from this issue as well. However, if it does, it needs to be fixed, and I also do not want to go as far as landing this change in the knowledge that this is indeed an issue.

Prove me wrong :)

There may be other ways that I am wrong about this bug, but I can think of one possibility:

It is possible that, in the Element loading/instantiation process, the dependency configurations are exposed to the plugin in staging order rather than in the order in which they are listed in the element declaration (.bst file). If this is the case, Element.dependencies() may be sufficient for both of these use cases.

It would however be good to add a test with this merge request, asserting that we get the same CAS digest environment variable using both orderings of element e in my overlapping example above.
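A sketch of such a test (hypothetical element names; assuming the usual cli/datafiles test fixtures, whose import path may differ):

    import os
    import pytest

    # Assumed import of BuildStream's test harness fixture; the exact path may differ.
    from buildstream._testing.runcli import cli  # pylint: disable=unused-import

    DATA_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), "project")


    @pytest.mark.datafiles(DATA_DIR)
    def test_digest_environment_is_order_independent(cli, datafiles):
        project = str(datafiles)
        digests = []

        # target-bc.bst and target-cb.bst are hypothetical elements depending on
        # the overlapping b.bst/c.bst pair in opposite orders, each with
        # digest-environment: DEPENDENCIES_CAS_DIGEST, and whose build command
        # writes "$DEPENDENCIES_CAS_DIGEST" to a digest.txt file in the install root.
        for target, checkout in (("target-bc.bst", "checkout-bc"), ("target-cb.bst", "checkout-cb")):
            result = cli.run(project=project, args=["build", target])
            result.assert_success()

            checkout_dir = os.path.join(project, checkout)
            result = cli.run(project=project, args=["artifact", "checkout", target, "--directory", checkout_dir])
            result.assert_success()

            with open(os.path.join(checkout_dir, "digest.txt"), encoding="utf-8") as f:
                digests.append(f.read().strip())

        # The declared order of the overlapping dependencies must not change the digest.
        assert digests[0] == digests[1]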

Contributor

It would however be good to add a test with this merge request, asserting that we get the same CAS digest environment variable using both orderings of element e in my overlapping example above.

I agree with this. I was just pointing out that if this behaviour is non-deterministic, then it's a pitfall that also applies to all plugins that make use of Element.stage_dependency_artifacts() and is not something specific to this use case.

Contributor

It would however be good to add a test with this merge request, asserting that we get the same CAS digest environment variable using both orderings of element e in my overlapping example above.

I agree with this. I was just pointing out that if this behaviour is non-deterministic, then it's a pitfall that also applies to all plugins that make use of Element.stage_dependency_artifacts() and is not something specific to this use case.

Maybe even worth regarding this as a bug (design flaw) that should have its own ticket.

@abderrahim abderrahim force-pushed the add-digest-environment branch from e81dc2f to 75cc973 on June 18, 2025 10:11
@gtristan
Contributor

gtristan commented Jun 20, 2025

I'll open a completely separate thread to follow up on the topic of "doing a whole stage" (I probably should have made that comment separately initially).

@abderrahim writes:

Well, it depends. Currently an element plugin has no way to create a CasBasedDirectory using public API. The only way to get a CasBasedDirectory is through Sandbox.get_virtual_directory() (and even that advertises it returns a Directory base class). It also can't get the artifact contents of an element as a Directory: it only has access to Element.stage_artifact(). While this BuildElement is in core and can technically use private APIs, I tried to avoid going that route.

I should point out that SandboxDummy is declared in sandbox/_sandboxdummy.py and not exposed in the toplevel __init__.py, and so it is not a public API; I'm not sure there is a sensible reason to make this public.

It does appear in sandbox/__init__.py which may strictly speaking be a bug on our part, as it may mislead people into thinking it is public.

I think the difference we need to consider between the two approaches is whether or not to consider the dependencies? i.e. if I depend on gcc, is your expectation that this only gets the digest of the gcc artifact or should it be gcc and its runtime dependencies?

I think this definitely needs to consider dependencies, although I'm not sure how that is related to using a SandboxDummy.

Frankly, the use of SandboxDummy is not a huge deal breaker for me if it really saves you some lines of code; my concerns about using it are:

  • Bringing in unrelated lines of code, i.e. the use of a sandbox object
    • Is it possible that sharing code in this way may cause sandbox-related changes to break this feature?
  • The possibility that those orthogonal lines of code may incur an unnecessary runtime overhead
    • Does this increase I/O in the calculation of the digest-environment?

@abderrahim
Contributor

I should point out that SandboxDummy is declared in sandbox/_sandboxdummy.py and not exposed in the toplevel __init__.py, and so it is not a public API; I'm not sure there is a sensible reason to make this public.

It does appear in sandbox/__init__.py which may strictly speaking be a bug on our part, as it may mislead people into thinking it is public.

I understand it's not intended to be public, and I commented about it earlier. However, my main point is that the only public way to gain access to an element's artifact is through Element.stage_artifact() (and Element.stage_dependency_artifacts()), which depends on having a sandbox.

I feel that adding an API to expose a dummy sandbox is easier than trying to figure out what part of self.__artifact.get_files() to expose. Do we expose the Artifact class, or just the files virtual directory?

I think the difference we need to consider between the two approaches is whether or not to consider the dependencies? i.e. if I depend on gcc, is your expectation that this only gets the digest of the gcc artifact or should it be gcc and its runtime dependencies?

I think this definitely needs to consider dependencies, although I'm not sure how that is related to using a SandboxDummy.

Element.stage_dependency_artifacts() is the easiest (and I believe the recommended) way to get an element's artifact including its dependencies. And that requires a sandbox. It doesn't have to be a dummy sandbox, but a dummy sandbox is already used elsewhere as a fallback when a "real" sandbox is not available.

Frankly, the use of SandboxDummy is not a huge deal breaker for me if it really saves you some lines of code; my concerns about using it are:

  • Bringing in unrelated lines of code, i.e. the use of a sandbox object
    • Is it possible that sharing code in this way may cause sandbox-related changes to break this feature?
  • The possibility that those orthogonal lines of code may incur an unnecessary runtime overhead
    • Does this increase I/O in the calculation of the digest-environment?

I don't think we need to worry about any of these. When staging, the Sandbox object is only used for Sandbox.get_virtual_directory(). The only unrelated lines of code that get executed are Sandbox.__init__() and SandboxDummy.__init__(), which are only doing attribute assignment. I don't think the runtime overhead is significant, and there is no extra I/O involved compared to staging directly into a virtual (CAS-based) directory.

In an ideal world, I'd prefer that Element.stage_directory() take a Directory rather than a Sandbox. Then for this MR, I'd want a way for a plugin to create a cas-based Directory instance. But we're stuck with Element.stage_directory() taking a Sandbox for now.

I'm happy to do it differently if you can think of a better way of doing it.

@gtristan
Contributor

In an ideal world, I'd prefer that Element.stage_directory() take a Directory rather than a Sandbox. Then for this MR, I'd want a way for a plugin to create a cas-based Directory instance. But we're stuck with Element.stage_directory() taking a Sandbox for now.

Right, I think you mean Element.stage_artifact(), and yeah it might be cool if this was Directory based.

On the surface, I feel mostly concerned with the overhead of staging the artifact. At first glance this looks like an activity which involves I/O, when all we are actually looking for is the CAS digest the root Directory would have if this set of artifacts were to be staged.

Looking at the operation, I see that we're essentially using CasBasedDirectory._import_files_internal(Directory, ...). For a CasBasedDirectory being imported, this essentially ends up in CasBasedDirectory.__partial_import_cas_into_cas().

While the code around here is a bit tricky to follow, I'm not really seeing any I/O; instead, what generally happens during a build is:

  • The Sandbox creates its vdir
  • Elements stage artifacts into that vdir
  • This process only plays with data structures in the vdir's CasBasedDirectory object hierarchy
  • SandboxREAPI._run() then serializes the in-memory CasBasedDirectory into protobufs and delegates this to SandboxREAPI._execute_action(), which can be handled by either SandboxBuildboxRun or SandboxRemote

So it looks like the I/O of managing the cache only ever happens immediately before a build activity, or afterwards when we cache the artifact.

In summary, I think I've looked at this enough to feel fairly confident that this activity is not really I/O intensive.

I'm happy to do it differently if you can think of a better way of doing it.

I think I'm pretty happy with this approach of using a dummy sandbox, after having looked deeper into the implementation details :)

)
self.stage_dependency_artifacts(dummy_sandbox, element_list)
digest = dummy_sandbox.get_virtual_directory()._get_digest()
env[digest_variable] = "{}/{}".format(digest.hash, digest.size_bytes)
Contributor

Another issue I discovered when testing with remote execution: if this is used with recc/bazel (#1751) to set a remote execution platform property (e.g. chrootRootDigest), then we also need to upload it to the remote execution CAS.
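For context, the "hash/size_bytes" string maps one-to-one onto a REAPI Digest message, which is what a platform property such as chrootRootDigest ultimately points at, hence the blobs behind it must exist in the remote execution CAS. A hedged sketch (the proto import path is BuildStream-internal and assumed here):

    # Assumed internal path to the REAPI protos bundled with BuildStream.
    from buildstream._protos.build.bazel.remote.execution.v2 import remote_execution_pb2


    def parse_digest(value):
        # DEPENDENCIES_CAS_DIGEST is formatted as "<hash>/<size_bytes>"
        hash_, size_bytes = value.split("/")
        return remote_execution_pb2.Digest(hash=hash_, size_bytes=int(size_bytes))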

There used to be a time when BuildStream could partially download artifacts, which could mean that not all artifact blobs were available locally. This is no longer the case: support for partial artifacts was dropped in favour of the cache storage-service, and partial download is now only used for buildtrees, which aren't used in this case.

Downloading missing blobs from the cache storage-service is handled in `CASCache._send_blobs()`, called a few lines below.
This allows an element to use a secondary sandbox for manipulating artifacts that doesn't affect the build

This allows setting an environment variable inside the sandbox to the CAS digest of one or more dependencies.

Co-authored-by: Adrien Plazas <adrien.plazas@codethink.co.uk>

The subsandboxes can be used to extract a CAS digest that could be used for nested remote execution, and thus need to be available in the remote execution CAS.
@abderrahim
Contributor

Pushed this into the main repository as #2036. Closing.

I will try to transfer the unresolved discussions to the new PR.

@abderrahim abderrahim closed this Jul 10, 2025

Development

Successfully merging this pull request may close these issues.

Add dependency configuration to export a CAS digest as an environment variable
