Skip to content

Move discovery to a separate process#538

Merged
tfoote merged 1 commit intoros-infrastructure:ros2from
rkent:save-discovery
Jun 11, 2025
Merged

Move discovery to a separate process#538
tfoote merged 1 commit intoros-infrastructure:ros2from
rkent:save-discovery

Conversation

@rkent
Copy link
Contributor

@rkent rkent commented Jun 4, 2025

This moves the discovery step away from being a section of generate, and into a separate process that generates an artifact. "Discovery" consists of parsing repos in the various distribution.yaml to determine which of the various URIs that could be represented in source, doc, or release sections (including tracks files) to determine a single URI for a repo that will be the source of indexing. Various other tidbits from distribution.yaml are collected as well.

The artifact generated here represents the information necessary to move forward with #511. That PR will be modified in a similar way as here to generate an artifact file containing the scraped repo information (including its description), taking as input the discovery.json artifact from this PR.

One important decision I made here that might be controversial is that the discovery.json artifact always represents the full list of repos, not respecting the max_repos config value. The reason is it seems fairly complex to try to figure out if a previous discovery.json file is limited in its repo count, then try to correct for that. A full discovery takes about 90 seconds on my system, and eliminating this source of complexity seems worth it. One downside is that running a test-build without an existing discovery.json artifact (for example in the git hub CI tests) will take an additional 90 seconds.

Signed-off-by: R Kent James <kent@caspia.com>
@rkent
Copy link
Contributor Author

rkent commented Jun 4, 2025

PS I also stopped using the cache by default. AFAICT the cache is not saving time and causes problems sometimes. The system of artifacts we are implementing will replace it.

Copy link
Member

@tfoote tfoote left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for isolating this out. It's really great to see that artifact file and be able to see what information is needed/present.

The cache quite possible is over optimization. And you're right that a rebuilt is a few minutes to do discovery. However for a fresh build it involves the checkout of 2+ GB of new release repositories. It's a one time cost for a dev environment. I'm inclined to take it as it will get amortized. But I want to figure out what information we're getting from the release repositories and see if we can get it another way. From my first review everything in the discovery.json is something that's already in the rosdistro cache. Which would take this from minutes and GBs of data down to a few seconds. We can do that in another cycle though.

I noticed one warning that might be worth sorting out after too.

DEFAULT_LANGUAGE_PREFIX = 'en'
HEAVY_CHECKMARK = "\u2714"
HEAVY_MINUS = "\u2796"
DISCOVERY_RESULTS = "_artifacts/discovery.json"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These might come in from the require_relative already.

There's warnings about them in the console output

/home/tullyfoote/rosindex/_plugins/rosindex_generator.rb:38: warning: already initialized constant DISCOVERY_RESULTS
/home/tullyfoote/rosindex/_ruby_libs/discovery.rb:12: warning: previous definition of DISCOVERY_RESULTS was here
/home/tullyfoote/rosindex/_plugins/rosindex_generator.rb:39: warning: already initialized constant DISCOVERY_ERRORS
/home/tullyfoote/rosindex/_ruby_libs/discovery.rb:13: warning: previous definition of DISCOVERY_ERRORS was here

I'm surprised that this is getting hit during make discovery.

@tfoote tfoote merged commit ecddb7d into ros-infrastructure:ros2 Jun 11, 2025
1 check passed
@rkent
Copy link
Contributor Author

rkent commented Jun 11, 2025

From my first review everything in the discovery.json is something that's already in the rosdistro cache.

You keep saying this. Maybe I am missing something, but AFAICT what the rosdistro cache adds is package.xml for each package. There is nothing in the package.xml that allows us to specify a canonical URL for each repo, which is what is needed for rosindex. That requires parsing the source, doc, and release portions of rosdistro, and deciding what is the canonical URL depending on that info. We could use the distro cache instead of rosdistro, but the information is the same and the work required would be the same.

What is slow in the discovery is the need to download the release repo to get the canonical url, for repos that do not have a source or doc entry. Again, this information is not available in the current form of rosdistro cache. I suspect that during the preparation of the rosdistro cache that information is available, but it is not stored at the moment in a way accessible to rosindex.

@tfoote
Copy link
Member

tfoote commented Jun 12, 2025

Yeah, if there's a small bit of extra data we can just add it to the cache. That's approximately 1 PR away. We can also consider doing things like adding a policy that a doc tag is required. The main thing that I want to do is not have a separate ruby parser for the rosdistro. We have a stable rosdistro python library that can collect all that information and we use it through our other toolchains and I don't want to have this entirely separate process that we have to maintain in parallel. It also has provider specific logic that can fetch files via API instead of requiring a full checkout. It also has robust incremental logic such that it runs in ~11 seconds per hourly rebuild of the cache: https://build.ros2.org/job/humble_rosdistro-cache/buildTimeTrend

If we get to the point that we need/want extra info for rosindex we can provide that to anyone who uses the existing python tools. I've got a branch right now that pulls in the readme's as well. And can relatively easily extend that to CHANGELOGSs. We need to make sure it doesn't balloon too much to degrade other users. If we do hit that point we can do things like adding extra content in an opt in extended cache or something like that.

We also can do things like consider changing policy to enforce that source or doc urls exist in the rosdistro at least looking forward. The policy changes and or format changes to enable us to have a clearer understanding of what the content is. If we have to have complex logic for determining a canonical URL for something in rosindex then everyone is potentially confused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants