Move discovery to a separate process #538
Conversation
Signed-off-by: R Kent James <kent@caspia.com>
PS: I also stopped using the cache by default. AFAICT the cache is not saving time and sometimes causes problems. The system of artifacts we are implementing will replace it.
tfoote
left a comment
Thanks for isolating this out. It's really great to see that artifact file and be able to see what information is needed/present.
The cache quite possibly is over-optimization. And you're right that a rebuild takes only a few minutes to do discovery. However, a fresh build involves the checkout of 2+ GB of new release repositories. It's a one-time cost for a dev environment, so I'm inclined to take it, as it will get amortized. But I want to figure out what information we're getting from the release repositories and see if we can get it another way. From my first review, everything in discovery.json is something that's already in the rosdistro cache, which would take this from minutes and GBs of data down to a few seconds. We can do that in another cycle, though.
I noticed one warning that might be worth sorting out afterwards, too.
DEFAULT_LANGUAGE_PREFIX = 'en'
HEAVY_CHECKMARK = "\u2714"
HEAVY_MINUS = "\u2796"
DISCOVERY_RESULTS = "_artifacts/discovery.json"
These might already come in from the require_relative.
There are warnings about them in the console output:
/home/tullyfoote/rosindex/_plugins/rosindex_generator.rb:38: warning: already initialized constant DISCOVERY_RESULTS
/home/tullyfoote/rosindex/_ruby_libs/discovery.rb:12: warning: previous definition of DISCOVERY_RESULTS was here
/home/tullyfoote/rosindex/_plugins/rosindex_generator.rb:39: warning: already initialized constant DISCOVERY_ERRORS
/home/tullyfoote/rosindex/_ruby_libs/discovery.rb:13: warning: previous definition of DISCOVERY_ERRORS was here
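One way to silence these warnings is to guard each definition so a second load is a no-op. This is only a minimal sketch of that pattern, not the actual rosindex fix (removing the duplicate definitions in rosindex_generator.rb and relying on the require_relative would work as well):

```ruby
# Guarded definition: Ruby warns on any constant reassignment, even to the
# same value, so only define the constant when it is not already defined.
DISCOVERY_RESULTS = "_artifacts/discovery.json" unless defined?(DISCOVERY_RESULTS)

# Simulate the file being loaded a second time (as happens when both
# discovery.rb and rosindex_generator.rb define the constant):
# no "already initialized constant" warning is emitted, and the
# original value is kept.
DISCOVERY_RESULTS = "_artifacts/discovery.json" unless defined?(DISCOVERY_RESULTS)
```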
I'm surprised that this is getting hit during make discovery.
You keep saying this. Maybe I am missing something, but AFAICT what the rosdistro cache adds is the package.xml for each package. There is nothing in the package.xml that lets us specify a canonical URL for each repo, which is what rosindex needs. That requires parsing the source, doc, and release portions of rosdistro and deciding what the canonical URL is based on that info. We could use the distro cache instead of rosdistro, but the information is the same and the work required would be the same. What is slow in the discovery is the need to download the release repo to get the canonical URL, for repos that do not have a
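The precedence described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual rosindex code; the hash keys mirror the distribution.yaml section names, and the function name is invented:

```ruby
# Choose one canonical URL per repo from its rosdistro entries.
# A doc or source entry gives the URL directly; only a release-only
# entry may force downloading the release repo to read its tracks file.
def canonical_url(repo)
  repo["doc"] || repo["source"] || repo["release"]
end

repo = {
  "source"  => "https://github.com/ros/ros_comm.git",
  "release" => "https://github.com/ros-gbp/ros_comm-release.git"
}
canonical_url(repo)  # => "https://github.com/ros/ros_comm.git"
```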
Yeah, if there's a small bit of extra data we can just add it to the cache. That's approximately one PR away. We can also consider adding a policy that a doc tag is required.

The main thing is that I don't want a separate Ruby parser for the rosdistro. We have a stable rosdistro Python library that can collect all that information, we use it through our other toolchains, and I don't want an entirely separate process that we have to maintain in parallel. It also has provider-specific logic that can fetch files via API instead of requiring a full checkout, and robust incremental logic such that it runs in ~11 seconds per hourly rebuild of the cache: https://build.ros2.org/job/humble_rosdistro-cache/buildTimeTrend

If we get to the point that we need/want extra info for rosindex, we can provide that to anyone who uses the existing Python tools. I've got a branch right now that pulls in the READMEs as well, and it could relatively easily be extended to CHANGELOGs. We need to make sure it doesn't balloon so much that it degrades things for other users; if we do hit that point, we can do things like putting the extra content in an opt-in extended cache.

We can also consider changing policy to enforce that source or doc URLs exist in the rosdistro, at least looking forward. Policy and/or format changes would give us a clearer understanding of what the content is. If we need complex logic to determine a canonical URL for something in rosindex, then everyone is potentially confused.
This moves the discovery step out of generate and into a separate process that produces an artifact. "Discovery" consists of parsing the repos in the various distribution.yaml files, examining the URIs that could be represented in the source, doc, or release sections (including tracks files), and deciding on a single canonical URI for each repo that will be the source of indexing. Various other tidbits from distribution.yaml are collected as well.
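To make the idea of "one canonical URI per repo plus other tidbits" concrete, here is a purely hypothetical illustration of what an entry in the artifact might look like. The field names and values are invented for illustration; the real discovery.json schema is defined by this PR, not here:

```ruby
require 'json'

# Invented example entry: one record per repo, keyed by repo name,
# holding the single canonical URI chosen during discovery.
discovery = {
  "ros_comm" => {
    "uri"     => "https://github.com/ros/ros_comm.git",
    "distros" => ["noetic"]
  }
}

json = JSON.pretty_generate(discovery)
JSON.parse(json)["ros_comm"]["uri"]  # => "https://github.com/ros/ros_comm.git"
```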
The artifact generated here represents the information necessary to move forward with #511. That PR will be modified in a similar way as here to generate an artifact file containing the scraped repo information (including its description), taking as input the discovery.json artifact from this PR.
One important decision I made here that might be controversial: the discovery.json artifact always represents the full list of repos, ignoring the max_repos config value. The reason is that it seems fairly complex to figure out whether a previous discovery.json file is limited in its repo count and then correct for that. A full discovery takes about 90 seconds on my system, and eliminating this source of complexity seems worth it. One downside is that running a test build without an existing discovery.json artifact (for example, in the GitHub CI tests) will take an additional 90 seconds.
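Under this decision, any max_repos limit would be applied when the artifact is read, not when it is written. A minimal sketch of that consumption side, assuming a JSON object keyed by repo name (the helper name and artifact shape are assumptions, not the PR's actual code):

```ruby
require 'json'

# Read the full discovery artifact and truncate only at build time,
# so the artifact on disk always contains every repo.
def repos_for_build(discovery_json, max_repos: nil)
  repos = JSON.parse(discovery_json)
  max_repos ? repos.first(max_repos).to_h : repos
end

artifact = JSON.generate({ "a" => {}, "b" => {}, "c" => {} })
repos_for_build(artifact, max_repos: 2).keys  # => ["a", "b"]
```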