Conversation

@dacharyc
Collaborator

This PR adds a new script to compare the GDCD ingest logs (from the Snooty Data API ingest job) to the audit-cli output from local monorepo files.

While investigating discrepancies, I also discovered that docs-k8s-operator is deprecated, so we should no longer ingest data for it during our weekly ingest job.

Collaborator

@cbullinger cbullinger left a comment

love it!

Comment on lines +37 to +39
1. Runs audit-cli once to identify projects that exist only in audit-cli (not in the log)
2. Re-runs audit-cli with those projects excluded using the `--exclude-dirs` flag
3. Compares the filtered results for a cleaner comparison
Collaborator

nice
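
For readers skimming the thread, the three steps quoted above could look roughly like the sketch below. This is not the PR's code: `auditRunner`, `onlyInAudit`, and `twoPassCompare` are hypothetical names, the runner is a fake that stands in for actually invoking audit-cli, and the project names and page counts are made-up sample values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// auditRunner is a hypothetical stand-in for invoking audit-cli. It returns
// page counts per project, skipping any projects listed in excludeDirs.
type auditRunner func(excludeDirs []string) map[string]int

// onlyInAudit returns the sorted project names that appear in the audit
// results but not in the log results.
func onlyInAudit(logCounts, auditCounts map[string]int) []string {
	var names []string
	for name := range auditCounts {
		if _, ok := logCounts[name]; !ok {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

// twoPassCompare runs the audit once to find audit-only projects, then
// re-runs it with those projects excluded. It returns the filtered audit
// counts and the list of excluded projects.
func twoPassCompare(logCounts map[string]int, run auditRunner) (map[string]int, []string) {
	first := run(nil)
	excluded := onlyInAudit(logCounts, first)
	if len(excluded) == 0 {
		return first, excluded // nothing to filter, so one pass is enough
	}
	return run(excluded), excluded
}

func main() {
	// Made-up sample data; these are not real project names or counts.
	logCounts := map[string]int{"project-a": 120, "project-b": 80}
	fake := func(excludeDirs []string) map[string]int {
		all := map[string]int{"project-a": 118, "project-b": 80, "project-c": 12}
		for _, dir := range excludeDirs {
			delete(all, dir)
		}
		return all
	}

	filtered, excluded := twoPassCompare(logCounts, fake)
	fmt.Println("excluded on second run:", strings.Join(excluded, ", "))
	for _, name := range []string{"project-a", "project-b"} {
		fmt.Printf("%s: log=%d audit=%d\n", name, logCounts[name], filtered[name])
	}
}
```

Injecting the runner as a function keeps the comparison logic testable without audit-cli installed; in the real script the second call would presumably pass the excluded names via `--exclude-dirs`.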

// projectNameMapping maps log file project names to their audit-cli equivalents.
// This handles cases where the same project has different names in the GDCD logs
// versus the audit-cli output. Add new mappings here as needed.
var projectNameMapping = map[string]string{
Collaborator

ugh another place to custom map names 😑
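
As a rough illustration of how a mapping like this is typically consulted, here is a minimal sketch. The map entries below are placeholders, since the PR's actual entries aren't shown in this thread, and `normalizeProjectName` is a hypothetical helper name rather than something confirmed to exist in the script.

```go
package main

import "fmt"

// exampleNameMapping uses placeholder entries only; the real contents of
// projectNameMapping are not visible in this conversation.
var exampleNameMapping = map[string]string{
	"example-log-name": "example-audit-name",
}

// normalizeProjectName returns the audit-cli name for a log project name,
// falling back to the original name when no mapping exists.
func normalizeProjectName(mapping map[string]string, logName string) string {
	if mapped, ok := mapping[logName]; ok {
		return mapped
	}
	return logName
}

func main() {
	fmt.Println(normalizeProjectName(exampleNameMapping, "example-log-name"))
	fmt.Println(normalizeProjectName(exampleNameMapping, "unmapped-project"))
}
```

With a fallback like this, a project missing from the map keeps its log name, which is how an unmapped project would end up reported as appearing only in the log.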

Comment on lines 237 to 238
- **Only in log**: Projects found in the log but not in audit-cli output (may indicate project name mismatches)
- **Only in audit-cli**: Projects found in audit-cli but not in the log - these are automatically excluded in the second run for a cleaner comparison
Collaborator

i'm confused by the output, i think. will these ever be populated? e.g. i see we have a handful of "only in <log/audit-cli>" entries but these are both 0 in the summary -- would they have values on the first run?

Collaborator Author

The way the code is structured, both counts are populated on the "initial run". The tool then re-runs audit-cli with excluded dirs for the "only in audit-cli" entries, at which point that number is reduced to 0. The "only in log" entries can be populated by new projects that we haven't added a name mapping for (if the project name in the log does not match the name in audit-cli), but will probably be 0 other than that.

Collaborator

the example we give shows both types in the results, though, which is why i'm confused. shouldn't the "only in log" count in the summary reflect the three results that are marked "only in log"?
i'm also not really seeing the value of showing "only in audit-cli" if it effectively gets reduced to 0 every time

Collaborator Author

Yeah, good points. Made some minor tweaks to the way the output is generated to:

  • Omit "only in audit-cli", since it is always reduced to 0 after the second run
  • Show "only in log" only when there are projects that appear in the log but not in audit-cli
  • Check that audit-cli is available before trying to run the comparison

I also updated the example output in the README so hopefully this is all consistent now. 🤞

@dacharyc dacharyc merged commit 80eefc9 into main Dec 11, 2025
1 check passed
@dacharyc dacharyc deleted the gdcd-compare-page-counts branch December 11, 2025 21:07
3 participants