
refactor: ♻️ Major improvements to license and metadata workflows #176

Merged
slugb0t merged 16 commits into main from staging on Jan 13, 2026

@slugb0t
Member

@slugb0t slugb0t commented Jan 13, 2026

Summary by Sourcery

Refactor and harden license and metadata validation workflows, centralizing file detection and database updates, improving logging, and enhancing dashboard issue rendering to better reflect validation states and service errors.

New Features:

  • Add centralized helpers to detect and fetch metadata, license, code of conduct, and contributing files with richer status objects and content.
  • Introduce structured validation result handling for metadata files, including support for unknown states when external validators fail.
  • Expose new endpoints and models in the validator API to accept raw codemeta payloads and handle null content safely.

Bug Fixes:

  • Ensure metadata and license dashboard sections are updated in place when rerunning validations without corrupting other issue content.
  • Fix handling of missing or malformed metadata and license content, preventing crashes and improving user-facing messages.

Enhancements:

  • Rework metadata database update flow to decouple validation from persistence, add revalidation rules based on commit history, and improve error resilience.
  • Improve license workflow by supporting multiple LICENSE filename variants, persisting custom license titles, and simplifying template rendering.
  • Unify auxiliary FAIR checks (code of conduct and contributing guidelines) under a shared module with reusable file detection logic.
  • Enhance logging utility to consistently format structured messages for console output and remote log ingestion.
  • Simplify Nuxt build configuration and dependencies, including icon handling, transpilation targets, and framework version alignment.

Build:

  • Update Nuxt and front-end dependencies to newer versions and adjust transpilation and icon module configuration accordingly.

Chores:

  • Add bot reset script for cleaning build artifacts and reinstalling dependencies.

@fairdataihub-bot

Thank you for submitting this pull request! We appreciate your contribution to the project. Before we can merge it, we need to review the changes you've made to ensure they align with our code standards and meet the requirements of the project. We'll get back to you as soon as we can with feedback. Thanks again!

@sourcery-ai

sourcery-ai bot commented Jan 13, 2026

Reviewer's Guide

Refactors license and metadata workflows to centralize file detection and validation, adds robust database update flows and logging, modernizes the FAIR dashboard templates, and updates the validator API and Nuxt UI configuration to support richer validation states and dependency alignment.

Sequence diagram for metadata validation and database update workflow

sequenceDiagram
    actor GitHubWebhook
    participant BotComplianceChecks
    participant MetadataModule as metadata_index_js
    participant GitHubAPI as GitHub_API
    participant ValidatorService as Validator_API
    participant Database as DB
    participant Logwatch

    GitHubWebhook->>BotComplianceChecks: push event(context, owner, repository)
    BotComplianceChecks->>MetadataModule: checkMetadataFilesExists(context, owner, repository)
    MetadataModule->>GitHubAPI: GET codemeta.json
    GitHubAPI-->>MetadataModule: 200 or 404
    MetadataModule->>GitHubAPI: GET CITATION.cff
    GitHubAPI-->>MetadataModule: 200 or 404
    MetadataModule-->>BotComplianceChecks: {codemeta, citation}

    BotComplianceChecks->>MetadataModule: updateMetadataDatabase(repoId, subjects, repository, owner, context)

    MetadataModule->>Database: ensureMetadataRecord(repository.id, subjects)
    Database-->>MetadataModule: existing codeMetadata or created record

    MetadataModule->>MetadataModule: determineRevalidationNeeds(context, subjects)
    MetadataModule-->>Logwatch: info revalidation flags

    alt revalidationNeeded
        MetadataModule->>GitHubAPI: gatherMetadata(context, owner, repository)
        GitHubAPI-->>MetadataModule: base metadata
        MetadataModule->>MetadataModule: applyDbMetadata(existing, metadata)

        alt codemetaExists and revalidateCodemeta
            MetadataModule->>GitHubAPI: getCodemetaContent(context, owner, repository)
            GitHubAPI-->>MetadataModule: codemetaContent or null
            alt codemetaContent
                MetadataModule->>ValidatorService: POST /validate-codemeta {file_content}
                ValidatorService-->>MetadataModule: {message, version, error}
                MetadataModule->>MetadataModule: validateCodemeta() -> ValidationResult
                MetadataModule-->>Logwatch: success/warn/info codemeta status
                MetadataModule->>MetadataModule: applyCodemetaMetadata(codemetaContent, metadata, repository)
            else no codemetaContent
                MetadataModule->>MetadataModule: codemetaValidation = ValidationResult.invalid("File not found")
                MetadataModule-->>Logwatch: info no codemeta.json
            end
        end

        alt citationExists and revalidateCitation
            MetadataModule->>GitHubAPI: getCitationContent(context, owner, repository)
            GitHubAPI-->>MetadataModule: citationContent or null
            alt citationContent
                MetadataModule->>ValidatorService: POST /validate-citation {file_path}
                ValidatorService-->>MetadataModule: {message, output, error}
                MetadataModule->>MetadataModule: validateCitation() -> ValidationResult
                MetadataModule-->>Logwatch: success/warn/info citation status
                MetadataModule->>MetadataModule: applyCitationMetadata(citationContent, metadata, repository)
            else no citationContent
                MetadataModule->>MetadataModule: citationValidation = ValidationResult.invalid("File not found")
                MetadataModule-->>Logwatch: info no CITATION.cff
            end
        end

        MetadataModule->>Database: updateMetadataRecord(repoId, metadata, codemetaValidation, citationValidation, subjects)
        Database-->>MetadataModule: updated record
    else noRevalidation
        MetadataModule->>MetadataModule: reuse existing metadata and validation
    end

    MetadataModule-->>BotComplianceChecks: {metadata, validCodemeta, validCitation, codemetaValidation, citationValidation, existing}
    BotComplianceChecks-->>GitHubWebhook: compliance results
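
The validation branches in the diagram above can be sketched as a small result helper plus a mapping from the validator response. This is a minimal sketch under stated assumptions: only `ValidationResult.invalid` and `ValidationResult.error` are named in this PR's discussion, so `valid`, `unknown`, the field names, and `interpretValidatorResponse` are hypothetical, and a `null` response stands in for an unreachable validator service.

```javascript
// Sketch of a result helper. Only `invalid` and `error` are named in this PR;
// `valid`, `unknown`, and the { status, message } fields are assumptions.
const ValidationResult = {
  valid: (message = "") => ({ status: "valid", message }),
  invalid: (message) => ({ status: "invalid", message }),
  unknown: (message) => ({ status: "unknown", message }),
  error: (err) => ({ status: "error", message: String(err?.message ?? err) }),
};

// Hypothetical mapping from a validator response ({message, version, error})
// to a result. `response` is null when the service could not be reached.
function interpretValidatorResponse(response) {
  if (response == null) {
    // A service failure says nothing about the file, so report "unknown".
    return ValidationResult.unknown("Validator service unavailable");
  }
  if (response.error) {
    return ValidationResult.invalid(String(response.error));
  }
  return ValidationResult.valid(response.message ?? "");
}

console.log(interpretValidatorResponse(null).status); // unknown
console.log(interpretValidatorResponse({ error: "missing name" }).status); // invalid
```

Keeping "unknown" separate from "invalid" lets the dashboard distinguish a broken file from a validator outage.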

Sequence diagram for rerunMetadataValidation command workflow

sequenceDiagram
    actor Maintainer
    participant GitHubUI as GitHub_UI
    participant BotApp as Bot_App
    participant CommandHandler as rerunMetadataValidation
    participant MetadataModule as metadata_index_js
    participant LicenseModule as license_index_js
    participant GitHubAPI as GitHub_API
    participant Database as DB
    participant Logwatch

    Maintainer->>GitHubUI: comment /rerun-metadata-validation
    GitHubUI->>BotApp: issue_comment event(context)
    BotApp->>CommandHandler: rerunMetadataValidation(context, owner, repository, issueBody)
    CommandHandler->>Logwatch: start "Rerunning metadata validation"

    CommandHandler->>MetadataModule: checkMetadataFilesExists(context, owner, repository)
    MetadataModule->>GitHubAPI: GET codemeta.json and CITATION.cff
    GitHubAPI-->>MetadataModule: contents or 404
    MetadataModule-->>CommandHandler: subjects {codemeta, citation}

    CommandHandler->>LicenseModule: checkForLicense(context, owner, repository.name)
    LicenseModule->>GitHubAPI: check license files
    GitHubAPI-->>LicenseModule: file or not found
    LicenseModule-->>CommandHandler: license object {status, path, content, spdx_id}
    CommandHandler->>CommandHandler: subjects.license = license

    CommandHandler->>CommandHandler: build syntheticContext with pusher.name = GH_APP_NAME[bot]
    CommandHandler->>MetadataModule: updateMetadataDatabase(repository.id, subjects, repository, owner, syntheticContext)
    MetadataModule->>Database: ensureMetadataRecord and updateMetadataRecord
    Database-->>MetadataModule: updated codeMetadata
    MetadataModule-->>CommandHandler: validation results

    CommandHandler->>MetadataModule: applyMetadataTemplate(subjects, baseTemplate, repository, owner, syntheticContext)
    MetadataModule-->>CommandHandler: newMetadataSection

    CommandHandler->>CommandHandler: replace Metadata section in issueBody
    CommandHandler->>CommandHandler: updatedBody = applyLastModifiedTemplate(updatedBody)

    CommandHandler->>BotApp: createIssue(context, owner, repository, ISSUE_TITLE, updatedBody)
    BotApp->>GitHubAPI: create or update issue
    GitHubAPI-->>BotApp: issue updated
    BotApp-->>CommandHandler: ok
    CommandHandler->>Logwatch: info "Metadata validation rerun completed"

    alt error
        CommandHandler-->>Logwatch: error details
        CommandHandler->>GitHubAPI: update issue with restored body
        GitHubAPI-->>CommandHandler: ok
        CommandHandler-->>BotApp: throw error
    end
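
The "replace Metadata section in issueBody" step above could look roughly like the following. It is a sketch, not the PR's code: `replaceSection` is a hypothetical helper, and it assumes dashboard sections start with level-2 Markdown headings such as `## Metadata`. It also guards against a missing heading instead of slicing at index -1.

```javascript
// Hypothetical helper: swap one "## Heading" section of a Markdown issue body.
function replaceSection(body, heading, newSection) {
  const start = body.indexOf(heading);
  if (start === -1) {
    // Heading missing (legacy or hand-edited issue): append instead of slicing.
    return `${body.trimEnd()}\n\n${newSection}`;
  }
  // The section ends at the next level-2 heading, or at the end of the body.
  const next = body.indexOf("\n## ", start + heading.length);
  const end = next === -1 ? body.length : next + 1;
  return body.slice(0, start) + newSection + body.slice(end);
}

const body = "## LICENSE\nMIT\n\n## Metadata\nold\n\n## Archival\nnone\n";
console.log(replaceSection(body, "## Metadata", "## Metadata\nnew\n\n"));
```

Only the targeted section changes; the LICENSE and Archival sections in the example body are passed through untouched.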

Sequence diagram for rerunLicenseValidation command workflow

sequenceDiagram
    actor Maintainer
    participant GitHubUI as GitHub_UI
    participant BotApp as Bot_App
    participant CommandHandler as rerunLicenseValidation
    participant LicenseModule as license_index_js
    participant MetadataModule as metadata_index_js
    participant GitHubAPI as GitHub_API
    participant Database as DB
    participant Logwatch

    Maintainer->>GitHubUI: comment /rerun-license-validation
    GitHubUI->>BotApp: issue_comment event(context)
    BotApp->>CommandHandler: rerunLicenseValidation(context, owner, repository, issueBody)

    CommandHandler->>Logwatch: start "Rerunning License Validation"
    CommandHandler->>LicenseModule: checkForLicense(context, owner, repository.name)

    LicenseModule->>GitHubAPI: check LICENSE, LICENSE.md, LICENSE.txt
    GitHubAPI-->>LicenseModule: license file or none

    alt license found
        LicenseModule-->>CommandHandler: license {status, path, content, spdx_id}
        CommandHandler->>LicenseModule: updateLicenseDatabase(repository, license)
        LicenseModule->>Database: findUnique licenseRequest(repository_id)
        alt existing entry
            Database-->>LicenseModule: existingLicense
            LicenseModule->>LicenseModule: validateLicense(license, existingLicense)
            LicenseModule->>Database: update licenseRequest
        else no entry
            Database-->>LicenseModule: null
            LicenseModule->>Database: create licenseRequest
        end
    else no license
        LicenseModule-->>CommandHandler: {status: false, path, content: "", spdx_id: null}
        CommandHandler-->>Logwatch: error "License not found"
        CommandHandler-->>BotApp: throw Error
    end

    note over CommandHandler,GitHubAPI: If license updated successfully, rebuild LICENSE section in issue body

    CommandHandler->>CommandHandler: strip last updated timestamp from issueBody
    CommandHandler->>LicenseModule: applyLicenseTemplate({license}, baseTemplate, repository, owner, context)
    LicenseModule-->>CommandHandler: newLicenseSection

    CommandHandler->>CommandHandler: replace LICENSE section in issueBody
    CommandHandler->>CommandHandler: lastModifiedBody = applyLastModifiedTemplate(updatedBody)

    CommandHandler->>BotApp: createIssue(context, owner, repository, ISSUE_TITLE, lastModifiedBody)
    BotApp->>GitHubAPI: create or update issue
    GitHubAPI-->>BotApp: issue updated
    CommandHandler->>Logwatch: info "License validation rerun completed"

    alt any error
        CommandHandler-->>Logwatch: error details
        CommandHandler->>GitHubAPI: restore issue with body without command
        GitHubAPI-->>CommandHandler: ok
        CommandHandler-->>BotApp: throw error with cause
    end
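
The filename check in the diagram above ("check LICENSE, LICENSE.md, LICENSE.txt") can be illustrated with a small lookup. This sketch assumes the root file names are already known; the real `checkForLicense` queries the GitHub API instead. `findLicensePath` and the precedence order are hypothetical.

```javascript
// Candidate names taken from the diagram above; the order defines precedence
// (an assumption, not confirmed by the PR).
const LICENSE_CANDIDATES = ["LICENSE", "LICENSE.md", "LICENSE.txt"];

// Hypothetical helper: return the first candidate present in the repository root.
function findLicensePath(rootFileNames) {
  const present = new Set(rootFileNames);
  return LICENSE_CANDIDATES.find((name) => present.has(name)) ?? null;
}

console.log(findLicensePath(["README.md", "LICENSE.txt"])); // LICENSE.txt
console.log(findLicensePath(["README.md"])); // null
```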

File-Level Changes

Change Details Files
Refactor metadata validation and persistence to be file-centric, robust against errors, and to store rich validation state in the database and dashboard.
  • Add helpers to fetch and detect codemeta.json and CITATION.cff, including null-safe error handling and parallel fetching.
  • Introduce ValidationResult helper and normalizeText utility to standardize validation statuses, error reporting, and JSON/YAML parsing for metadata files.
  • Refactor validateCodemeta and validateCitation to call external validator services, handle service errors as 'unknown' status, and avoid direct DB writes.
  • Add determineRevalidationNeeds and ensureMetadataRecord to control when metadata is revalidated based on commits, license changes, and bot vs human pushes.
  • Rewrite updateMetadataDatabase to orchestrate gathering metadata, applying codemeta/citation data, and updating DB with detailed validation status and messages.
  • Enhance applyMetadataTemplate to render metadata section dynamically based on license/metadata presence and detailed validation results, including emoji/status table.
bot/compliance-checks/metadata/index.js
bot/compliance-checks/index.js
bot/compliance-checks/archival/index.js
Unify license handling around a richer license object, centralize DB updates, and improve LICENSE section rendering and validation rerun behavior.
  • Change checkForLicense to detect multiple LICENSE file names, return structured info including content and SPDX ID, and rely on generic checkForFile.
  • Refactor validateLicense to operate on the new license object and existing DB record, with better handling of NOASSERTION and custom licenses.
  • Introduce updateLicenseDatabase to encapsulate create/update logic for licenseRequest records, preserving custom titles where appropriate.
  • Update applyLicenseTemplate to use the richer license object, show clearer messaging for missing or custom licenses, and rely on updateLicenseDatabase.
  • Rewrite rerunLicenseValidation to use checkForLicense, updateLicenseDatabase, and applyLicenseTemplate while only replacing the LICENSE section of the issue body and preserving the rest.
  • Extend iterateCommitDetails to treat different license filenames consistently and to propagate structured license subject state on add/remove.
bot/compliance-checks/license/index.js
bot/commands/validations/index.js
bot/utils/tools/index.js
Consolidate additional compliance checks (Code of Conduct and Contributing) into a shared module with flexible file detection.
  • Add checkForCodeofConduct and checkForContributingFile utilities that search multiple common paths using checkForFile and return structured status/content.
  • Wire commands and compliance orchestration to import these checks from the new additional-checks module instead of per-check modules.
  • Improve Additional Recommendations copy and rely on structured subjects to render recommendation rows.
bot/compliance-checks/additional-checks/index.js
bot/commands/validations/index.js
Improve rerun metadata validation workflow to leverage new metadata APIs and to surgically update the Metadata section in the FAIR dashboard issue.
  • Change rerunMetadataValidation to use checkMetadataFilesExists, checkForLicense, and updateMetadataDatabase with a synthetic bot context to force revalidation.
  • Use applyMetadataTemplate to regenerate only the Metadata section and splice it into the existing issue body while preserving other sections and the last-updated footer.
  • Add detailed logging and error-handling paths that attempt to restore the issue body on failure and rethrow errors with context.
bot/commands/validations/index.js
Enhance logging infrastructure for better JSON console output and consistent formatting.
  • Add a private _formatForConsole helper to stringify object logs for consola while leaving strings untouched.
  • Update all logwatch log-level methods to pass formatted messages to consola but keep original structure for remote logging.
  • Provide explicit json() helper that uses the same formatting logic for console output before sending structured logs.
bot/utils/logwatch.js
Update validator API and UI dependencies for codemeta validation and Nuxt 4 compatibility, including icon module migration.
  • Change codemeta validation endpoint to use a Flask-RESTX model with fields.Raw and add explicit null checks for file_content, returning 400 on missing input.
  • Update Nuxt config to always transpile UI dependencies, switch from nuxt-icon to @nuxt/icon, and adjust modules list accordingly.
  • Adjust UI package.json dependencies to add @nuxt/icon, bump nuxt to ^4.2.2, align notivue and naive-ui versions, and add vueuc as a runtime dependency.
  • Add a convenience reset script to the bot package.json to clear build artifacts and reinstall dependencies.
validator/apis/__init__.py
ui/nuxt.config.ts
ui/package.json
bot/package.json
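
For the logging change listed above (bot/utils/logwatch.js), here is a minimal sketch of a formatter that stringifies objects for console output and leaves strings untouched. `formatForConsole` is a hypothetical stand-in for the private `_formatForConsole` helper; the indentation width and the fallback for unserializable values are assumptions.

```javascript
// Hypothetical stand-in for the private _formatForConsole helper.
function formatForConsole(message) {
  // Strings pass through unchanged.
  if (typeof message === "string") return message;
  try {
    // JSON.stringify returns undefined for some inputs (e.g. undefined itself),
    // so fall back to String() in that case.
    return JSON.stringify(message, null, 2) ?? String(message);
  } catch {
    // Circular structures cannot be serialized.
    return String(message);
  }
}

console.log(formatForConsole("plain text"));
console.log(formatForConsole({ message: "Rerunning metadata validation", repo: "example" }));
```

The original object would still be sent unformatted to remote logging, as the change description notes; only the console path goes through the formatter.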

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

@fairdataihub-bot

Thanks for making updates to your pull request. Our team will take a look and provide feedback as soon as possible. Please wait for any GitHub Actions to complete before editing your pull request. If you have any additional questions or concerns, feel free to let us know. Thank you for your contributions!

@what-the-diff

what-the-diff bot commented Jan 13, 2026

PR Summary

  • Improved .gitignore Configurations

    • Added the .vscode/ directory to .gitignore so that individual IDE settings are not committed.
  • Created New Functions for Compliance Checks

    • checkForCodeofConduct and checkForContributingFile were implemented in bot/compliance-checks/additional-checks/index.js to locate code of conduct and contributing files across multiple common paths.
  • Refined Feedback Mechanisms in Archival Checks

    • A warning about verifying the license file once one is detected was added to noLicenseText in bot/compliance-checks/archival/index.js.
  • Simplified Compliance-checks Folder

    • Deletion of bot/compliance-checks/citation/index.js and bot/compliance-checks/code-of-conduct/index.js helped minimize redundant features.
  • Deleted Outdated Metadata Management Files

    • The removal of codemeta/index.js and contributing/index.js has simplified the overall metadata checking process.
  • Updated index.js with New Import

    • Removed old imports of checkForCitation and checkForCodeMeta and included checkMetadataFilesExists to ensure accuracy and efficiency in checking metadata files.
  • Adaptations in Package Configurations

    • bot/package.json now includes a new script for resetting the environment, and ui/package.json adds the @nuxt/icon dependency.
  • Enhanced Message Formatting in logwatch.js

    • Introduced _formatForConsole to pre-format console messages and updated the existing log functions to use it.
  • Better License File Verification

    • bot/utils/tools/index.js now treats the different LICENSE filename variants consistently and records structured license state when a license file is added or removed.
  • VS Code Directories Omitted in UI Directory

    • Addition of .vscode/ to the .gitignore file in the UI directory ensures IDE settings remain private.
  • Refined Nuxt Configurations

    • Updates to nuxt.config.ts optimize the transpile section and bring in the new Nuxt icon module.
  • Improved Codemeta File Validation

    • The validator API in validator/apis/__init__.py now uses a model to validate codemeta.json file contents and adds null checks for the required file_content parameter.

@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 2 issues, and left some high level feedback:

  • In several places contains_metadata is computed as !!(subjects.citation & subjects.codemeta) which uses bitwise & instead of logical &&; this will coerce booleans to numbers and can silently produce incorrect state in the DB, so it should be updated to subjects.citation && subjects.codemeta.
  • The subjects.license shape is now inconsistent (boolean in some paths, object with status in others, e.g. runComplianceChecks/iterateCommitDetails vs applyArchivalTemplate and applyLicenseTemplate), which can lead to treating a repository with a license as if it has none; consider normalizing subjects.license to a single object shape throughout.
  • In rerunMetadataValidation and rerunLicenseValidation, the logic that slices the issue body on the 'Last updated' marker and on section headings (## Metadata/## LICENSE) assumes those markers always exist; adding guards for indexOf returning -1 would make the rerun flows more robust against manually edited or legacy issues.
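
One way to act on the second point above is to normalize `subjects.license` at the boundary. The sketch below is illustrative only: `normalizeLicense` is a hypothetical helper, the field names follow the license object shown in the sequence diagrams (`status`, `path`, `content`, `spdx_id`), and the empty defaults are assumptions.

```javascript
// Hypothetical helper: coerce either shape of `subjects.license` into one object.
function normalizeLicense(value) {
  // Default values are assumptions for this sketch.
  const empty = { status: false, path: "", content: "", spdx_id: null };
  if (typeof value === "boolean") {
    // Legacy boolean shape: keep only the presence flag.
    return { ...empty, status: value };
  }
  if (value && typeof value === "object") {
    // Rich shape: fill in missing fields and force `status` to a boolean.
    return { ...empty, ...value, status: Boolean(value.status) };
  }
  return empty;
}

console.log(normalizeLicense(true).status); // true
console.log(normalizeLicense({ status: true, path: "LICENSE", spdx_id: "MIT" }).spdx_id); // MIT
```

With a single shape, callers can always read `license.status` without first checking the type.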
Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `bot/compliance-checks/metadata/index.js:365` </location>
<code_context>
+          contains_metadata: !!(subjects.citation & subjects.codemeta),
</code_context>

<issue_to_address>
**issue (bug_risk):** Bitwise `&` is used instead of logical `&&` when computing `contains_metadata`, which will produce incorrect results.

In both `ensureMetadataRecord` and `updateMetadataRecord`, `contains_metadata` is computed as `!!(subjects.citation & subjects.codemeta)`. Because `&` does numeric coercion, this only behaves like a logical AND while both values are strictly boolean. If either becomes a non-boolean truthy/falsy value, the result can be incorrect. Please use `subjects.citation && subjects.codemeta` instead to avoid this subtle bug.
</issue_to_address>
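
To illustrate the difference described in this comment, here is a minimal sketch. The `subjects` values are hypothetical; the `{ status: true }` shape is assumed only for illustration, modeled on the richer status objects this PR introduces.

```javascript
// Hypothetical inputs: plain booleans versus richer status objects.
const booleans = { citation: true, codemeta: true };
const objects = { citation: { status: true }, codemeta: { status: true } };

// Bitwise AND coerces both operands to 32-bit integers first.
const bitwise = (s) => !!(s.citation & s.codemeta);
// Logical AND only checks truthiness.
const logical = (s) => !!(s.citation && s.codemeta);

console.log(bitwise(booleans), logical(booleans)); // true true
console.log(bitwise(objects), logical(objects)); // false true
```

With object operands, each side coerces to NaN and then to 0, so the bitwise form reports `false` even though both files are present.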

### Comment 2
<location> `bot/compliance-checks/metadata/index.js:109-118` </location>
<code_context>
+      `Metadata validation rerun completed for repo: ${repository.name} (ID: ${repository.id})`
     );
-    await createIssue(context, owner, repository, ISSUE_TITLE, lastModified);
   } catch (error) {
-    // Remove the command from the issue body
-    const issueBodyRemovedCommand = issueBody.substring(
-      0,
-      issueBody.indexOf(`<sub><span style="color: grey;">Last updated`)
-    );
-    const lastModified = await applyLastModifiedTemplate(
-      issueBodyRemovedCommand
+    logwatch.error(
+      {
+        message: "Failed to rerun metadata validation",
+        repo: repoInfo,
+        error: error.message,
+        stack: error.stack,
+      },
+      true
     );
-    await createIssue(context, owner, repository, ISSUE_TITLE, lastModified);
</code_context>

<issue_to_address>
**suggestion (bug_risk):** Codemeta fetch/validation errors are logged but do not update the validation status, which may leave stale or misleading DB state.

When `codemeta.json` fetch fails in `updateMetadataDatabase`, the catch only logs a warning and leaves `codemetaValidation` unchanged (likely carrying over a previous successful state). This can cause the DB to show a stale “valid” status when validation actually failed. Align this with the CITATION branch by setting `codemetaValidation = ValidationResult.error(error)` in the catch so `updateMetadataRecord` captures the current failure correctly.
</issue_to_address>
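
The suggested fix can be sketched as follows. The names are assumptions for illustration: `toErrorResult` stands in for `ValidationResult.error`, and `fetchAndValidate` is an injected async function that fetches and validates `codemeta.json` or throws.

```javascript
// Hypothetical stand-in for the PR's ValidationResult.error helper.
const toErrorResult = (err) => ({ status: "error", message: err.message });

// `fetchAndValidate` is an injected async function (an assumption for this
// sketch) that returns a validation result or throws on fetch failure.
async function resolveCodemetaValidation(fetchAndValidate, previous) {
  let codemetaValidation = previous;
  try {
    codemetaValidation = await fetchAndValidate();
  } catch (err) {
    // Overwrite the carried-over state so a stale "valid" is never persisted.
    codemetaValidation = toErrorResult(err);
  }
  return codemetaValidation;
}

resolveCodemetaValidation(
  async () => {
    throw new Error("fetch failed");
  },
  { status: "valid", message: "" }
).then((result) => console.log(result.status)); // error
```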


@slugb0t slugb0t merged commit 173149c into main Jan 13, 2026
5 checks passed
@fairdataihub-bot

Thanks for closing this pull request! If you have any further questions, please feel free to open a new issue. We are always happy to help!

@slugb0t slugb0t deleted the staging branch February 17, 2026 21:11