Skip to content

Add method to get QC web reports from db with web api#1074

Open
EddieLF wants to merge 3 commits intodevfrom
update_qc_report_links
Open

Add method to get QC web reports from db with web api#1074
EddieLF wants to merge 3 commits intodevfrom
update_qc_report_links

Conversation

@EddieLF
Copy link
Contributor

@EddieLF EddieLF commented Jun 17, 2025

This PR

  • Introduces a new WebAPI endpoint to get QC report entries for a given project, and for given sequencing types and analysis.meta stages.
  • Adds a new method to the WebLayer to query the database for the QC entries
  • Adds a new method to the WebDb which queries the analysis and analysis_outputs table for QC records with the given sequencing types and stages.
  • Adds new models ProjectQcWebReport and ProjectQcWebReportInternal
    • These models capture the analysis information (id, timestamp, seq type, stage, output path, sequencing groups)
    • Contains a method to transform the analysis.output into the pure HTML path that can be visited in the browser

Still TODO:

  • Update the frontend code to pull in the QC web reports for a dataset and display the URLs for the most recent reports
  • Possibly add logic to display multiple / all report links? And let the user see what sequencing groups are in each report?

At present, when visiting the Metamist Project page, the user is presented with links to MultiQC reports.

image

These reports are hard-coded URLs that are based on the dataset name and nothing else (see the frontend code). This is bad because these links go to historic QC reports, not recent ones. Mistakenly viewing old reports can lead to errors and misunderstandings from unaware analysts.

To fix this, we need to query the database to get the analysis records and file paths to the QC outputs, and then use some logic to feed the HTML path of the latest report in to the front end.

@dancoates dancoates self-requested a review June 17, 2025 04:00
@codecov
Copy link

codecov bot commented Jun 17, 2025

Codecov Report

Attention: Patch coverage is 56.41026% with 17 lines in your changes missing coverage. Please review.

Project coverage is 82.48%. Comparing base (1777145) to head (4b3ce3c).

Files with missing lines Patch % Lines
models/models/web.py 70.83% 7 Missing ⚠️
api/routes/web.py 28.57% 5 Missing ⚠️
db/python/layers/web.py 37.50% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1074      +/-   ##
==========================================
- Coverage   82.54%   82.48%   -0.07%     
==========================================
  Files         189      189              
  Lines       16535    16575      +40     
==========================================
+ Hits        13649    13672      +23     
- Misses       2886     2903      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@dancoates dancoates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good just a few tweaks needed

JSON_UNQUOTE(JSON_EXTRACT(a.meta, '$.stage')) as stage,
GROUP_CONCAT(DISTINCT asg.sequencing_group_id) as sequencing_groups
FROM analysis a
LEFT JOIN analysis_output ao ON ao.analysis_id = a.id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table name is analysis_outputs

AND a.status = 'COMPLETED'
AND JSON_UNQUOTE(JSON_EXTRACT(a.meta, '$.sequencing_type')) IN :sequencing_types
AND JSON_UNQUOTE(JSON_EXTRACT(a.meta, '$.stage')) in :stages
GROUP BY a.id, ao.output, a.timestamp_completed, sequencing_type, stage
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to group by a.timestamp_completed, sequencing_type, stage as these all relate to analysis table fields and you're already grouping by the primary key id

a.id,
a.timestamp_completed,
ao.output,
JSON_UNQUOTE(JSON_EXTRACT(a.meta, '$.sequencing_type')) as sequencing_type,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to use JSON_VALUE rather than JSON_UNQUOTE(JSON_EXTRACT( here, I think it'll be a little faster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for in the WHERE statement below

if not self.output or not self.output.startswith('gs://'):
return None
bucket_name, blob_name = self.output.removeprefix('gs://').split('/', 1)
blob_name = blob_name.replace('multiqc_data.json', 'multiqc.html')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a check somewhere here that the output contains multiqc. There are other types of QC report that could passed through this - it'd be better for them to return None rather than a broken link.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this to the check above

Honestly the way the multiqc reports have been written as analysis records is so not ideal. Instead of the HTML being written to the analysis record, only the multiqc_data.json path has been recorded. Hence the need for this transform. Really, we should be leveraging multiple analysis outputs for this kind of record. But for now this should work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed, it would be good if this could be more generic and support linking to any type of analysis output which identifies itself as a HTML report.

):
"""Get qc web report analyses for a project filtered by sequencing type and stage."""
_query = """
SELECT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query is only looking in the analysis_outputs table but will need to additionally look in the output_file table for qc reports there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, now joining to output_file using the analysis_outputs.file_id field, and using the output_file.path field as our output path returned by the query

@EddieLF EddieLF requested a review from dancoates June 18, 2025 01:25
if not self.output or not self.output.startswith('gs://'):
return None
bucket_name, blob_name = self.output.removeprefix('gs://').split('/', 1)
blob_name = blob_name.replace('multiqc_data.json', 'multiqc.html')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agreed, it would be good if this could be more generic and support linking to any type of analysis output which identifies itself as a HTML report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants