Data missing in Zenodo v2 archive, RMSD calculation errors, and request for pre-computed metrics

Hi @CAODH,

First of all, thank you for your excellent work on MolGenBench and for making the code and data publicly available. It’s a very valuable benchmark for the community.

I’ve been trying to reproduce the evaluation results using the version 2 data uploaded to Zenodo and have run into a few issues that I wanted to bring to your attention.

**1. Missing files in Zenodo dataset**

When running the provided evaluation scripts, several `.sdf` files from generative sampling or Vina docking turn out to be missing or empty, which prevents full reproduction of the reported results for certain methods and targets. Some of the affected files I’ve identified:

- TargetDiff:
  - `P06239/Round2/De_novo_Results/TargetDiff_generated_molecules/P06239_TargetDiff_generated_molecules_vina_docked.sdf` is missing.
  - `P06239/Round3/De_novo_Results/TargetDiff_generated_molecules/P06239_TargetDiff_generated_molecules_vina_docked.sdf` is missing.
- DecompDiff:
  - `P00915/Round1/De_novo_Results/DecompDiff_generated_molecules/P00915_DecompDiff_generated_molecules.sdf` is missing.
  - `P00915/Round1/De_novo_Results/DecompDiff_generated_molecules/P00915_DecompDiff_generated_molecules_vina_docked.sdf` is missing.
  - ...
- ShapeMol_Hit_to_Lead:
  - `P55055/Round1/Hit_to_Lead_Results/Sries63387/ShapeMol_Hit_to_Lead/P55055_Sries63387_ShapeMol_Hit_to_Lead.sdf` is missing.
  - `P55055/Round1/Hit_to_Lead_Results/Sries63387/ShapeMol_Hit_to_Lead/P55055_Sries63387_ShapeMol_Hit_to_Lead_vina_docked.sdf` is missing.
  - ...
- shepherd_x1x3x4_mosesaq_submission_Hit_to_Lead:
  - `Q07817/Round1/Hit_to_Lead_Results/Sries44140/shepherd_x1x3x4_mosesaq_submission_Hit_to_Lead/Q07817_Sries44140_shepherd_x1x3x4_mosesaq_submission_Hit_to_Lead.sdf` is missing.
  - `Q07817/Round1/Hit_to_Lead_Results/Sries44140/shepherd_x1x3x4_mosesaq_submission_Hit_to_Lead/Q07817_Sries44140_shepherd_x1x3x4_mosesaq_submission_Hit_to_Lead_vina_docked.sdf` is missing.
  - ...
- DeleteHit2Lead(CrossDock)_Hit_to_Lead:
  - `P42336/Round1/Hit_to_Lead_Results/Sries23245/DeleteHit2Lead(CrossDock)_Hit_to_Lead/P42336_Sries23245_DeleteHit2Lead(CrossDock)_Hit_to_Lead.sdf` is missing.
  - `P42336/Round1/Hit_to_Lead_Results/Sries23245/DeleteHit2Lead(CrossDock)_Hit_to_Lead/P42336_Sries23245_DeleteHit2Lead(CrossDock)_Hit_to_Lead_vina_docked.sdf` is missing.
  - ...
- DiffDec_Hit_to_Lead:
  - `Q05397/Round1/Hit_to_Lead_Results/Sries59918/DiffDec_Hit_to_Lead/Q05397_Sries59918_DiffDec_Hit_to_Lead.sdf` is missing.
  - `Q05397/Round1/Hit_to_Lead_Results/Sries59918/DiffDec_Hit_to_Lead/Q05397_Sries59918_DiffDec_Hit_to_Lead_vina_docked.sdf` is missing.
  - `Q05397/Round2/Hit_to_Lead_Results/Sries59918/DiffDec_Hit_to_Lead/Q05397_Sries59918_DiffDec_Hit_to_Lead.sdf` is an empty file.
  - ...

As a result, the evaluation pipeline either skips these entries or raises an error (`OSError: File error: Invalid input file`). Could you please check and, if possible, re-upload the complete dataset?
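In case it helps with checking the archive, here is a minimal sketch of how I listed the problem files. It is not part of the MolGenBench code; the function names `check_sdf_files` and `find_empty_sdf` are my own, and the root directory is whatever folder the Zenodo archive was extracted into.

```python
from pathlib import Path


def check_sdf_files(root, relative_paths):
    """Classify each expected file as 'ok', 'missing', or 'empty'."""
    report = {}
    for rel in relative_paths:
        path = Path(root) / rel
        if not path.is_file():
            report[rel] = "missing"
        elif path.stat().st_size == 0:
            report[rel] = "empty"
        else:
            report[rel] = "ok"
    return report


def find_empty_sdf(root):
    """Return every zero-byte .sdf file below root, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*.sdf")
        if p.is_file() and p.stat().st_size == 0
    )
```

The first function needs a list of paths that are expected to exist (for example, built from the target/round/method combinations the evaluation scripts iterate over); the second only needs the extracted archive and finds empty files without any prior knowledge.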

**2. RMSD calculation failures**

During the RMSD metric calculation, many molecules fail with the following error:
`[RMSDMetric] Symmetry RMSD failed: Graphs are not isomorphic. Make sure graphs have the same connectivity.`

My hypothesis is that this might be caused by a mismatch in molecular topology (connectivity) between the original generated molecules and the Vina-docked output (possibly during the .pdbqt to .sdf conversion). Has this issue been encountered before? If there is a known fix or preprocessing step to ensure graph isomorphism, it would be very helpful to know.
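To test this hypothesis without depending on any cheminformatics toolkit, I used a rough diagnostic along the following lines. It is only a sketch under several assumptions: the files are V2000 mol blocks, hydrogens and bond orders are ignored, and the helper names (`parse_molblock`, `heavy_atom_signature`, `same_heavy_atom_connectivity`) are mine, not part of the repository. Equal signatures are a necessary condition for graph isomorphism, not a sufficient one, so this can flag a mismatch but cannot prove that two graphs match.

```python
from collections import Counter


def parse_molblock(molblock):
    """Parse a V2000 mol block into (symbols, bonds); bond indices are 1-based."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line follows the three header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    # Element symbol occupies columns 32-34 (0-based slice 31:34) of each atom line.
    symbols = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    bonds = []
    for i in range(n_bonds):
        line = lines[4 + n_atoms + i]
        bonds.append((int(line[0:3]), int(line[3:6])))
    return symbols, bonds


def heavy_atom_signature(molblock):
    """Multiset of (element, sorted heavy-neighbor elements), ignoring H atoms."""
    symbols, bonds = parse_molblock(molblock)
    neighbors = {i: [] for i in range(1, len(symbols) + 1)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    signature = Counter()
    for i, sym in enumerate(symbols, start=1):
        if sym == "H":
            continue
        env = tuple(sorted(
            symbols[j - 1] for j in neighbors[i] if symbols[j - 1] != "H"
        ))
        signature[(sym, env)] += 1
    return signature


def same_heavy_atom_connectivity(block_a, block_b):
    """False means the two heavy-atom graphs cannot be isomorphic."""
    return heavy_atom_signature(block_a) == heavy_atom_signature(block_b)
```

Running this on a generated molecule and its docked counterpart shows quickly whether atoms or bonds were lost or added during conversion; a `False` result would be consistent with the connectivity mismatch reported in the error above.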

**3. Request for pre-computed evaluation metrics**

Currently, the Zenodo archive contains the generated molecules and docked outputs, but to obtain the final evaluation metrics (e.g., hit rediscovery rate, RMSD, IFP score), users must run the entire evaluation pipeline locally. This is **quite time-consuming**, and the provided Jupyter notebooks for visualization are somewhat difficult to follow.

Would it be possible for you to **release the pre-computed evaluation results** (e.g., CSV/JSON files with all calculated metrics) alongside the molecular data? This would greatly facilitate reproducibility, analysis, and comparison for users who wish to build upon your benchmark without re-running lengthy computations.

Additionally, if the notebooks could be **refactored for better readability** (e.g., with clearer sections, comments, and maybe a simplified plotting interface), that would be a great help for the community.

Thank you again for your hard work and for considering these points. I’m happy to provide more details or logs if needed.
