[reproducibility] (Re-)generation of biosystems.txt and DisGeNET_diseases.txt #12

@cthoyt

The documentation says that these files were created from ChEMBL 24, PubChem, and DisGeNET. There have been several releases since then with more data, which could improve the quality and utility of your models.
However, it's not clear how these resource files were created. To assess the correctness of the work, the data pipeline needs to be shown to be not only reproducible but also sensible. Seeing the code would reveal which special cases came up and how they were handled. Those are the details that would make your output differ from what somebody would produce by following your work as a guide without access to your code.

This should also apply to the two resources that you ask the user to download.

Caveat: While ChEMBL has versioned downloads, PubChem is a rolling release and only offers downloads for the most recent months/days. I'm not sure about DisGeNET. I know this might make it impossible to regenerate the exact datasets, which is why it's also good to have the dumps in this repo, so thanks for that.
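
Since exact regeneration may not be possible for the rolling sources, one thing that could help is recording provenance next to the dumps. Below is a minimal sketch of what I mean, not a claim about how the files were actually built: the function names (`sha256_of`, `build_manifest`, `write_manifest`), the manifest layout, and the source labels are all hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: Dict[Path, str], retrieved: str) -> dict:
    """Describe each dump by its upstream source label (hypothetical),
    the date it was retrieved, and its checksum."""
    return {
        path.name: {
            "source": source,
            "retrieved": retrieved,
            "sha256": sha256_of(path),
        }
        for path, source in files.items()
    }


def write_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as sorted, indented JSON so diffs stay readable."""
    out_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

For example, `build_manifest({Path("DisGeNET_diseases.txt"): "DisGeNET"}, retrieved=date.today().isoformat())` (with `from datetime import date`) would pin the dump to a checksum and a retrieval date, so a later regeneration from a newer release can at least be compared against it.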
