Problematic nodes/edges #1

@sorenwacker

Hi,

Following up on my email conversation with Adnan Malik.

I created this tool to mine the ChEBI graph, identified some problematic nodes and edges, and suggested removing some of them. Apparently the suggestions were already helpful. I am trying to find groups of ChEBI nodes that belong to the same structure (enantiomers, tautomers) and to select one representative structure for each group. To do so, I follow the is_enantiomer, is_tautomer and is_a edges in both directions (incoming and outgoing).
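
To make the grouping step concrete, here is a minimal sketch of it as a connected-components search that ignores edge direction. It is not the code from the repository: the `(source, relation, target)` triple layout, the function names, and the "lowest numeric ID" rule for picking a representative are assumptions made for illustration, and the IDs in the example are made up.

```python
from collections import defaultdict

# Edge types followed in both directions, as named in the text above.
GROUPING_RELATIONS = {"is_enantiomer", "is_tautomer", "is_a"}


def group_nodes(edges):
    """Group nodes into connected components, ignoring edge direction.

    `edges` is an iterable of (source, relation, target) triples; this
    layout is an assumption made for the example.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, relation, target in edges:
        if relation in GROUPING_RELATIONS:
            root_a, root_b = find(source), find(target)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    # Largest groups first, since those are the suspicious ones.
    return sorted(groups.values(), key=len, reverse=True)


def pick_representative(group):
    # Placeholder rule (not from the original post): lowest numeric ID.
    return min(group, key=lambda chebi_id: int(chebi_id.split(":")[1]))


# Made-up IDs, for illustration only.
example_edges = [
    ("CHEBI:2", "is_enantiomer", "CHEBI:1"),
    ("CHEBI:3", "is_tautomer", "CHEBI:2"),
    ("CHEBI:5", "is_a", "CHEBI:4"),
    ("CHEBI:6", "has_role", "CHEBI:1"),  # ignored: not a grouping relation
]
for group in group_nodes(example_edges):
    print(pick_representative(group), len(group))
```

Because direction is ignored, a single wrong is_a edge is enough to merge two otherwise unrelated groups, which is why the largest groups are worth inspecting first.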

First, I remove nodes that represent patterns of compounds rather than a single structure, indicated by SMILES strings that contain * or are missing altogether.
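
The filter can be sketched as follows. This is an illustration under assumptions, not the repository code: the mapping from ID to SMILES string, the function names, and the example IDs are hypothetical.

```python
def is_concrete_structure(smiles):
    """Return True if a SMILES string is present and has no '*' wildcard."""
    if smiles is None or smiles.strip() == "":
        return False
    return "*" not in smiles


def drop_pattern_nodes(smiles_by_id):
    """Keep only the entries that describe one concrete structure.

    `smiles_by_id` maps a node ID to its SMILES string (or None); this
    input shape is an assumption made for the example.
    """
    return {
        node_id: smiles
        for node_id, smiles in smiles_by_id.items()
        if is_concrete_structure(smiles)
    }


# Made-up IDs, for illustration only.
example = {
    "EX:1": "CCO",        # concrete structure, kept
    "EX:2": "*C(=O)O",    # contains a wildcard atom, dropped
    "EX:3": None,         # no SMILES string, dropped
    "EX:4": "",           # empty SMILES string, dropped
}
print(drop_pattern_nodes(example))
```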

The problem I am facing is that some structures are connected by an is_a link even though they should not be.
For example, I had to remove all the di- and other oligo-peptides because they would always point to the mono-peptide.

Here are the largest subgraphs that I get.

CHEBI:175256 124
CHEBI:192579 79
CHEBI:61313 65
CHEBI:33313 30
CHEBI:64961 30
CHEBI:18019 28
CHEBI:184013 28
CHEBI:86071 25
CHEBI:188921 23
CHEBI:17719 19
CHEBI:29708 17
CHEBI:15971 17
CHEBI:49140 15
CHEBI:16375 15

To avoid overwhelming you, I report only the most severe cases.

Many nodes are linked to CHEBI:23053 (catechin); see graph CHEBI:175256.

However, CHEBI:23053 is given a specific SMILES string, yet CHEBI:183094 is linked as an instance of it.

In reality, CHEBI:183094 contains CHEBI:23053 as a substructure, but CHEBI:23053 has no R group, so they are actually two different compounds.

The underlying problem is that particular instances and patterns (groups) of compounds cannot be reliably told apart in the database, which stems from linguistic ambiguity. When chemists talk, they probably know from context whether they mean a particular structure or a pattern. The database should carefully distinguish between the two: an entity should be either an instance or a group, but not both. There should also be another kind of link, perhaps 'is_derivative' or 'contains', to link peptides and di-peptides. That would be my suggestion to solve this.

The way I see it, Try-Ala is not a Try; rather, Try-Ala contains Try. Modeling it this way would make the graph easier to mine. Right now, I have to use a lot of heuristics to do it properly.
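
A small sketch of why a separate link type would help. The 'contains' relation is the proposal from this issue, not an existing ChEBI relation as far as this example is concerned, and the edge triples and the `reachable` helper are hypothetical and only mirror the Try-Ala case described above.

```python
def reachable(start, edges, relations):
    """Nodes reachable from `start` by following only the given relation types."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for source, relation, target in edges:
            if source == node and relation in relations and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


# The situation described above: the dipeptide points to Try via is_a.
current_model = [("Try-Ala", "is_a", "Try")]

# The suggested alternative: a dedicated 'contains' link (hypothetical).
proposed_model = [("Try-Ala", "contains", "Try")]

# Following is_a merges the two compounds only in the current model.
print(sorted(reachable("Try-Ala", current_model, {"is_a"})))
print(sorted(reachable("Try-Ala", proposed_model, {"is_a"})))

# The substructure relationship stays queryable through 'contains'.
print(sorted(reachable("Try-Ala", proposed_model, {"contains"})))
```

With the link types separated, a grouping pass that follows only is_a would no longer need a special-case heuristic for peptides.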

The subgraphs of the entities mentioned above are in my repository, in the analytics/Problematic-Nodes-Edges sub-folder:

https://github.com/sorenwacker/chebi-tools
