Problematic nodes/edges

Hi, 

Following up on my email conversation with Adnan Malik.

I created this tool to mine the ChEBI graph and identified some problematic nodes/edges, some of which I suggested removing. Apparently the suggestions have already been helpful. I am trying to find groups of ChEBI nodes that belong to the same structure (enantiomers, tautomers) and to select one representative structure for each group. To do so, I follow the is_enantiomer, is_tautomer and is_a edges in both directions (incoming and outgoing).
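A minimal sketch of the grouping step described above, written as plain Python rather than taken from the actual tool. It assumes the graph is available as `(source, relation, target)` triples; the function name `group_nodes` and the toy identifiers are hypothetical, and the relation names are the ones used in this issue. It does not cover how a representative structure is picked for each group.

```python
from collections import defaultdict

# Relation types treated as "same structure" links (names as used in this issue).
GROUPING_EDGES = {"is_a", "is_enantiomer", "is_tautomer"}


def group_nodes(edges, edge_types=GROUPING_EDGES):
    """Return connected components over the chosen edge types, ignoring direction.

    edges: iterable of (source_id, relation, target_id) triples.
    """
    neighbours = defaultdict(set)
    for source, relation, target in edges:
        if relation in edge_types:
            # Store both directions so incoming and outgoing edges are followed.
            neighbours[source].add(target)
            neighbours[target].add(source)

    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Iterative depth-first search collects one component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(component)
    # Largest groups first.
    return sorted(groups, key=len, reverse=True)


# Toy data with made-up identifiers, not real ChEBI entries.
example_edges = [
    ("X:1", "is_enantiomer", "X:2"),
    ("X:3", "is_a", "X:1"),
    ("X:4", "is_tautomer", "X:5"),
    ("X:6", "has_role", "X:1"),  # ignored: not a grouping relation
]
groups = group_nodes(example_edges)
```

Because is_a is followed in both directions, a single questionable is_a edge is enough to merge two otherwise unrelated groups, which is how the very large subgraphs arise.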

First, I remove nodes that represent patterns (classes) of compounds rather than specific structures, indicated by a SMILES string that contains * or by a missing SMILES string.
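The filter can be sketched as follows. This is an illustration under assumptions, not the tool's actual code: it assumes a mapping from node identifier to SMILES string, with `None` or an empty string when no structure is recorded. The name `is_concrete_structure` and the toy entries are hypothetical.

```python
def is_concrete_structure(smiles):
    """True if a SMILES string is present and contains no '*' wildcard atom."""
    return bool(smiles) and "*" not in smiles


# Toy node table with made-up identifiers: node id -> SMILES (or None if absent).
nodes = {
    "X:1": "CCO",
    "X:2": "*C(=O)O",  # wildcard atom: describes a pattern, not one compound
    "X:3": None,       # no structure recorded
    "X:4": "",         # empty string is treated as missing
}

# Keep only nodes that describe one specific structure.
concrete = {node_id for node_id, smiles in nodes.items()
            if is_concrete_structure(smiles)}
```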

The problem I am facing is that some structures are connected by an is_a link even though they should not be.
For example, I had to remove all the di- and other oligopeptides because they would always point to the mono-peptide.

Here are the largest subgraphs that I get. 

CHEBI:175256    124
CHEBI:192579     79
CHEBI:61313      65
CHEBI:33313      30
CHEBI:64961      30
CHEBI:18019      28
CHEBI:184013     28
CHEBI:86071      25
CHEBI:188921     23
CHEBI:17719      19
CHEBI:29708      17
CHEBI:15971      17
CHEBI:49140      15
CHEBI:16375      15

To avoid overwhelming you, I only report the most severe cases.


Many nodes are linked to CHEBI:23053 (catechin), in the subgraph listed above as CHEBI:175256.

![image](https://user-images.githubusercontent.com/3391614/216452338-de345ea2-b314-4d0d-af0f-cd53ed5af7c0.png)
However, catechin (CHEBI:23053) has a specific SMILES string, yet CHEBI:183094 is linked as an instance of it.

![image](https://user-images.githubusercontent.com/3391614/216452664-c65d7b7b-6300-459e-861a-45da0c2d726a.png)

In reality, CHEBI:183094 contains CHEBI:23053, but CHEBI:23053 itself has no R group. They are actually two different compounds.

The problem is that particular instances and patterns (groups) of compounds are not properly identifiable in the database, which comes from linguistic ambiguity. When chemists talk, they probably know from the context whether they mean a particular structure or a pattern. The database should carefully distinguish between those: an entity should be an instance or a group, but not both. There should also be another kind of link, maybe 'is_derivative' or 'contains', to link peptides and di-peptides. That would be my suggestion to solve this.

The way I see it, Try-Ala is not a Try, but Try-Ala contains Try. Making that distinction would make it easier to mine the graph. Right now, I have to use a lot of heuristics to do it properly.

The subgraphs of the entities listed above are in my repository, https://github.com/sorenwacker/chebi-tools, in the analytics/Problematic-Nodes-Edges sub-folder.



