Comment on the section: 3.2.4. Data distribution check #3

@massonix

Hi,

Thanks for developing this wonderful analysis workflow.

I'm running the pipeline outlined in the notebook LME_Classification.ipynb. This section has the following two plots:

image

These plots represent the distribution of the mean expression across all genes. The notebook interprets them as follows:

In the plot on the right, the expression values start off very low and then rise before dropping down. This pattern suggests potential RNA degradation, which can compromise the reliability and accuracy of downstream analyses. In contrast, the distribution plot on the left shows good-quality gene expression data. Deviations from such distributions may indicate gene degradation, should be carefully investigated and, if necessary, corrected to ensure high-quality data.

This is what my distribution looks like:

image

However, I don't understand why this should be problematic. A common pre-processing step in any RNA-seq analysis is to exclude lowly expressed genes, which do not contain enough information for robust statistical analysis. This is the plot in my R Markdown notebook where I choose the expression threshold for excluding genes:

image

which looks like the plot on the left. Thus, after filtering, all I'm left with are highly expressed, reliable genes. What does that have to do with RNA degradation?
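
To make the filtering step concrete, here is a minimal Python sketch of the kind of low-expression filter described above. It is an illustration only: the function name `filter_low_expression`, the `log2(x + 1)` transform, the threshold of 1.0, and the toy matrix are all assumptions made for this example, not code from LME_Classification.ipynb or from my R Markdown notebook.

```python
import numpy as np


def filter_low_expression(expr, min_mean_log2=1.0):
    """Drop genes whose mean log2(x + 1) expression is below a threshold.

    expr: genes x samples array of non-negative values (e.g. TPM).
    The threshold is a hypothetical choice; in practice it is read off
    the per-gene mean expression distribution.
    """
    expr = np.asarray(expr, dtype=float)
    # Per-gene mean of log-transformed expression (one value per row).
    mean_log2 = np.log2(expr + 1.0).mean(axis=1)
    # Boolean mask of genes that pass the threshold.
    keep = mean_log2 >= min_mean_log2
    return expr[keep], keep, mean_log2


# Toy matrix (made-up values): 4 genes x 3 samples.
toy = np.array([
    [0.0, 0.0, 1.0],        # lowly expressed
    [15.0, 31.0, 7.0],      # well expressed
    [0.0, 1.0, 0.0],        # lowly expressed
    [127.0, 63.0, 255.0],   # well expressed
])

filtered, keep, mean_log2 = filter_low_expression(toy, min_mean_log2=1.0)
print("mean log2 expression per gene:", mean_log2)
print("genes kept:", keep)
```

A histogram of `mean_log2` before and after applying `keep` is what I am comparing against the two plots in the notebook: the low-expression peak disappears once the mask is applied.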

If you could explain it, that would be super useful.

Thanks!

Ramon
