Bustools with sci-RNA-seq3 #80

@rmg2213

Hi, I am pretty new to kallisto/bustools, so thank you in advance for your help. I followed the directions from this issue/Google Colab: Issue #75

I ran kallisto bus against the human reference downloaded from https://github.com/pachterlab/kallisto-transcriptome-indices/releases. The log file looks something like this:

[index] k-mer length: 31
[index] number of targets: 188,753
[index] number of k-mers: 109,544,288
[index] number of equivalence classes: 760,757
[quant] will process sample 1: R1_mod/12BH02_S96_R1_001_mod.fastq.gz
                               output_fastq/12BH02_S96_R2_001.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 4,927,842 reads, 2,397,115 reads pseudoaligned

Here is a run_info.json example from one of the samples of our run:

	"n_targets": 0,
	"n_bootstraps": 0,
	"n_processed": 4927842,
	"n_pseudoaligned": 2360469,
	"n_unique": 1085578,
	"p_pseudoaligned": 47.9,
	"p_unique": 22.0,
	"kallisto_version": "0.46.2",
	"index_version": 0,
	"start_time": "Tue Jul 12 13:25:09 2022",
	"call": "kallisto/build/src/kallisto bus -i sci-RNA-seq3/reference/transcriptome.idx -x SciRnaSeq -t 2 -o bus_output/12BH02_S96 R1_mod/12BH02_S96_R1_001_mod.fastq.gz output_fastq/12BH02_S96_R2_001.fastq.gz"
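
As a quick sanity check on the numbers above, the percentages can be recomputed from the raw counts. This is only a minimal sketch: the values are copied by hand from the log and the run_info.json excerpt, and the helper name `percent` is made up for illustration. It shows that p_pseudoaligned and p_unique are consistent with the counts in run_info.json, and that the log reports a somewhat higher pseudoaligned count than the JSON does.

```python
# Counts copied by hand from the log and run_info.json excerpts above.
n_processed = 4_927_842          # same value in the log and in run_info.json
log_pseudoaligned = 2_397_115    # from the "[quant] processed ..." log line
json_pseudoaligned = 2_360_469   # run_info.json "n_pseudoaligned"
json_unique = 1_085_578          # run_info.json "n_unique"


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


log_rate = percent(log_pseudoaligned, n_processed)
json_rate = percent(json_pseudoaligned, n_processed)
unique_rate = percent(json_unique, n_processed)
gap = log_pseudoaligned - json_pseudoaligned

print(f"log:  {log_rate}% pseudoaligned")
print(f"json: {json_rate}% pseudoaligned, {unique_rate}% unique")
print(f"reads counted in the log but not in run_info.json: {gap}")
```

The JSON-derived rates come out to 47.9 and 22.0, matching the reported p_pseudoaligned and p_unique, so the discrepancy is in the pseudoaligned count itself (a gap of 36,646 reads) and not in how the percentages are rounded.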

My first question is: why don't some values in run_info.json (n_targets, n_pseudoaligned) match the log? Second, I was hoping to get some insight into why we might be getting such low pseudoalignment. We even tried building a new index with kb ref using the attributes below, and our p_pseudoaligned was still only 55%:

kb ref -i $REFERENCE_DIR/kbref/include_attribute/h_index.idx \
-g $REFERENCE_DIR/kbref/include_attribute/h_t2g.txt \
-f1 $REFERENCE_DIR/kbref/include_attribute/cdna.fa \
-f2 $REFERENCE_DIR/kbref/include_attribute/intron.fa \
-c1 $REFERENCE_DIR/kbref/include_attribute/cdna_t2c.txt \
-c2 $REFERENCE_DIR/kbref/include_attribute/intron_t2c.txt \
--workflow lamanno \
$REFERENCE_DIR/ensembl_107/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz $REFERENCE_DIR/ensembl_107/Homo_sapiens.GRCh38.107.gtf.gz \
--include-attribute gene_biotype:protein_coding \
--include-attribute gene_biotype:lincRNA \
--include-attribute gene_biotype:antisense \
--include-attribute gene_biotype:IG_LV_gene \
--include-attribute gene_biotype:IG_V_gene \
--include-attribute gene_biotype:IG_V_pseudogene \
--include-attribute gene_biotype:IG_D_gene \
--include-attribute gene_biotype:IG_J_gene \
--include-attribute gene_biotype:IG_J_pseudogene \
--include-attribute gene_biotype:IG_C_gene \
--include-attribute gene_biotype:IG_C_pseudogene \
--include-attribute gene_biotype:TR_V_gene \
--include-attribute gene_biotype:TR_V_pseudogene \
--include-attribute gene_biotype:TR_D_gene \
--include-attribute gene_biotype:TR_J_gene \
--include-attribute gene_biotype:TR_J_pseudogene \
--include-attribute gene_biotype:TR_C_gene
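
For anyone reproducing this, the long run of `--include-attribute` flags above can be generated from a list instead of typed out by hand. This is a minimal sketch, not part of kb-python: the helper name `include_attribute_flags` is hypothetical, and the biotype list is simply copied from the command above.

```python
import shlex

# Biotypes copied verbatim from the kb ref command above.
BIOTYPES = [
    "protein_coding", "lincRNA", "antisense",
    "IG_LV_gene", "IG_V_gene", "IG_V_pseudogene",
    "IG_D_gene", "IG_J_gene", "IG_J_pseudogene",
    "IG_C_gene", "IG_C_pseudogene",
    "TR_V_gene", "TR_V_pseudogene", "TR_D_gene",
    "TR_J_gene", "TR_J_pseudogene", "TR_C_gene",
]


def include_attribute_flags(biotypes):
    """Hypothetical helper: one '--include-attribute gene_biotype:X' pair per biotype."""
    flags = []
    for biotype in biotypes:
        flags += ["--include-attribute", f"gene_biotype:{biotype}"]
    return flags


flags = include_attribute_flags(BIOTYPES)

# Print one flag/value pair per line, with shell line continuations between them.
pairs = [shlex.join(flags[i:i + 2]) for i in range(0, len(flags), 2)]
print(" \\\n".join(pairs))
```

The printed lines can be pasted after the positional FASTA and GTF arguments, which keeps the biotype list in one place when comparing indices built with different filters.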

Thanks again for your help!
