
New Kallisto-0.48.0, usage of -x BDWTA, outputting too many barcodes #77

@eyscott

Hi,
I am using kallisto 0.48.0 together with bustools 0.41.0 to demultiplex my BD Rhapsody WTA data and obtain gene count tables. This is my initial kallisto bus script:
kallisto bus --index ./mus_musculus/transcriptome.idx -o /${f} --technology=BDWTA --threads=16 --fr-stranded ${f}_R1.fastq ${f}_R2.fastq -g /mus_musculus/Mus_musculus.GRCm38.96.gtf
Example result:
[index] k-mer length: 31
[index] number of targets: 118,489
[index] number of k-mers: 100,614,952
[index] number of equivalence classes: 433,624
[quant] will process sample 1: control_R1.fastq
control_R2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 289,230,676 reads, 224,235,182 reads pseudoaligned
From there I sorted my .bus file and tried to generate a count table:
bustools sort -o sorted.bus output.bus
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o counts sorted.bus
This gave me an odd matrix with dimensions: 13 18494348
From there I decided to correct the .bus file with bustools correct. I didn't see any whitelist for BDWTA data, so I generated my own whitelist for each dataset (the corrected file is sorted afterwards):
bustools whitelist -o control_whitelist output.bus
Example results:
Read in 102086448 BUS records, wrote 232194 barcodes to whitelist with threshold 61
bustools correct -o corr_control.bus --whitelist control_whitelist output.bus
Example results:
Found 232194 barcodes in the whitelist
Processed 224235182 BUS records
In whitelist = 176801187
Corrected = 5916173
Uncorrected = 41517822
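As a quick sanity check on the numbers above, the three categories reported by bustools correct should add up to the number of processed records. A minimal Python sketch, using only the figures copied from the logs above (the variable and function names are my own):

```python
# Figures copied from the kallisto bus and bustools correct logs above.
reads_processed = 289_230_676
reads_pseudoaligned = 224_235_182

in_whitelist = 176_801_187
corrected = 5_916_173
uncorrected = 41_517_822

total_records = in_whitelist + corrected + uncorrected
# The three categories should account for every processed BUS record.
assert total_records == reads_pseudoaligned


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


summary = {
    "pseudoaligned": pct(reads_pseudoaligned, reads_processed),
    "in_whitelist": pct(in_whitelist, total_records),
    "corrected": pct(corrected, total_records),
    "uncorrected": pct(uncorrected, total_records),
}
for name, value in summary.items():
    print(f"{name}: {value}%")
```

The totals are consistent, and roughly 18.5% of the records carry barcodes that were neither in my whitelist nor corrected to an entry in it.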
Then I sorted the corrected .bus file:
bustools sort -o sorted_corr_control.bus corr_control.bus
and ran bustools count:
bustools count --genecounts -g /mus_musculus/transcripts_to_genes.txt -t transcripts.txt -e matrix.ec -o control_counts sorted_corr_control.bus
I now have a matrix with more reasonable dimensions: 16632 9838 (with 9838 barcodes detected), but I am expecting to see ~2,500 unique barcodes per sample. Across my 4 samples I am actually seeing anywhere from ~2,500 to ~10,000 barcodes per sample. Is there a mistake in how I am generating the whitelist? Is there already a built-in whitelist for the BDWTA data?
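To see how many of the detected barcodes actually carry a meaningful number of counts, something like the following could be used on the Matrix Market file written by bustools count. This is only a sketch: the helper names `barcode_totals` and `barcodes_above` are hypothetical, the `min_count` threshold and the toy matrix are made up for illustration, and it assumes barcodes are the rows of the matrix (pass `barcode_axis=1` if they are the columns).

```python
from collections import Counter


def barcode_totals(mtx_lines, barcode_axis=0):
    """Sum counts per barcode from Matrix Market coordinate lines.

    barcode_axis=0 assumes barcodes are rows (an assumption; use 1 if
    barcodes are the columns in your file).
    """
    totals = Counter()
    size_line_seen = False
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner, comments, and blank lines
        if not size_line_seen:
            size_line_seen = True  # first data line is "rows cols nnz"
            continue
        fields = line.split()
        totals[int(fields[barcode_axis])] += float(fields[2])
    return totals


def barcodes_above(totals, min_count):
    """Return the sorted barcode indices whose total count is >= min_count."""
    return sorted(i for i, t in totals.items() if t >= min_count)


# Toy matrix (made-up data): 3 barcodes x 2 genes.
toy = [
    "%%MatrixMarket matrix coordinate real general",
    "3 2 4",
    "1 1 120",
    "1 2 80",
    "2 1 3",
    "3 2 1",
]
totals = barcode_totals(toy)
kept = barcodes_above(totals, min_count=100)  # hypothetical threshold
print(len(totals), "barcodes in total,", len(kept), "above threshold")
```

For real data, an open file handle on the .mtx output can be passed in place of `toy`, since the function only iterates over lines.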
Thank you for your time!
