@stu2 stu2 commented Dec 9, 2020

Got rid of the 3rd-line info in FASTQ files, as it is a waste of space and was not formatted correctly anyway (it retained the '@' from the read name).

serine added 30 commits March 6, 2018 18:39
…hat recursively

makes directories if the output file is a path of nested directories.
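A minimal sketch of how parent directories of an output path could be created recursively, assuming a POSIX system. The name `make_parent_dirs` and the 0755 mode are illustrative assumptions, not taken from the commit.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (not from the commit): create every missing parent
 * directory of `path`. The last path component is treated as the output
 * file name and is not created. Returns 0 on success, -1 on failure. */
int make_parent_dirs(const char *path)
{
    char buf[4096];
    size_t len;

    assert(path != NULL);
    len = strlen(path);
    if (len == 0 || len >= sizeof(buf)) {
        errno = EINVAL;
        return -1;
    }
    memcpy(buf, path, len + 1);

    /* Start at 1 so a leading '/' is not treated as a directory to create. */
    for (size_t i = 1; i < len; i++) {
        if (buf[i] != '/')
            continue;
        buf[i] = '\0';                      /* temporarily cut the path here */
        if (mkdir(buf, 0755) != 0 && errno != EEXIST)
            return -1;                      /* a real failure, e.g. EACCES */
        buf[i] = '/';                       /* restore and keep walking */
    }
    return 0;
}
```

Calling `make_parent_dirs("out/run1/sample.fq")` before opening the file would create `out/` and `out/run1/` if they are missing; directories that already exist are skipped.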

Also added a few new command line options:

 --min-umi-len allows filtering out reads whose UMIs are too short

 --stats lets the user specify a file instead of writing the stats to stdout;
this could be slightly broken if the user filters reads out with --min-umi-len,
needs more checking.

 --no-comment which I'm not sure is a good idea/needed.
I needed it for compatibility with downstream tools;
basically this strips anything from the FASTQ header that is delimited by a space,
i.e. everything to the right of the whitespace is regarded as a comment
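As an illustration of what --no-comment describes, here is a small sketch that truncates a FASTQ header at the first whitespace. The function name `strip_header_comment` is hypothetical, and treating tabs as whitespace too is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not from the commit): cut a FASTQ header in place at
 * the first space or tab, so only the read name remains.
 * Returns the new length of the header. */
size_t strip_header_comment(char *header)
{
    size_t n;

    assert(header != NULL);
    n = strcspn(header, " \t");   /* length of the prefix with no whitespace */
    header[n] = '\0';
    return n;
}
```

For a header such as `@read1 1:N:0:ACGT` this leaves `@read1`; a header without any whitespace is left unchanged.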
changed tabs to spaces, made tabs = 4 spaces instead of 8
reindented the code :| universe please forgive me...
Changed strncmp_with_mismatch function to now also take an extra param
--max_5prime_crop that will attempt to trim bases, one at a time, from the 5 prime end of the read;
this should improve assignment rates if there is a short overhang, like
one or two bases. I should also set a TODO to assert that
max_5prime_crop <= strlen(read) and/or strlen(barcode)
…s a little

It now takes the barcode and the fastq read, the allowed mismatches and the max 5 prime crop,
which is the maximum number of bases to attempt to crop from the 5 prime end of the
read in order to assign a barcode.
added new column of total percent of the library for each barcode
one command (no sub-commands, i.e. se or pe); instead there is a -m, --mode flag
with several different mode options, all described in docs/mode.md
of week... From the diff and distant memory, removed single_end menu
options as I'm moving in a slightly different direction. One option
is to have a "mode" where single-end is just one of the modes.
This function no longer returns greater than, equal to or less than zero.
Instead it simply looks for a barcode match, still allows 5 prime
cropping, and on a match returns the number of bases cropped, as this is important
for downstream analysis in getting the "actual barcode" sequence. Also
remember that we are allowing mismatches in the barcode and perhaps we
want that info later in the analysis somewhere.
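The matching behaviour described above could be sketched roughly as follows. This is not the project's strncmp_with_mismatch; the name `match_barcode`, the -1 return for "no match", and the choice to compare the full barcode against the read at each crop offset are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Count mismatching positions over the first n characters of a and b. */
static int count_mismatches(const char *a, const char *b, size_t n)
{
    int mm = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            mm++;
    return mm;
}

/* Hypothetical sketch: try to find `barcode` at the 5' end of `read`,
 * skipping 0..max_5prime_crop leading read bases and tolerating up to
 * max_mismatch substitutions. Returns the number of bases cropped on a
 * match, or -1 if no offset matches. */
int match_barcode(const char *barcode, const char *read,
                  int max_mismatch, int max_5prime_crop)
{
    size_t blen, rlen;

    assert(barcode != NULL && read != NULL);
    assert(max_mismatch >= 0 && max_5prime_crop >= 0);
    blen = strlen(barcode);
    rlen = strlen(read);

    for (int crop = 0; crop <= max_5prime_crop; crop++) {
        if ((size_t)crop + blen > rlen)
            break;                          /* barcode no longer fits in the read */
        if (count_mismatches(barcode, read + crop, blen) <= max_mismatch)
            return crop;                    /* smallest crop that matches */
    }
    return -1;
}
```

The bounds check inside the loop also covers the earlier TODO about asserting that the crop does not exceed the read length. Returning the crop count lets the caller recover the "actual barcode" as the `blen` bases starting at `read + crop`.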

The other two important functions are get_fqread and get_merged_fqread
both return a string that you can then write out.
Originally I thought to just have get_fqread that then would take a mode
and write "correct" fq string out.

The idea behind the get_merged_fqread function: in the case where R1 is just
the barcode and R2 is just the read, we don't want to write out R1
since it holds no information, so simply merge the two into one, appending
the barcode info to the R2 header
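To make the merge concrete, here is a sketch of building one FASTQ record from R2 with the R1 barcode appended to the read name. It is not the project's get_merged_fqread: the function name `merged_fq_record`, its parameters, and the `_` separator between read name and barcode are all assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: return a newly allocated FASTQ record made of the R2
 * name with "_<barcode>" appended, the R2 sequence, a bare '+' line and the
 * R2 qualities. `r2_name` is expected to include the leading '@'.
 * The caller must free() the result; NULL is returned if malloc fails. */
char *merged_fq_record(const char *r2_name, const char *barcode,
                       const char *r2_seq, const char *r2_qual)
{
    size_t n;
    char *out;

    assert(r2_name && barcode && r2_seq && r2_qual);
    assert(strlen(r2_seq) == strlen(r2_qual));

    /* name + '_' + barcode + '\n' + seq + "\n+\n" + qual + '\n' + NUL */
    n = strlen(r2_name) + 1 + strlen(barcode) + 1
      + strlen(r2_seq) + 3 + strlen(r2_qual) + 1 + 1;

    out = malloc(n);
    if (out == NULL)
        return NULL;
    snprintf(out, n, "%s_%s\n%s\n+\n%s\n", r2_name, barcode, r2_seq, r2_qual);
    return out;
}
```

The third line is written as a bare `+`, which matches the PR description's removal of the 3rd-line info.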
i.e. removing n bases from the back of the umi, where n = max-5prime-crop value
Also added a redirect to a file umis_too_short.txt
that now holds the names of reads that were discarded for being too short
e.g. the number of different barcodes in a demultiplexed fastq
given that one mismatch was allowed
also want similar metrics for umis
that returns unsorted table of barcodes and counts
ought to be that length; reads with shorter umis
will be thrown away, reads which have longer umis will be
kept, but the umi barcode will be trimmed back to min-umi-len
to make all umis of uniform length, otherwise downstream
analysis breaks - can't deduplicate if umis are of variable length
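The UMI length rule above (discard shorter, trim longer) can be sketched as a single in-place check. The function name `normalize_umi` and its return convention are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sketch: enforce a uniform UMI length.
 * Returns 0 if the UMI is shorter than min_umi_len (the read should be
 * discarded), otherwise trims the UMI in place to exactly min_umi_len
 * bases and returns 1. */
int normalize_umi(char *umi, size_t min_umi_len)
{
    assert(umi != NULL);
    if (strlen(umi) < min_umi_len)
        return 0;                 /* too short: caller drops the read */
    umi[min_umi_len] = '\0';      /* longer UMIs are cut back; equal length is a no-op */
    return 1;
}
```

With `min_umi_len = 4`, `ACG` is rejected, `ACGT` is kept as is, and `ACGTAC` is kept as `ACGT`, so every surviving UMI has the same length for deduplication.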
now can build dev version with make dev
I was removing one extra base from the end of the barcode

Also refactored the code for better readability and removed unwanted comments
serine and others added 28 commits August 21, 2018 08:40
sample barcodes OR metrics on umis.
The implementation isn't robust, and there is some code duplication
which will need to be refactored later on.
Also, because there are many more umis (I've seen up to 280k unique tags
per sample), better memory management and smarter search strategies will
be needed
this won't compile but I need to checkpoint the code.
I'm scared to run gcc -Wall (0_0) ...
- dropping kseq.h header, can't multithread with that
- wrote fastq parse - struct
- plugged that in
- updated Makefile
gzipping by itself is slow as well and it is hard to tell what's better:
gzipping as I go or using the standalone pigz tool? I think the concern
was that gzipping worked better once the file stopped changing?

- also in this commit some minor bug fixes
valgrind --leak-check=yes --track-origins=yes -v

to get a dev version use make clean-all && make dev
this causes issues with conda packaging
at this stage this is more than appropriate
Error on missing barcode file
- Tries all barcodes up to max crop and max mismatch
- Uses barcode that matches with fewest crop+mismatch
- Uses popen() to run either pigz or gzip
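The "fewest crop+mismatch" selection from the list above might look like the sketch below. The function name `best_barcode`, its signature, and the tie-breaking rule (the first barcode with the lowest cost wins) are assumptions, not the project's implementation.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

/* Hypothetical sketch: try every barcode at every crop offset up to max_crop,
 * accept candidates with at most max_mismatch substitutions, and return the
 * index of the barcode with the smallest crop + mismatch total.
 * Returns -1 if no barcode matches. Ties keep the earliest barcode. */
int best_barcode(const char **barcodes, size_t n_barcodes, const char *read,
                 int max_mismatch, int max_crop)
{
    size_t rlen;
    int best = -1;
    int best_cost = INT_MAX;

    assert(barcodes != NULL && read != NULL);
    rlen = strlen(read);

    for (size_t b = 0; b < n_barcodes; b++) {
        size_t blen = strlen(barcodes[b]);
        for (int crop = 0; crop <= max_crop; crop++) {
            int mm = 0;
            if ((size_t)crop + blen > rlen)
                break;                       /* barcode no longer fits */
            for (size_t i = 0; i < blen; i++)
                if (barcodes[b][i] != read[crop + i])
                    mm++;
            if (mm > max_mismatch)
                continue;                    /* too many mismatches at this offset */
            if (crop + mm < best_cost) {     /* strictly better, so ties keep the first */
                best_cost = crop + mm;
                best = (int)b;
            }
        }
    }
    return best;
}
```

Compared with returning the first barcode that fits within the limits, scoring every candidate avoids assigning a read to a barcode with one mismatch when another barcode matches exactly.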
Find best matching barcode
Add option to compress (gzip) output files
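One way the popen() approach to compressed output could work is sketched below, assuming a POSIX system with gzip (or pigz) on the PATH. The function name `open_gz_writer`, the `use_pigz` switch and the shell command layout are assumptions for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: open a write stream whose contents are compressed by
 * an external pigz or gzip process and written to `path`.
 * Close the stream with pclose(), not fclose().
 * Limitation: `path` is placed inside single quotes in a shell command, so
 * paths that themselves contain a single quote are rejected. */
FILE *open_gz_writer(const char *path, int use_pigz)
{
    char cmd[4096];
    int n;

    assert(path != NULL);
    if (strchr(path, '\'') != NULL)
        return NULL;                          /* cannot be quoted safely here */

    n = snprintf(cmd, sizeof(cmd), "%s -c > '%s'",
                 use_pigz ? "pigz" : "gzip", path);
    if (n < 0 || (size_t)n >= sizeof(cmd))
        return NULL;                          /* command would be truncated */

    return popen(cmd, "w");                   /* our writes become the tool's stdin */
}
```

Records are then written with the usual `fputs`/`fprintf` calls, and `pclose()` waits for the compressor to finish and returns its exit status, which is worth checking since a missing `pigz` binary only shows up there.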

3 participants