Fix unassigned fastq files #17
Open
stu2 wants to merge 58 commits into najoshi:master from serine:master
…hat recursively makes directories if the output file is a path of nested directories. Also added a few new command line options: --min-umi-len, which filters out reads whose UMIs are too short; --stats, which lets the user specify a file instead of writing to stdout (this could be slightly broken if the user filters reads out with --min-umi-len, needs more checking); and --no-comment, which I'm not sure is a good idea or needed. I needed it for compatibility with downstream tools. Basically it strips everything in the FASTQ header that follows a space, i.e. everything to the right of the whitespace is regarded as a comment
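A minimal sketch of what the --no-comment behaviour described above could look like. The helper name `strip_header_comment` and its signature are hypothetical, not taken from the PR, and treating a tab as whitespace alongside a space is an assumption.

```c
#include <string.h>

/* Hypothetical helper (name and signature are assumptions, not from the PR).
 * Truncates a FASTQ header in place at the first space or tab, so that
 * "@read1 1:N:0:ACGT" becomes "@read1". Headers without whitespace are
 * left unchanged. */
void strip_header_comment(char *header)
{
    /* strcspn returns the length of the leading run that contains
     * neither a space nor a tab; writing '\0' there cuts the comment. */
    header[strcspn(header, " \t")] = '\0';
}
```

The header is modified in place, so a caller that still needs the comment would have to copy it first.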
changed tabs to spaces, made tabs = 4 spaces instead of 8, reindented the code :| universe please forgive me...
Changed the strncmp_with_mismatch function to also take an extra param, --max_5prime_crop, that will attempt to trim bases one at a time from the 5 prime end of the read. This should improve assignment rates if there is a short overhang of one or two bases. I should also set a TODO to assert that max_5prime_crop <= strlen(read) and/or strlen(barcode)
…s a little. It now takes the barcode and the fastq read, the allowed mismatches and the max 5prime crop, which is the maximum number of bases to attempt to crop from the 5 prime end of the read in order to find a barcode assignment.
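A rough sketch of barcode matching with mismatches and 5 prime cropping, as described in the two commits above. It is not the PR's strncmp_with_mismatch: the names `count_mismatches` and `match_barcode`, the return convention (crop count on a match, -1 otherwise) and the choice to stop at the first crop that matches are all assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Count positions where barcode and seq differ over the first n bases. */
static size_t count_mismatches(const char *barcode, const char *seq, size_t n)
{
    size_t mm = 0;
    for (size_t i = 0; i < n; i++)
        if (barcode[i] != seq[i])
            mm++;
    return mm;
}

/* Hypothetical matcher (an assumption, not the PR's function).
 * Tries crop = 0, 1, ..., max_crop bases off the 5 prime end of the read
 * and returns the first crop at which the barcode matches with at most
 * max_mismatch mismatches. Returns -1 if no crop gives a match. */
int match_barcode(const char *barcode, const char *read,
                  size_t max_mismatch, size_t max_crop)
{
    size_t blen = strlen(barcode);
    size_t rlen = strlen(read);

    for (size_t crop = 0; crop <= max_crop; crop++) {
        if (crop + blen > rlen)
            break; /* not enough read left to hold the whole barcode */
        if (count_mismatches(barcode, read + crop, blen) <= max_mismatch)
            return (int)crop;
    }
    return -1;
}
```

The length check inside the loop plays the role of the TODO about asserting that the crop does not exceed the read length.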
added a new column showing each barcode's percentage of the total library
one command (no subcommands, i.e. se or pe); instead there is a -m, --mode flag with several different mode options, all described in docs/mode.md
of week... From the diff and distant memory: removed the single_end menu options as I'm moving in a slightly different direction. One option is to have a "mode" where single-end is just one of the modes.
This function no longer returns greater than, equal to or less than zero. Instead it simply looks for a barcode match, still allowing 5 prime cropping, and on a match returns the number of bases cropped, as this is important for downstream analysis in getting the "actual barcode" sequence. Also remember that we are allowing mismatches in the barcode, and perhaps we want that info later in the analysis somewhere. The other two important functions are get_fqread and get_merged_fqread; both return a string that you can then write out. Originally I thought to just have get_fqread, which would then take a mode and write the "correct" fq string out. The idea behind the get_merged_fqread function: in the case where R1 is just the barcode and R2 is just the read, we don't want to write out R1 since it holds no information, so we simply merge the two into one, appending the barcode info to the R2 header
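To illustrate the merged-record idea above, here is a small sketch that builds one FASTQ record from R2 with the barcode appended to its header. The function name `merged_fqread_sketch`, its parameters and the `name:barcode` header layout are assumptions; the PR's get_merged_fqread may format the header differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch (signature and header layout are assumptions).
 * Builds a single FASTQ record from the R2 read, appending the barcode
 * taken from R1 to the R2 read name. Returns a heap-allocated string
 * that the caller must free(), or NULL on failure. */
char *merged_fqread_sketch(const char *r2_name, const char *barcode,
                           const char *r2_seq, const char *r2_qual)
{
    const char *fmt = "@%s:%s\n%s\n+\n%s\n";

    /* First pass: ask snprintf how many bytes the record needs. */
    int n = snprintf(NULL, 0, fmt, r2_name, barcode, r2_seq, r2_qual);
    if (n < 0)
        return NULL;

    char *out = malloc((size_t)n + 1);
    if (out == NULL)
        return NULL;

    /* Second pass: write the record into the exactly sized buffer. */
    snprintf(out, (size_t)n + 1, fmt, r2_name, barcode, r2_seq, r2_qual);
    return out;
}
```

Returning a string rather than writing directly keeps the choice of output stream (plain file or compression pipe) with the caller.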
i.e. removing n bases from the back of the UMI, where n = max-5prime-crop value
Also added a redirect to a file, umis_too_short.txt, that now holds the names of reads that were discarded because their UMIs were too short
e.g. the number of different barcodes in a demultiplexed fastq given that one mismatch was allowed; also want similar metrics for UMIs
that returns an unsorted table of barcodes and counts
ought to be that length: reads with shorter UMIs will be thrown away, reads with longer UMIs will be kept, but the UMI barcode will be trimmed back to min-umi-len to make all UMIs a uniform length. Otherwise downstream analysis breaks, since you can't deduplicate if UMIs are of variable length
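The rule above (discard short UMIs, trim long ones back to min-umi-len) can be sketched roughly as follows. The function name `normalize_umi` and its return convention are hypothetical, not taken from the PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (name and return convention are assumptions).
 * Returns 0 if the UMI is shorter than min_umi_len, meaning the read
 * should be discarded. Otherwise trims the UMI in place to exactly
 * min_umi_len bases and returns 1, so every kept UMI has the same length. */
int normalize_umi(char *umi, size_t min_umi_len)
{
    if (strlen(umi) < min_umi_len)
        return 0;            /* too short: caller drops the read */
    umi[min_umi_len] = '\0'; /* longer or equal: cut back to min length */
    return 1;
}
```

A caller could log the read name to a file such as umis_too_short.txt whenever the function returns 0.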
can now build a dev version with make dev
I was removing one extra base from the end of the barcode. Also refactored code for better readability and removed unwanted comments
sample barcodes OR metrics on UMIs. The implementation isn't robust, and there is some code duplication which will need to be refactored later on. Also, because there are many more UMIs (I've seen up to 280k unique tags per sample), better memory management and smarter search strategies will be needed
this won't compile but I need to checkpoint the code. I'm scared to run gcc -Wall (0_0) ...
- dropping kseq.h header, can't multithread with that
- wrote fastq parse
- struct
- plugged that in
- updated Makefile
gzipping by itself is slow as well, and it is hard to tell what's better: gzipping as I go or using the standalone pigz tool? I think the concern was that gzipping worked better once the file stopped changing? - also in this commit some minor bug fixes
valgrind --leak-check=yes --track-origins=yes -v; to get a dev version use make clean-all && make dev
this causes issues with conda packaging
at this stage this is more than appropriate
Error on missing barcode file
Update sabre.c
- Tries all barcodes up to max crop and max mismatch
- Uses barcode that matches with fewest crop+mismatch
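A sketch of the best-match search described above, under stated assumptions: the function name `best_barcode`, returning the barcode's index, and breaking ties in favour of the first barcode found are illustrative choices, not taken from the PR.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch (name, return value and tie-breaking are assumptions).
 * Tries every barcode at every crop from 0 to max_crop and keeps the
 * candidate with the smallest crop + mismatch score among those with at
 * most max_mismatch mismatches. Returns the index of the best barcode,
 * or -1 if none matches. Ties keep the earliest candidate. */
int best_barcode(const char *barcodes[], size_t n_barcodes, const char *read,
                 size_t max_mismatch, size_t max_crop)
{
    size_t rlen = strlen(read);
    size_t best_score = 0;
    int best = -1;

    for (size_t b = 0; b < n_barcodes; b++) {
        size_t blen = strlen(barcodes[b]);
        /* Stop cropping once the barcode no longer fits in the read. */
        for (size_t crop = 0; crop <= max_crop && crop + blen <= rlen; crop++) {
            size_t mm = 0;
            for (size_t i = 0; i < blen; i++)
                if (barcodes[b][i] != read[crop + i])
                    mm++;
            if (mm > max_mismatch)
                continue; /* too many mismatches at this crop */
            size_t score = crop + mm;
            if (best < 0 || score < best_score) {
                best = (int)b;
                best_score = score;
            }
        }
    }
    return best;
}
```

Compared with stopping at the first barcode that matches, scanning all candidates avoids assigning a read to a barcode that needs a mismatch when another barcode matches exactly.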
- Uses popen() to run either pigz or gzip
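One way the popen() approach above might look, as a sketch only. The helper names `build_compress_cmd` and `open_compressed`, the `use_pigz` switch and the shell command layout are assumptions; the PR may pick between pigz and gzip differently. popen() is POSIX, hence the feature-test macro.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Hypothetical helper: writes a shell command such as
 * "gzip -c > 'out.fq.gz'" into buf. Returns 0 on success, -1 if the
 * command does not fit. Paths containing a single quote are not handled. */
int build_compress_cmd(char *buf, size_t buflen, const char *path, int use_pigz)
{
    int n = snprintf(buf, buflen, "%s -c > '%s'",
                     use_pigz ? "pigz" : "gzip", path);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: opens a write pipe to the compressor. Anything
 * written to the returned stream is compressed into path. The stream
 * must be closed with pclose(), not fclose(). Returns NULL on failure. */
FILE *open_compressed(const char *path, int use_pigz)
{
    char cmd[4096];
    if (build_compress_cmd(cmd, sizeof cmd, path, use_pigz) != 0)
        return NULL;
    return popen(cmd, "w");
}
```

Piping to an external compressor keeps compression in a separate process, so the demultiplexing loop only pays for writing to a pipe.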
Find best matching barcode
Add option to compress (gzip) output files
to be "random" if statement inside switch
got rid of the 3rd-line info in fastq files as it is a waste of space and was not formatted correctly anyway (it retained the '@' on the read name)