Fix unassigned fastq files #17

stu2 · 2020-12-09T03:05:59Z

Got rid of the third-line info in FASTQ files, as it is a waste of space and was not formatted correctly anyway (it retained the '@' of the read name).

…hat recursively makes directories if the output file is a path of nested directories. Also added a few new command line options:
- --min-umi-len filters out reads whose UMIs are too short.
- --stats lets the user specify a file instead of writing stats to stdout. This could be slightly broken if the user filters reads out with --min-umi-len; needs more checking.
- --no-comment, which I'm not sure is a good idea or needed. I needed it for compatibility with downstream tools. It strips anything in the FASTQ header after the first space, i.e. everything to the right of the whitespace is regarded as a comment.
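The header stripping described for --no-comment could look roughly like the sketch below. The function name `strip_header_comment` is hypothetical and not taken from the sabre source, and treating a tab as a delimiter alongside a space is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the sabre source): truncate a FASTQ
 * header at the first space or tab so only the read name survives.
 * Modifies the buffer in place and returns the new length. */
size_t strip_header_comment(char *header)
{
    assert(header != NULL);
    /* strcspn returns the length of the prefix containing no delimiter. */
    size_t name_len = strcspn(header, " \t");
    header[name_len] = '\0';
    return name_len;
}
```

For a header such as `@read1 1:N:0:ATCACG` this leaves `@read1`; a header without whitespace is left unchanged.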

Changed the strncmp_with_mismatch function to also take an extra param, --max_5prime_crop, which attempts to trim bases one at a time from the 5 prime end of the read. This should improve assignment rates if there is a short overhang of one or two bases. TODO: assert that max_5prime_crop <= strlen(read) and/or strlen(barcode).

This function no longer returns a value greater than, equal to or less than zero. Instead it simply looks for a barcode match, still allowing 5 prime cropping, and on a match returns the number of bases cropped, as this is important for downstream analysis in getting the "actual barcode" sequence. Also remember that we allow mismatches in the barcode, and we may want that info later in the analysis. The other two important functions are get_fqread and get_merged_fqread; both return a string that you can then write out. Originally I thought to have just get_fqread, which would take a mode and write the "correct" fq string out. The idea behind get_merged_fqread: when R1 is just the barcode and R2 is just the read, we don't want to write out R1 since it holds no information, so the two are simply merged into one, appending the barcode info to the R2 header.
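A minimal sketch of the matching behaviour described above, under stated assumptions: the name `match_with_crop`, its signature, and the return value of -1 for "no match" are illustrative and not the actual sabre code. It assumes cropping means skipping bases at the 5 prime end of the read before comparing against the barcode.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count mismatching positions over the first n characters. */
static int count_mismatches(const char *a, const char *b, size_t n)
{
    int mm = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] != b[i])
            mm++;
    return mm;
}

/* Hypothetical sketch: try cropping 0..max_crop bases from the 5' end
 * of the read and return the first crop at which the barcode matches
 * with at most max_mismatch mismatches. Returns -1 if nothing matches. */
int match_with_crop(const char *barcode, const char *read,
                    int max_mismatch, int max_crop)
{
    assert(barcode != NULL && read != NULL);
    size_t bc_len = strlen(barcode);
    size_t read_len = strlen(read);

    for (int crop = 0; crop <= max_crop; crop++) {
        /* Stop once the cropped read is shorter than the barcode. */
        if ((size_t)crop + bc_len > read_len)
            break;
        if (count_mismatches(barcode, read + crop, bc_len) <= max_mismatch)
            return crop;
    }
    return -1;
}
```

With barcode `ACGT` and read `GACGTTTT`, a max crop of 1 yields a return value of 1, which tells the caller where the actual barcode starts in the read.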

are meant to be that length. Reads with shorter UMIs will be thrown away; reads with longer UMIs will be kept, but the UMI will be trimmed back to min-umi-len to make all UMIs of uniform length. Otherwise downstream analysis breaks: you can't deduplicate if UMIs are of variable length.
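The keep, trim or discard rule above can be sketched as follows. `normalize_umi` is a hypothetical name, and the in-place truncation is an assumption about how the trimming might be done rather than the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: enforce a uniform UMI length.
 * Returns 0 if the UMI is shorter than min_umi_len (the read should be
 * discarded); otherwise truncates the UMI in place to exactly
 * min_umi_len characters and returns 1. */
int normalize_umi(char *umi, size_t min_umi_len)
{
    assert(umi != NULL);
    if (strlen(umi) < min_umi_len)
        return 0;               /* too short: caller drops the read */
    umi[min_umi_len] = '\0';    /* longer or equal: trim to the minimum */
    return 1;
}
```

After this step every retained UMI has the same length, which is what position-wise comparison during deduplication relies on.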

sample barcodes OR metrics on umis. The implementation isn't robust, and there is some code duplication which will need to be refactored later on. Also, because there are many more UMIs (I've seen up to 280k unique tags per sample), better memory management and smarter search strategies will be needed.

gzipping by itself is slow as well, and it is hard to tell what's better: gzipping as I go or using the standalone pigz tool. I think the concern was that gzipping worked better once the file stopped changing? - also in this commit some minor bug fixes

serine added 30 commits March 6, 2018 18:39

added gitignore

177430b

made barcode to be appended to the fastq header

2ab43e7

refactored help menu

c0b280e

updated README, included note about fork

de51ec2

very annoying commit, could be a deal breaker here...

5ac6256

changed tabs to spaces, made tabs = 4 spaces instead of 8 reindented the code :| universe please forgive me...

added exclusion of vim swap files into gitignore

d984204

updated Makefile to reflect removal of barcode.c file

c49b118

fixed the bug in new strncp_with_mismatch and changed input parameter…

fbf68a0

…s a little. It now takes the barcode and the FASTQ read, the allowed mismatches and the max 5prime crop, which is the maximum number of bases to attempt to crop from the 5 prime end of the read in order to assign a barcode.

removed -std=c99 from make file

1790715

Changed layout of output stats file into a tab separated table

7ba39b9

added new column of total percent of the library for each barcode

updated error message about barcode length being greater than read le…

844ab17

…ngth

made new Makefile and added updated kseq.h file

d4e96a9

Setting myself up for mode feature. Planning to simplify sabre to be

b132954

one command (no sub commands i.e se or pe) instead have -m, --mode flag with several different mode options, all described in docs/mode.md

Added some docs, more like ideas at this stage

0b4d59e

Kind of forgotten what I was doing here, left in staging for a couple

5bce3cb

of weeks... From the diff and distant memory, removed single_end menu options as I'm moving in a slightly different direction. One option is to have a "mode" where single-end is just one of the modes.

updated Makefile that doesn't look at demulti_single.c file

cd0435e

making umis of uniform length based on max-5prime-crop.

e94e9ed

i.e removing n bases from the back of the umi, where n = max-5prime-crop value

attempting to fix memory leak

321c2ff

fixed bug in getting quality string length when using combine mode

e7e1865

fixed bug in skipping umis that are too short.

73eb266

Also added a redirect to a file, umis_too_short.txt, which now holds the names of reads that were discarded because their UMIs were too short

started working on metrics collection script.

204dfe3

e.g. the number of different barcodes in a demultiplexed FASTQ given that one mismatch was allowed. Also want similar metrics for UMIs

wrote functional metrics.c script

db72db1

that returns unsorted table of barcodes and counts

cleaned metrics.c code a little

fe1c1c6

worked on metrics util, now it produces a sorted list

4f4f154

updated gitignore and makefile

22c994f

now can build dev version with make dev

fixed bug in making umi reads of a particular length

ded2d90

I was removing one extra base from the end of the barcode. Also refactored the code for better readability and removed unwanted comments

serine and others added 28 commits August 21, 2018 08:40

Updated makefile

1a07501

far out.. major revamp of sabre, just sabre code left and right

9e56e4a

this wont compile but need to checkpoint the code. I'm scared to run gcc -Wall (0_0) ...

milestone, got all headers in order?

0233441

work in progress, just another commit

a9af68a

individual c file compiles error free, check

a1af555

milestone majore.

af518d7

- dropping kseq.h header, can't multithread with that - wrote fastq parse - struct - plugged that in - updated Makefile

yet another milestone

98db795

milestone, compiles and runs

a17892b

check

cfc3bfa

huzzah!

82bcba9

tweak

3782b8b

check2

02a9c25

Polished off threads features, fixed all issues raised by

850b924

valgrind --leak-check=yes --track-origins=yes -v. To get a dev version, use make clean-all && make dev

Stupid but necessary, reindented all C files to 4 spaces and no tabs!

8cf7501

update makefile to include commit hash into binary build

f96b234

fixed warning message and added git commit hash to the version print

5c071b8

updated readme

6d4172e

removed symlinking sabre binary after build from make file.

01eda92

this causes issues with conda packaging

simplified the versioning to be just a hash of the last commit

db0ee82

at this stage this is more than appropriate

Update sabre.c

c529c29

Error on missing barcode file

Merge pull request #1 from drpowell/patch-1

1250dd2

Update sabre.c

Find best matching barcode

a93381d

- Tries all barcodes up to max crop and max mismatch
- Uses barcode that matches with fewest crop+mismatch
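The two points above can be sketched as an exhaustive search that scores each candidate by crop plus mismatches and keeps the lowest score. The name `best_barcode`, its signature, and the tie-breaking rule (the earliest barcode wins on equal cost) are assumptions for illustration, not the code from this commit.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: return the index of the barcode that matches the
 * read with the smallest (crop + mismatches), or -1 if no barcode
 * matches within max_crop and max_mismatch. Ties keep the earliest
 * barcode in the list (an assumption). */
int best_barcode(const char *const *barcodes, size_t n_barcodes,
                 const char *read, int max_mismatch, int max_crop)
{
    assert(barcodes != NULL && read != NULL);
    size_t read_len = strlen(read);
    int best_idx = -1;
    int best_cost = INT_MAX;

    for (size_t b = 0; b < n_barcodes; b++) {
        size_t bc_len = strlen(barcodes[b]);
        for (int crop = 0; crop <= max_crop; crop++) {
            /* Cropped read must still be at least as long as the barcode. */
            if ((size_t)crop + bc_len > read_len)
                break;
            int mm = 0;
            for (size_t i = 0; i < bc_len; i++)
                if (barcodes[b][i] != read[crop + i])
                    mm++;
            if (mm <= max_mismatch && crop + mm < best_cost) {
                best_cost = crop + mm;
                best_idx = (int)b;
            }
        }
    }
    return best_idx;
}
```

Unlike a first-match strategy, this prefers an exact match on a later barcode over a one-mismatch match on an earlier one.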

Add option to compress (gzip) output files

379b3c7

- Uses popen() to run either pigz or gzip
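A rough sketch of writing compressed output through popen(), assuming a POSIX system with gzip or pigz on the PATH. The helper names `build_compress_cmd` and `open_compressed_writer`, the `use_pigz` flag and the exact shell command are hypothetical, not what this commit does. The simple single quoting does not handle output paths that themselves contain a single quote.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: format "<tool> -c > '<path>'" into buf.
 * Returns the command length, or -1 if it does not fit. */
int build_compress_cmd(char *buf, size_t buf_size,
                       int use_pigz, const char *out_path)
{
    assert(buf != NULL && out_path != NULL);
    int n = snprintf(buf, buf_size, "%s -c > '%s'",
                     use_pigz ? "pigz" : "gzip", out_path);
    return (n >= 0 && (size_t)n < buf_size) ? n : -1;
}

/* Hypothetical helper: open a write pipe to the compressor. Anything
 * written to the returned stream ends up compressed in out_path.
 * The caller must close the stream with pclose(), not fclose(). */
FILE *open_compressed_writer(int use_pigz, const char *out_path)
{
    char cmd[4096];
    if (build_compress_cmd(cmd, sizeof cmd, use_pigz, out_path) < 0)
        return NULL;
    return popen(cmd, "w");
}
```

Writing then works like any other stream, e.g. `fputs(record, fp)` followed by `pclose(fp)` once all reads are written; pclose() waits for the compressor to finish.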

Merge pull request #2 from drpowell/best-match

c1b10b5

Find best matching barcode

Merge pull request #3 from drpowell/pigz

5678529

Add option to compress (gzip) output files

fixed printing of help when no args are given and also drop what appears

663365f

to be "random" if statement inside switch


3 participants
