Conversation
source/fasta/fasta.d
Outdated
| { | ||
| pos += chain.window.length; | ||
| chain.extend(0); | ||
| } |
There was a problem hiding this comment.
Don't do this on every token, as it only needs to be done at the beginning. Maybe do it outside the Result struct, in the function before you return it (just after extending for one line).
There was a problem hiding this comment.
Steve @schveiguy: Moving the lines out of the nextToken() does not work if I keep storage type of Result as static. I need to pass in lines and starting pos Which can’t be determined at compile time.
So possible choices:
- Remove
staticFlag and initialize pos with the position where emtpy lines ends. This will reduce performance. - Move declaration of
posup one level, out ofResult. This however will loseposbetween calls tonextToken? - Somehow store current reading position internally in
Chain. This means updateiopipe. I don’t know what’s the consequence of this. Probably it will makeChainthread unsafe?
I am not at my computer so I can’t try it out now.
There was a problem hiding this comment.
There's no reason to keep the window if it's completely empty, you are just skipping it anyway.
So the pos should always start out as 0.
It should work like this:
lines.extend(0); // in code already
while (lines.window.strip.empty)
{
lines.release(lines.window.length);
lines.extend(0);
}
return Result(lines); // in code alreadyThere was a problem hiding this comment.
Thanks Steve. I did not know release can be used this way.
source/fastq/fastq.d
Outdated
| assert(pos <= buf.length); | ||
| assert(pos + length <= buf.length); | ||
| return buf[pos .. pos + length]; | ||
| } |
There was a problem hiding this comment.
This is identical to the one in fasta. Move this to a common file.
|
Going to do a more thorough review later. Haven't got the time right now. |
| { | ||
| auto lines = c.byLine; | ||
| alias ChainType = typeof(lines); | ||
| size_t lineCount = 0; |
There was a problem hiding this comment.
This is unused, probably leftover from a previous version?
There was a problem hiding this comment.
Yes that's a leftover. I read through you reviews, so I will not try to edit the code before further update from you. Thanks for going through the code.
| if(pos == chain.window.length) | ||
| return FastqToken.init; | ||
| logf("Pos: %s, window length: %s, window: %s", pos, chain.window.length, chain.window ); | ||
| assert(chain.window[pos] == '@', "Got this: " ~ chain.window[pos]); |
There was a problem hiding this comment.
I'm rethinking this part of the parser -- assert is not the right tool here, probably we should use enforce, since the incoming data is not program data, but file data. In D, when you want to verify program correctness, you use assert. when you want to verify environment or user-supplied data, you use enforce.
I know I did this in the fasta parser, it should be changed there too.
There was a problem hiding this comment.
Agree. I was also thinking about this when I was listening Ali talking about "contract programming" last week. So, enforce and some "contract" or not necessary?
| if(!lazyArr.empty) | ||
| { | ||
| auto firstElem = lazyArr.front; | ||
| result.entryid = BufRef(pos, firstElem.length); |
There was a problem hiding this comment.
So looking at this and the format specification, you are first splitting on space, and then sub-splitting on colon. However, you are assuming the first item doesn't have any sub-items. I'm not sure what the intention is, or what the actual file format you are encountering, but this seems to be not correct with the given example above
For instance:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
entryid will be set to @EAS139:136:FC706VJ:2:2104:15343:197393
and the sub-fields will be parsed as 1, Y, 18, ATCACG.
Not sure if this is what you are looking for, or maybe the actual format looks different in practice?
EDIT: I see that you are indeed parsing the first field as well, but entryid will still be set to what I said above, it's just that you will get all the subfields as hfields and the other fields as extras. I think I see what you are doing here, I think you need to set the entryid later, if you are trying to store it as EAS139, but maybe I'm wrong.
There was a problem hiding this comment.
EAS139 is instrument ID, and it is repeated across all entries and files produced by the instrument. So it is not a good sequence ID. You are right, the entry ID is still EAS139:136:FC706VJ:2:2104:15343:197393, and it is correct. For library users, hfields and extras are something useful, but not always. In some other case, users may just take the whole header line and focus more onto the sequence and quality. So, I am thinking for performance optimization later, we can make header parsing part optional.
| { | ||
| result.extras ~= BufRef(pos, f.length); | ||
| } | ||
| pos += f.length + 1; |
There was a problem hiding this comment.
This is so ugly, I hate the way this works (same in fasta). I want to create a new mechanism to parse out BufRefs, I'll work on that, and then you can use it in both places for much better parsing.
There was a problem hiding this comment.
I'm going to stop reviewing here, because with a new mechanism for parsing, this code is going to look a lot different (better).
There was a problem hiding this comment.
Looking forward to the new mechanism. I don't not have to use this for my work right now so I can wait a bit.
Hi Steve
Main part of this PR is mainly about committing the library for FASTQ format. I also handled the case in which empty lines at the beginning of the input fails the parsers.
The FASTQ library needs review and optimize, so I push it up to get your input.
Thanks