Skip to content

Fastq parser#3

Open
biocyberman wants to merge 4 commits intoschveiguy:masterfrom
bioslaD:dev
Open

Fastq parser#3
biocyberman wants to merge 4 commits intoschveiguy:masterfrom
bioslaD:dev

Conversation

@biocyberman
Copy link
Contributor

@biocyberman biocyberman commented May 31, 2018

Hi Steve

Main part of this PR is mainly about committing the library for FASTQ format. I also handled the case in which empty lines at the beginning of the input fails the parsers.

The FASTQ library needs review and optimize, so I push it up to get your input.

Thanks

{
pos += chain.window.length;
chain.extend(0);
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do this on every token, as it only needs to be done at the beginning. Maybe do it outside the Result struct, in the function before you return it (just after extending for one line).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

Copy link
Contributor Author

@biocyberman biocyberman Jun 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steve @schveiguy: Moving the lines out of the nextToken() does not work if I keep storage type of Result as static. I need to pass in lines and starting pos Which can’t be determined at compile time.

So possible choices:

  • Remove static Flag and initialize pos with the position where emtpy lines ends. This will reduce performance.
  • Move declaration of pos up one level, out of Result. This however will lose pos between calls to nextToken?
  • Somehow store current reading position internally in Chain. This means update iopipe. I don’t know what’s the consequence of this. Probably it will make Chain thread unsafe?

I am not at my computer so I can’t try it out now.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reason to keep the window if it's completely empty, you are just skipping it anyway.

So the pos should always start out as 0.

It should work like this:

    lines.extend(0); // in code already
    while (lines.window.strip.empty)
    {
        lines.release(lines.window.length);
        lines.extend(0);
    }
    return Result(lines); // in code already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Steve. I did not know release can be used this way.

assert(pos <= buf.length);
assert(pos + length <= buf.length);
return buf[pos .. pos + length];
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is identical to the one in fasta. Move this to a common file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@schveiguy
Copy link
Owner

Going to do a more thorough review later. Haven't got the time right now.

{
auto lines = c.byLine;
alias ChainType = typeof(lines);
size_t lineCount = 0;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unused, probably leftover from a previous version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's a leftover. I read through you reviews, so I will not try to edit the code before further update from you. Thanks for going through the code.

if(pos == chain.window.length)
return FastqToken.init;
logf("Pos: %s, window length: %s, window: %s", pos, chain.window.length, chain.window );
assert(chain.window[pos] == '@', "Got this: " ~ chain.window[pos]);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm rethinking this part of the parser -- assert is not the right tool here, probably we should use enforce, since the incoming data is not program data, but file data. In D, when you want to verify program correctness, you use assert. when you want to verify environment or user-supplied data, you use enforce.

I know I did this in the fasta parser, it should be changed there too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I was also thinking about this when I was listening Ali talking about "contract programming" last week. So, enforce and some "contract" or not necessary?

if(!lazyArr.empty)
{
auto firstElem = lazyArr.front;
result.entryid = BufRef(pos, firstElem.length);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So looking at this and the format specification, you are first splitting on space, and then sub-splitting on colon. However, you are assuming the first item doesn't have any sub-items. I'm not sure what the intention is, or what the actual file format you are encountering, but this seems to be not correct with the given example above

For instance:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

entryid will be set to @EAS139:136:FC706VJ:2:2104:15343:197393
and the sub-fields will be parsed as 1, Y, 18, ATCACG.

Not sure if this is what you are looking for, or maybe the actual format looks different in practice?

EDIT: I see that you are indeed parsing the first field as well, but entryid will still be set to what I said above, it's just that you will get all the subfields as hfields and the other fields as extras. I think I see what you are doing here, I think you need to set the entryid later, if you are trying to store it as EAS139, but maybe I'm wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EAS139 is instrument ID, and it is repeated across all entries and files produced by the instrument. So it is not a good sequence ID. You are right, the entry ID is still EAS139:136:FC706VJ:2:2104:15343:197393, and it is correct. For library users, hfields and extras are something useful, but not always. In some other case, users may just take the whole header line and focus more onto the sequence and quality. So, I am thinking for performance optimization later, we can make header parsing part optional.

{
result.extras ~= BufRef(pos, f.length);
}
pos += f.length + 1;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so ugly, I hate the way this works (same in fasta). I want to create a new mechanism to parse out BufRefs, I'll work on that, and then you can use it in both places for much better parsing.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to stop reviewing here, because with a new mechanism for parsing, this code is going to look a lot different (better).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to the new mechanism. I don't not have to use this for my work right now so I can wait a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants