Fastq parser by biocyberman · Pull Request #3 · schveiguy/fastaq

biocyberman · 2018-05-31T13:33:18Z

Hi Steve

The main part of this PR adds the library for the FASTQ format. I also handled the case in which empty lines at the beginning of the input cause the parsers to fail.

The FASTQ library needs review and optimization, so I am pushing it up to get your input.

Thanks

schveiguy · 2018-05-31T14:00:01Z

source/fasta/fasta.d

+              {
+                pos += chain.window.length;
+                chain.extend(0);
+              }


Don't do this on every token, as it only needs to be done at the beginning. Maybe do it outside the Result struct, in the function before you return it (just after extending for one line).

Steve @schveiguy: Moving the lines out of nextToken() does not work if I keep the storage class of Result as static. I need to pass in lines and the starting pos, which can't be determined at compile time.

So the possible choices are:

Remove the static flag and initialize pos with the position where the empty lines end. This will reduce performance.

Move the declaration of pos up one level, out of Result. However, won't this lose pos between calls to nextToken?

Somehow store the current reading position internally in Chain. This means updating iopipe. I don't know what the consequences of this would be. It would probably make Chain thread-unsafe?

I am not at my computer, so I can't try it out now.

There's no reason to keep the window if it's completely empty; you are just skipping it anyway.

So the pos should always start out as 0.

It should work like this:

    lines.extend(0); // in code already
    while (lines.window.strip.empty)
    {
        lines.release(lines.window.length);
        lines.extend(0);
    }
    return Result(lines); // in code already

Thanks Steve. I did not know release can be used this way.

schveiguy · 2018-05-31T15:47:40Z

source/fastq/fastq.d

+    assert(pos <= buf.length);
+    assert(pos + length <= buf.length);
+    return buf[pos .. pos + length];
+  }


This is identical to the one in fasta. Move this to a common file.
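As a rough illustration of what a shared helper could look like, here is a minimal sketch built around the slice logic shown in the diff above. The struct layout matches the `BufRef(pos, length)` calls that appear in this PR, but the method name `value` and the templated buffer parameter are assumptions made for this example, not the project's confirmed API.

```d
/// Hypothetical shared type for both the fasta and fastq parsers.
/// A BufRef records where a field lives inside a buffer instead of
/// copying the field's contents.
struct BufRef
{
    size_t pos;    // offset of the field inside the buffer
    size_t length; // number of elements in the field

    /// Resolve the reference against a buffer (the name `value` is
    /// an assumption for this sketch).
    auto value(B)(B buf) const
    {
        // Same checks as in the diff: these guard program logic,
        // not file contents, so assert is appropriate here.
        assert(pos <= buf.length);
        assert(pos + length <= buf.length);
        return buf[pos .. pos + length];
    }
}
```

With one definition in a common module, both parsers could import it rather than each keeping an identical copy.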

schveiguy · 2018-05-31T15:49:46Z

Going to do a more thorough review later. Haven't got the time right now.

schveiguy · 2018-06-23T11:49:00Z

source/fastq/fastq.d

+  {
+    auto lines = c.byLine;
+    alias ChainType = typeof(lines);
+    size_t lineCount = 0;


This is unused, probably leftover from a previous version?

Yes, that's a leftover. I read through your reviews, so I will not edit the code before further updates from you. Thanks for going through the code.

schveiguy · 2018-06-23T11:52:35Z

source/fastq/fastq.d

+        if(pos == chain.window.length)
+          return FastqToken.init;
+        logf("Pos: %s, window length: %s, window: %s", pos, chain.window.length, chain.window );
+        assert(chain.window[pos] == '@', "Got this: " ~ chain.window[pos]);


I'm rethinking this part of the parser. assert is not the right tool here; we should probably use enforce, since the incoming data is file data, not program data. In D, when you want to verify program correctness, you use assert. When you want to verify the environment or user-supplied data, you use enforce.

I know I did this in the fasta parser, it should be changed there too.
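The distinction can be sketched as follows. The function name `checkRecordStart` and its parameters are hypothetical and exist only for this example; `enforce` comes from `std.exception` and throws an `Exception` when the condition is false, whereas `assert` failures signal a program bug and are compiled out in `-release` builds.

```d
import std.exception : enforce;
import std.format : format;

/// Hypothetical helper (not part of the PR): verifies that a FASTQ
/// record starts with '@' at offset `pos` of the current window.
void checkRecordStart(const(char)[] window, size_t pos)
{
    // Program invariant: the parser itself must keep pos in bounds,
    // so a violation here is a bug in our own code.
    assert(pos < window.length);

    // File content: a malformed file is not a program bug, so throw
    // an exception the caller can catch and report.
    enforce(window[pos] == '@',
        format("Expected '@' at offset %s, got '%s'", pos, window[pos]));
}
```

A caller reading untrusted files can then wrap the parse in a try/catch and report the bad record instead of aborting.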

Agreed. I was also thinking about this while listening to Ali talk about "contract programming" last week. So, should we use enforce together with some contracts, or are contracts not necessary?

schveiguy · 2018-06-23T12:00:55Z

source/fastq/fastq.d

+        if(!lazyArr.empty)
+          {
+            auto firstElem = lazyArr.front;
+            result.entryid = BufRef(pos, firstElem.length);


Looking at this and the format specification, you are first splitting on space and then sub-splitting on colon. However, you are assuming the first item doesn't have any sub-items. I'm not sure what the intention is, or what file format you are actually encountering, but this does not seem correct for the example given above.

For instance:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

entryid will be set to @EAS139:136:FC706VJ:2:2104:15343:197393
and the sub-fields will be parsed as 1, Y, 18, ATCACG.

Not sure if this is what you are looking for, or maybe the actual format looks different in practice?

EDIT: I see that you are indeed parsing the first field as well, but entryid will still be set to what I said above; you will just get all the subfields as hfields and the other fields as extras. I think I see what you are doing here. If you are trying to store the entry ID as EAS139, you need to set entryid later, but maybe I'm wrong.

EAS139 is the instrument ID, and it is repeated across all entries and files produced by the instrument, so it is not a good sequence ID. You are right: the entry ID is still EAS139:136:FC706VJ:2:2104:15343:197393, and that is correct. For library users, hfields and extras are useful, but not always. In other cases, users may just take the whole header line and focus on the sequence and quality. So, as a later performance optimization, I am thinking we can make the header-parsing part optional.
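To make the split described in this thread concrete, here is a minimal sketch that divides a header line at the first space and then sub-splits each half on colons. It uses plain slices rather than the PR's `BufRef` offsets, and it strips the leading `@` from the entry ID; the `Header` struct and `parseHeader` function are assumptions for illustration only, not the code under review.

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : findSplit;
import std.array : array;
import std.exception : enforce;

/// Hypothetical result type; field names mirror those in the discussion.
struct Header
{
    const(char)[] entryid;   // whole first field, without the '@'
    const(char)[][] hfields; // colon-separated parts of the first field
    const(char)[][] extras;  // colon-separated parts after the first space
}

Header parseHeader(const(char)[] line)
{
    // The header comes from a file, so validate it with enforce.
    enforce(line.length > 0 && line[0] == '@',
        "FASTQ header must start with '@'");

    // parts[0] is the text before the first space, parts[2] the text after.
    auto parts = line[1 .. $].findSplit(" ");

    Header h;
    h.entryid = parts[0];
    h.hfields = parts[0].splitter(':').array;
    h.extras = parts[2].splitter(':').array;
    return h;
}
```

For the example line above, this yields seven hfields starting with `EAS139` and four extras (`1`, `Y`, `18`, `ATCACG`). Making the two `splitter` calls optional would be one way to skip header parsing when a caller only needs the sequence and quality.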

schveiguy · 2018-06-23T12:10:36Z

source/fastq/fastq.d

+                      {
+                        result.extras ~= BufRef(pos, f.length);
+                      }
+                    pos += f.length + 1;


This is so ugly, I hate the way this works (same in fasta). I want to create a new mechanism to parse out BufRefs, I'll work on that, and then you can use it in both places for much better parsing.

I'm going to stop reviewing here, because with a new mechanism for parsing, this code is going to look a lot different (better).

Looking forward to the new mechanism. I don't have to use this for my work right now, so I can wait a bit.

biocyberman added 2 commits May 31, 2018 13:59

Deal with whitespaces at the beginning of input files

d349c86

First working fastq parser :)

6102845

schveiguy reviewed May 31, 2018

biocyberman added 2 commits June 2, 2018 23:27

Relocate code of skipping emtpy lines

87ec9da

Refactor code organization

9e934c2

biocyberman force-pushed the dev branch from afbc55d to 9e934c2 Compare June 2, 2018 21:32

schveiguy reviewed Jun 23, 2018
