Avoid large allocations related to specified length strings #1603
Conversation
mbeckerle
left a comment
The code changes look correct. But I suspect we're just allocating lots of 1K buffers now instead of lots of 1M buffers. I.e., the reuse mechanism is suspect.
  (!hasSpecifiedLength && start.dataInputStream.hasData)
) {
  // Fail!
  PE(start, "%s - Does not contain a nil literal!", eName)
I suggest removing the "!" exclamation point. No need for this emphasis.
This was copied from the below messages. Agreed that this isn't standard with our normal phrasing of errors--I'll remove it from the others too.
// calculate the maximum number of characters that we could possibly decode based on the
// number of available bits and the character set. Note that some character sets have
// variable width encodings, so we can't always know for sure how many characters will be
// found. But we can calculate a maximum based on the smallest possible code unit for the
Feels backward to me. Don't you need the maximum width so you can allocate larger when there are variable-width characters possible and the string may contain N of those? I.e., so for UTF8 you want to allow the worst-case of 4-bytes per character, not 1?
The code isn't calculating the maximum width, it's calculating the maximum number of characters that the available number of bits/bytes could decode to. With UTF-8, we get the maximum number of characters with the minimum single-character length. So N bytes could potentially decode to N characters. It might decode to fewer characters if there are any multibyte chars, but it can't possibly decode to more than N characters.
I'll update this comment to make this more clear.
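The upper-bound reasoning above can be sketched as follows. This is a hypothetical helper for illustration only; `maxPossibleChars` and its parameters are made up and are not the actual Daffodil code.

```scala
// Hypothetical sketch of the bound described above: the number of available
// bits divided by the encoding's *smallest* code unit width gives the maximum
// number of characters the data could possibly decode to.
object CharBound {
  def maxPossibleChars(availableBits: Long, minBitsPerChar: Int): Long = {
    require(minBitsPerChar > 0, "code unit width must be positive")
    availableBits / minBitsPerChar
  }
}
```

For UTF-8 the smallest code unit is 8 bits, so 80 available bits can decode to at most 10 characters, and fewer if any multibyte characters appear.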
} else if (isFieldNilLit(field)) {
  // Contains a nilValue, Success ParseResult indicates nilled
  if (erd.isComplexType) {
    // nillable complex types must have a nilValue of %ES;. For a literal nil specified length
Code looks fine.
I just ask: does logic about '%ES;' and complex nil values exist elsewhere as well?
I ask because there is a longstanding enhancement request for DFDL to allow more than just '%ES;' as nil values for complex types.
The DFDL Workgroup was uncertain where this restriction originated, but some members suspected there is a reason for it that has just been lost to history. Hence, if this feature were to be added it would be experimental until proven ok.
In any case this code is fine as it was already doing complex type nil logic, so it's not like you added another place to the code that reasons about this.
My guess is it does, but it might be hard to find (I couldn't find anything via a quick scan). For example, the previous code didn't really have anything specific to complex nils; it just happened to work because it would either find data or it wouldn't, and the nilValue for complex was %ES;, which matches empty.
I think if we ever do add support for non-%ES; complex nils, we'll just have to be very thorough about testing.
// problem quickly allocating it
val minBufferSize = 1024
val allocationSize = math.max(length.toInt, minBufferSize)
tempBuf = Maybe(allocate(allocationSize))
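A minimal sketch of the minimum-size strategy in the diff above. The `SimpleLocalBuffer` class here is illustrative only, not Daffodil's actual `LocalBuffer` API.

```scala
// Illustrative version of the allocation strategy above: never allocate a
// buffer smaller than minBufferSize, and reuse the cached buffer whenever
// it is already large enough.
class SimpleLocalBuffer(minBufferSize: Int = 1024) {
  private var buf: Array[Char] = _
  def getBuf(length: Long): Array[Char] = {
    val needed = math.max(length.toInt, minBufferSize)
    if (buf == null || buf.length < needed) {
      buf = new Array[Char](needed) // allocate only when the cached buffer is too small
    }
    buf
  }
}
```

With this shape, many small requests all hit the same 1K buffer, and a larger buffer is only allocated on demand.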
I am not getting why this improves performance.
It sounds to me like the reuse mechanism in this code was not working since if it was, your improvement here would not have helped, and that suggests that there is more improvement on the table if we get the reuse of these buffers to work better, or work at all, since I suspect it's not working at all.
I suggest: first take out all the reuse - just allocate and let the GC reclaim exactly-sized buffers and see if that's faster/slower. (checkpoint on the reuse mechanism overhead vs. just allocate+GC)
Second, instrument the reuse of these buffers so we count exactly how many of these are allocated vs. reused from the stack/pool. My bet is the reuse isn't working.
I think the reuse worked (I'll double check as you suggest); the issue is that the reuse is only within a single parse--it's local to the PState. So every time we call parse() we will allocate a new 1MB buffer (if there is a specified length string in the format), and reuse that buffer for the lifetime of that single parse. When that parse ends, we let it get garbage collected along with the PState, and a new one will be allocated the next time parse() is called.
For files that are very small (e.g. <1KB) the 1MB allocation makes a noticeable difference. For larger files it doesn't make as much of a difference because 1MB isn't very much compared to all the other allocations Daffodil does.
An alternative would be to make these thread locals and share them among threads; then we could go back to the 1MB buffer size and you would only take a single hit per thread. But I don't think the 1MB buffer size is all that important. It's going to be very rare for a single specified length string to be very big, likely nowhere close to 1MB.
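The thread-local alternative could look roughly like this. This is entirely hypothetical, not something implemented in Daffodil.

```scala
// Hypothetical thread-local buffer: each thread pays the allocation cost once
// and reuses its buffer across parse() calls, growing it only when needed.
object ThreadLocalCharBuffer {
  private val local: ThreadLocal[Array[Char]] =
    ThreadLocal.withInitial(() => new Array[Char](1024))
  def get(minLength: Int): Array[Char] = {
    var buf = local.get()
    if (buf.length < minLength) {
      buf = new Array[Char](minLength)
      local.set(buf)
    }
    buf
  }
}
```

The tradeoff is that the buffer now outlives any single parse, so long-lived thread pools would pin one buffer per thread for the life of the thread.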
I didn't check whether reuse vs. no reuse changed performance, but I did add a print statement to this getBuf function that printed whether we have to allocate a new buffer or not. Testing with NITF (which has lots of specified length strings and some relatively small files available), it allocated a 1024 byte buffer once and then reused it about 100+ times with no additional allocations. We might need additional testing to make sure it's working correctly for other formats, or it might be worth investigating other reuse (e.g. make LocalBuffer reuse work among threads), but I think the LocalBuffer reuse logic is working as intended.
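The instrumentation described above amounts to counting allocations versus reuses, roughly as follows. This is illustrative only; `CountingLocalBuffer` is not a Daffodil class.

```scala
// Illustrative instrumented buffer, mirroring the print-statement experiment
// described above: count how often getBuf allocates vs. reuses.
class CountingLocalBuffer(minBufferSize: Int = 1024) {
  var allocations = 0
  var reuses = 0
  private var buf: Array[Char] = _
  def getBuf(length: Long): Array[Char] = {
    val needed = math.max(length.toInt, minBufferSize)
    if (buf == null || buf.length < needed) {
      allocations += 1
      buf = new Array[Char](needed)
    } else {
      reuses += 1
    }
    buf
  }
}
```

A healthy reuse pattern looks like 1 allocation and N-1 reuses over N calls whose sizes stay under the minimum.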
Note, the particular LocalBuffer related to the 1MB specified length string allocation is actually from the InputSourceDataInputStream, not the PState. So technically it's only a single 1MB allocation per ISDIS created. If you are streaming data and reusing the same ISDIS you'll only have one allocation. But if you create a new ISDIS per parse (which non-streaming use cases will do), then you effectively get one allocation per parse() call.
So this only improves performance for relatively small files that have specified length strings, in non-streaming use cases.
I tested allocating the exact size needed vs. allocating one large buffer and reusing it. I really didn't see much of a difference. Allocate once + reuse was about the same or maybe 3-5% faster, but that could be within JVM noise.
It's also possible it depends on the specific file and format. For very small formats with little reuse it's probably faster to just allocate exact sizes as needed. But if there are a lot of strings, maybe avoiding the allocations and GC overhead makes a difference. I can't really say without more testing, but I don't think there's a clear winner.
Ok. Thanks for double checking that the reuse is at least working. When we call parse repeatedly in a loop, such as to parse a stream of messages, is that going to allocate this over and over, or are we reusing the PState in that case?
I answered my own question... we have to reuse the PState because the I/O position on the input has to be preserved.
However, other applications will call parse once per input record because they are carving off the input data record/message themselves based on UDP datagram size or other means. In that case those are brand new calls to parse with new PState allocated. This is also a fairly tight loop, so yes, eliminating this 1MB allocation and GC for each parse call should be an improvement.
We should consider pooling entire PState objects and resetting them so that we don't have to reallocate any of this stuff.
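The PState-pooling idea could be sketched with a generic object pool. This is purely hypothetical; `Pool` is not a Daffodil API, and a real implementation would need to handle thread safety and a full reset of all PState internals.

```scala
import scala.collection.mutable

// Hypothetical generic pool: borrow an existing object if one is free,
// otherwise create a new one; reset on release so state doesn't leak
// between uses.
class Pool[T](create: () => T, reset: T => Unit) {
  private val free = mutable.Stack.empty[T]
  def borrow(): T = if (free.nonEmpty) free.pop() else create()
  def release(obj: T): Unit = { reset(obj); free.push(obj) }
}
```

A parse loop would then borrow a PState-like object per message and release it afterwards, paying the allocation cost only on the first iteration.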
mbeckerle
left a comment
+1
Another data point: I just came across a real format that is just a length field followed by a single explicit length string payload that's usually around 800 bytes. This is basically the perfect scenario to see results from this fix. In Daffodil 4.0.0, the file averages about 15ms to parse. With this fix it averages about 0.04ms. So a very significant speedup. Note that one can see similar results in older versions of Daffodil by reducing the
olabusayoT
left a comment
+1
9763ecc to
e3a8de1
When we need to parse a specified length string, we currently allocate a buffer that can be reused to store the decoded string. The size of this buffer is based on the maximumSimpleElementSizeInCharacters tunable, which defaults to a fairly large size (1MB) that can be slow to allocate and puts added pressure on the garbage collector. Fortunately, this buffer is allocated using a LocalBuffer, so it is reused during a parse and at worst there is only one allocation per parse. But when parsing many small files that contain specified length strings, this overhead can become noticeable. And 1MB is likely orders of magnitude larger than the vast majority of data formats will need for any single string element.
To address this, instead of using maximumSimpleElementSizeInCharacters, we calculate how many characters the string could possibly decode to given the current bit position, bit limit, and encoding, and use that as the buffer size to request. This way we only ever request and allocate a large buffer if one is needed, which should be rare.
Note that this new logic requires bitLimit as part of specified string parsing. That isn't available in the edge case of specified length complex nillables. The specified length nil parser is modified to handle this case.
This also modifies the LocalBuffer to allocate buffers of a reasonably large minimum size of 1K. This way we will likely only ever need to allocate a single buffer rather than allocating small buffers that have to be reallocated as larger buffers are needed.
Tested with small NITF files (<4000 bytes) that contain lots of fixed length strings, this showed about 30%+ performance improvements. Files as large as 8000 bytes saw little or no change in performance.
DAFFODIL-2851