
Conversation

@stevedlawrence
Member

When we need to parse a specified length string, we currently allocate a buffer that can be reused to store the decoded string. The size of this buffer is based on the maximumSimpleElementSizeInCharacters tunable, which defaults to a fairly large size (1MB) that can be slow to allocate and puts added pressure on the garbage collector. Fortunately, this buffer is allocated using a LocalBuffer so it is reused during a parse; at worst there is only one allocation per parse. But when parsing many small files that contain specified length strings, this overhead can become noticeable. And 1MB is likely orders of magnitude larger than what the vast majority of data formats will need for any single string element.

To address this, instead of using maximumSimpleElementSizeInCharacters, we calculate how many characters the string could possibly decode to given the current bit position, bit limit, and encoding, and use that as the buffer size to request. This way we only ever request and allocate a large buffer if one is ever needed, which should be rare.
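
A minimal sketch of the sizing calculation, with hypothetical names (maxCharsForRemainingBits and minBitsPerChar are illustrative, not the actual Daffodil code):

```scala
// No encoding can decode to more characters than the remaining bits divided
// by the width of its smallest code unit (e.g. 8 bits for UTF-8, 16 for
// UTF-16), so that quotient is a safe upper bound for the buffer size.
def maxCharsForRemainingBits(bitPos: Long, bitLimit: Long, minBitsPerChar: Int): Long = {
  val remainingBits = bitLimit - bitPos
  remainingBits / minBitsPerChar
}
```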

Note that this new logic requires bitLimit as part of specified length string parsing. That isn't available in the edge case of specified length complex nillables, so the specified length nil parser is modified to handle this case.

This also modifies the LocalBuffer to allocate buffers with a reasonably large minimum size of 1K. This way we will likely only ever need to allocate a single buffer rather than allocating small buffers that have to be reallocated as larger buffers are needed.

Testing with small NITF files (<4000 bytes) that contain lots of fixed length strings showed performance improvements of about 30% or more. Files as large as 8000 bytes saw little or no change in performance.

DAFFODIL-2851

Contributor

@mbeckerle mbeckerle left a comment

The code changes look correct. But I suspect we're just allocating lots of 1K buffers now instead of lots of 1M buffers. I.e., the reuse mechanism is suspect.

(!hasSpecifiedLength && start.dataInputStream.hasData)
) {
// Fail!
PE(start, "%s - Does not contain a nil literal!", eName)
Contributor

I suggest removing the "!" exclamation point. No need for this emphasis.

Member Author

This was copied from the messages below. Agreed that this isn't consistent with our normal phrasing of errors; I'll remove it from the others too.

// calculate the maximum number of characters that we could possibly decode based on the
// number of available bits and the character set. Note that some character sets have
// variable width encodings, so we can't always know for sure how many characters will be
// found. But we can calculate a maximum based on the smallest possible code unit for the
Contributor

Feels backward to me. Don't you need the maximum width so you can allocate a larger buffer when variable-width characters are possible and the string may contain N of those? I.e., for UTF-8 you want to allow the worst case of 4 bytes per character, not 1?

Member Author

The code isn't calculating the maximum width, it's calculating the maximum number of characters that the available number of bits/bytes could decode to. With UTF-8, we get the maximum number of characters with the minimum single character length. So N bytes could potentially decode to N characters. It might decode to fewer characters if there are any multibyte chars, but it can't possibly decode to more than N characters.

I'll update this comment to make this more clear.
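
A small illustration of that bound using plain JVM string decoding (just for demonstration, unrelated to the Daffodil code):

```scala
// A UTF-8 byte sequence never decodes to more characters than it has bytes,
// because the smallest UTF-8 code unit is a single byte.
val bytes = "héllo".getBytes("UTF-8") // 6 bytes ("é" encodes as 2 bytes)
val str = new String(bytes, "UTF-8")  // decodes back to 5 characters
assert(str.length <= bytes.length)
```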

} else if (isFieldNilLit(field)) {
// Contains a nilValue, Success ParseResult indicates nilled
if (erd.isComplexType) {
// nillable complex types must have a nilValue of %ES;. For a literal nil specified length
Contributor

Code looks fine.

I just ask: does logic about '%ES;' and complex nil values exist elsewhere as well?
I ask because there is a longstanding enhancement request for DFDL to allow more than just '%ES;' as nil values for complex types.

The DFDL Workgroup was uncertain where this restriction originated, but some members suspected there is a reason for it that has just been lost to history. Hence, if this feature were to be added it would be experimental until proven ok.

In any case this code is fine as it was already doing complex type nil logic, so it's not like you added another place in the code that reasons about this.

Member Author

My guess is it does, but it might be hard to find (I couldn't find anything via a quick scan). For example, the previous code didn't really have anything specific to complex nils; it just happened to work because it would either find data or it wouldn't, and the nilValue for complex types was %ES;, which matches empty.

I think if we ever do add support for non-%ES complex nils, we'll just have to be very thorough about testing.

// problem quickly allocating it
val minBufferSize = 1024
val allocationSize = math.max(length.toInt, minBufferSize)
tempBuf = Maybe(allocate(allocationSize))
Contributor

I am not getting why this improves performance.

It sounds to me like the reuse mechanism in this code was not working, since if it was, your improvement here would not have helped. That suggests there is more improvement on the table if we get the reuse of these buffers to work better, or to work at all, since I suspect it's not working at all.

I suggest: first, take out all the reuse - just allocate and let the GC reclaim exactly-sized buffers and see if that's faster/slower (a checkpoint on the reuse mechanism overhead vs. just allocate+GC).

Second, instrument the reuse of these buffers so we count exactly how many of these are allocated vs. reused from the stack/pool. My bet is the reuse isn't working.
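
A sketch of the kind of counting instrumentation being suggested here (hypothetical names, not the actual LocalBuffer code):

```scala
import java.util.concurrent.atomic.AtomicLong

// Counters to compare fresh allocations against reuses across a run.
object BufferStats {
  val allocated = new AtomicLong(0)
  val reused = new AtomicLong(0)
}

// Inside a hypothetical getBuf(length), bump the matching counter:
//   if (tempBuf.isEmpty || tempBuf.get.capacity < length)
//     BufferStats.allocated.incrementAndGet()
//   else
//     BufferStats.reused.incrementAndGet()
```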

Member Author

I think the reuse worked (I'll double check as you suggest); the issue is that the reuse is only within a single parse, since it's local to the PState. So every time we call parse() we will allocate a new 1MB buffer (if there is a specified length string in the format) and reuse that buffer for the lifetime of that single parse. When that parse ends, we let it get garbage collected along with the PState, and a new one will be allocated the next time parse() is called.

For files that are very small (e.g. <1KB) the 1MB allocation makes a noticeable difference. For larger files it doesn't make as much of a difference because 1MB isn't very much compared to all the other allocations Daffodil does.

An alternative would be to make these buffers thread local and share them among threads; then we could go back to the 1MB buffer size and you would only take a single allocation hit per thread. But I don't think the 1MB buffer size is all that important. It's going to be very rare for a single specified length string to be very big, likely nowhere close to 1MB.
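
A rough sketch of that thread-local alternative (hypothetical; not what this PR implements):

```scala
import java.nio.CharBuffer

// One decode buffer per thread, shared across parse() calls on that thread.
object ThreadLocalDecodeBuffer {
  private val local = new ThreadLocal[CharBuffer] {
    override def initialValue(): CharBuffer = CharBuffer.allocate(1024)
  }

  def get(minSize: Int): CharBuffer = {
    var buf = local.get()
    if (buf.capacity < minSize) {
      // grow the per-thread buffer when a larger string is requested
      buf = CharBuffer.allocate(minSize)
      local.set(buf)
    }
    buf.clear()
    buf
  }
}
```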

Member Author

I didn't check whether reuse vs. no reuse changed performance, but I did add a print statement to this getBuf function that reports whether we have to allocate a new buffer or not. Testing with NITF (which has lots of specified length strings and some relatively small files available), it allocated a 1024 byte buffer once and then reused it about 100+ times with no additional allocations. We might need additional testing to make sure it's working correctly everywhere for other formats, or it might be worth investigating other reuse (e.g. make LocalBuffer reuse among threads), but I think the LocalBuffer reuse logic is working as intended.

Member Author

Note, the particular LocalBuffer that was related to the 1MB specified length string allocation is actually from the InputSourceDataInputStream, not the PState. So technically it's only a single 1MB allocation per ISDIS created. If you are streaming data and reusing the same ISDIS you'll only have one allocation. But if you create a new ISDIS per parse (which non-streaming use cases will do), then you will effectively get one allocation per parse() call.

So this only improves performance for relatively small files that have specified length strings and are parsed non-streaming.

Member Author

I tested allocating the exact size needed vs. allocating one large buffer and reusing it. I really didn't see much of a difference. Allocate once + reuse was about the same or maybe 3-5% faster, but that could be within JVM noise.

It's also possible it depends on the specific file and format. For very small formats with little reuse it's probably faster to just allocate exact sizes as needed. But if there are a lot of strings, maybe avoiding the allocations and GC overhead makes a difference. I can't really say without more testing, but I don't think there's a clear winner.

Contributor

Ok. Thanks for double checking that the reuse is at least working. When we call parse repeatedly in a loop, such as to parse a stream of messages, is that going to allocate this over and over, or are we reusing the PState in that case?

I answered my own question... we have to reuse the PState because the I/O position on the input has to be preserved.

However, other applications will call parse once per input record because they are carving off the input data record/message themselves based on UDP datagram size or other means. In that case those are brand new calls to parse with a new PState allocated. This is also a fairly tight loop, so yes, eliminating this 1MB allocation and GC for each parse call should be an improvement.

We should consider pooling entire PState objects and resetting them so that we don't have to reallocate any of this stuff.
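
A generic sketch of that pooling idea (hypothetical; resetting a real PState would involve restoring far more state than a simple callback):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Minimal object pool: borrow a pooled instance if available, otherwise
// construct a new one; release resets the object and returns it for reuse.
class ObjectPool[T](make: () => T, reset: T => Unit) {
  private val pool = new ConcurrentLinkedQueue[T]()
  def borrow(): T = Option(pool.poll()).getOrElse(make())
  def release(obj: T): Unit = { reset(obj); pool.offer(obj) }
}
```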

Contributor

@mbeckerle mbeckerle left a comment

+1

@stevedlawrence
Member Author

Another data point: I just came across a real format that is just a length field followed by a single explicit length string payload that's usually around 800 bytes. This is basically the perfect scenario to see results from this fix. In Daffodil 4.0.0, the file averages about 15ms to parse. With this fix it averages about 0.04ms. So a very significant speedup.

Note that one can see similar results in older versions of Daffodil by reducing the maximumSimpleElementSizeInCharacters tunable to the maximum size of a simple element you expect to see.
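
For example, through the Java/Scala API (this assumes the withTunable method on the compiler from the Daffodil 3.x japi; entry points may differ across versions):

```scala
import org.apache.daffodil.japi.Daffodil

// Cap decode buffers at 1024 characters instead of the 1MB default.
val compiler = Daffodil.compiler()
  .withTunable("maximumSimpleElementSizeInCharacters", "1024")
```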

Contributor

@olabusayoT olabusayoT left a comment

+1

@stevedlawrence stevedlawrence force-pushed the daffodil-2851-specified-length-string-allocations branch from 9763ecc to e3a8de1 on December 19, 2025 12:59
@stevedlawrence stevedlawrence merged commit d28fb52 into apache:main Dec 19, 2025
11 checks passed
@stevedlawrence stevedlawrence deleted the daffodil-2851-specified-length-string-allocations branch December 19, 2025 13:23