@ArneBab ArneBab commented Jul 25, 2024

This avoids very common text encoding problems.

Bombe commented Jul 25, 2024

You have not avoided the very common test writing problem! 😄

ArneBab commented Sep 22, 2024

> You have not avoided the very common test writing problem! 😄

I had also not avoided the very common "my change does not have any effect and a test would have shown that" problem 😓

Now it’s fixed: our plain text filter actually detects the charset from the BOM and uses UTF-8 by default.
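
To illustrate what "detects the charset from the BOM and uses UTF-8 by default" can mean, here is a minimal, self-contained sketch. It is not the code from this PR: the class BomSniffer and its methods are hypothetical names, and it only covers the UTF-8, UTF-16 and UTF-32 byte-order marks.

```java
final class BomSniffer {
    /** Returns the charset implied by a leading byte-order mark, or null if there is none. */
    static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        // Check UTF-32LE before UTF-16LE: FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        return null;
    }

    /** Falls back to UTF-8 when no byte-order mark is present. */
    static String detectOrDefault(byte[] head) {
        String fromBom = detect(head);
        return fromBom != null ? fromBom : "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the signed Java byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8Bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x40 };
        byte[] utf16LeBom = { (byte) 0xFF, (byte) 0xFE, 0x40, 0x00 };
        byte[] noBom = { 0x40 };
        System.out.println(detectOrDefault(utf8Bom));    // UTF-8
        System.out.println(detectOrDefault(utf16LeBom)); // UTF-16LE
        System.out.println(detectOrDefault(noBom));      // UTF-8 (default)
    }
}
```

Keeping detect (may return null) separate from detectOrDefault lets a caller tell "BOM said UTF-8" apart from "nothing found, assuming UTF-8".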

if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) {
    byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE];
    int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer);
    BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE);

I’m pretty sure this is 100% wrong. That method detects an encoding from the representation of the string @charset. It is also gloriously misnamed as it has nothing to do with a BOM. 😀

Yep, I agree with @Bombe here. See my other comment for something that does appear to work.

ArneBab force-pushed the content-filter--text-utf8 branch from 2827f9e to 748fdae on November 11, 2024 07:24

bertm commented May 11, 2025

Both new tests fail on my machine, not sure why they work on CI.

Comment on lines 345 to 349
if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) {
    int bufferSize = handler.charsetExtractor.getCharsetBufferSize();
    input.mark(bufferSize);
    byte[] charsetBuffer = new byte[bufferSize];
    int bytesRead = 0, offset = 0, toread=0;
    while(true) {
        toread = bufferSize - offset;
        bytesRead = input.read(charsetBuffer, offset, toread);
        if(bytesRead == -1 || toread == 0) break;
        offset += bytesRead;
    }
    input.reset();
    int offset = readIntoBuffer(input, bufferSize, charsetBuffer);
    charset = detectCharset(charsetBuffer, offset, handler, maybeCharset);

I believe the correct fix is to move this block of code to right before the if(handler.readFilter != null) check: text/plain has no readFilter but does set takesACharset, so detectCharset would then run for it as well.

A few things to consider:

  • handler.charsetExtractor.getCharsetBufferSize() will throw an NPE when handler.charsetExtractor is absent, so in that case bufferSize needs to fall back to the maximum BOM length (5?).
  • Alternatively, a dummy CharsetExtractor that always fails to detect a charset could be used, so that the BOM-based detection is used automatically.
  • This will return UTF-8 rather than utf-8, so the related test would need some adjustment.
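
The "dummy extractor" idea from the second bullet could look roughly like the sketch below. Everything in it is an assumption for illustration: Extractor is a simplified stand-in and not fred's actual CharsetExtractor interface, the names NO_DETECTION and orFallback are made up, and FALLBACK_BUFFER_SIZE is a placeholder value since the thread leaves the exact size open.

```java
final class FallbackExtractorSketch {
    /** Hypothetical, simplified stand-in for a charset extractor; NOT fred's real interface. */
    interface Extractor {
        int bufferSize();

        /** Returns a charset name, or null when nothing could be detected. */
        String detect(byte[] head, int length);
    }

    /** Assumed placeholder: 4 bytes covers the UTF-8/16/32 BOMs; the right value is undecided. */
    static final int FALLBACK_BUFFER_SIZE = 4;

    /** Null object: always reports "not detected", so callers fall through to BOM detection. */
    static final Extractor NO_DETECTION = new Extractor() {
        @Override
        public int bufferSize() {
            return FALLBACK_BUFFER_SIZE;
        }

        @Override
        public String detect(byte[] head, int length) {
            return null;
        }
    };

    /** Replaces a missing extractor with the null object so callers need no null checks. */
    static Extractor orFallback(Extractor maybeNull) {
        return maybeNull != null ? maybeNull : NO_DETECTION;
    }

    public static void main(String[] args) {
        // Simulates a handler that has no extractor of its own.
        Extractor extractor = orFallback(null);
        byte[] head = new byte[extractor.bufferSize()]; // no NullPointerException here
        System.out.println(head.length + " bytes buffered, detected: "
                + extractor.detect(head, head.length));
    }
}
```

Compared with special-casing the buffer size at the call site, the null-object variant keeps the detection block free of extra branches, at the cost of one more class to maintain.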

byte[] buf = { (byte) 0xef, (byte) 0xbb, (byte) 0xbf, 0x40 };
ArrayBucket out = new ArrayBucket();
FilterStatus fo = ContentFilter.filter(new ArrayBucket(buf).getInputStream(), out.getOutputStream(), "text/plain", null, null, null);
assertTrue("utf-8".equals(fo.charset));

Please use assertThat(actual, equalTo(expected)) or assertEquals(expected, actual) for checking equality. The assertTrue(...equals(...)) form just yields a non-descriptive AssertionError when it fails, instead of showing what the actual value was.
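
To make the difference concrete without pulling in a test framework, the sketch below uses a hand-rolled checkEquals helper. It is a hypothetical stand-in that only mimics the idea of an equality assertion. It is not JUnit or Hamcrest, and the message format is made up.

```java
import java.util.Objects;

final class AssertionMessageSketch {
    /** Hypothetical helper: an equality check that reports both values on failure. */
    static void checkEquals(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected:<" + expected + "> but was:<" + actual + ">");
        }
    }

    /** Runs a check and returns the failure message, or "passed" if nothing was thrown. */
    static String messageOf(Runnable check) {
        try {
            check.run();
            return "passed";
        } catch (AssertionError e) {
            return e.getMessage() == null ? "(no message)" : e.getMessage();
        }
    }

    public static void main(String[] args) {
        String actual = "UTF-8";

        // Boolean-style check: on failure all that is known is "it was false".
        System.out.println(messageOf(() -> {
            if (!"utf-8".equals(actual)) throw new AssertionError();
        }));

        // Equality-style check: the failure shows the expected and the actual value.
        System.out.println(messageOf(() -> checkEquals("utf-8", actual)));
    }
}
```

With the second form, a case mismatch such as utf-8 versus UTF-8 is visible directly in the failure output.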

ArneBab commented Nov 8, 2025

#1109 does part of the work of this PR with just a single line change.

I now think the correct way to deal with this here would be to set the charset when detecting text/plain. That’s then needed both in pyFreenet and other utils and in fred. Basically always set utf-8, if that can correctly decode the text.
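
A minimal sketch of "set utf-8 if that can correctly decode the text", assuming the whole payload is available as a byte array. The class Utf8Check and its method names are hypothetical. The decoder calls are from the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** Returns true if the bytes form well-formed UTF-8 (strict decoding, no replacement). */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns "utf-8" for text/plain data that decodes cleanly, otherwise null (unknown). */
    static String charsetForPlainText(byte[] data) {
        return isValidUtf8(data) ? "utf-8" : null;
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(charsetForPlainText(utf8));   // utf-8
        System.out.println(charsetForPlainText(latin1)); // null
    }
}
```

Pure ASCII input also passes this check, which fits the intent here since ASCII is a subset of UTF-8. For large files a streaming variant would be needed rather than decoding one array.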

@ArneBab ArneBab closed this Nov 8, 2025