-
Notifications
You must be signed in to change notification settings - Fork 231
Provide text with UTF-8 MIME Type by default #970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
You have not avoided the very common test writing problem! 😄 |
I had also not avoided the very common "my change does not have any effect and a test would have shown that" problem 😓 Now it’s fixed: our plain text filter actually detects the charset from the BOM and uses UTF-8 by default. |
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE]; | ||
| int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer); | ||
| BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m pretty sure this is 100% wrong. That method detects an encoding from the representation of the string @charset. It is also gloriously misnamed as it has nothing to do with a BOM. 😀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I agree with @Bombe here. See my other comment for something that does appear to work.
This avoids very common text encoding problems.
2827f9e to
748fdae
Compare
|
Both new tests fail on my machine, not sure why they work on CI. |
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| int bufferSize = handler.charsetExtractor.getCharsetBufferSize(); | ||
| input.mark(bufferSize); | ||
| byte[] charsetBuffer = new byte[bufferSize]; | ||
| int bytesRead = 0, offset = 0, toread=0; | ||
| while(true) { | ||
| toread = bufferSize - offset; | ||
| bytesRead = input.read(charsetBuffer, offset, toread); | ||
| if(bytesRead == -1 || toread == 0) break; | ||
| offset += bytesRead; | ||
| } | ||
| input.reset(); | ||
| int offset = readIntoBuffer(input, bufferSize, charsetBuffer); | ||
| charset = detectCharset(charsetBuffer, offset, handler, maybeCharset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the correct solution to this problem is moving this block of code right before the if(handler.readFilter != null) check: text/plain does not have a readFilter, but does takesACharset so this would run the detectCharset appropriately.
Few things to consider:
handler.charsetExtractor.getCharsetBufferSize()will NPE, so we need to choose the bufferSize to the max BOM length (5?) whenhandler.charsetExtractoris absent.- Alternatively a dummy CharsetExtractor can be used that always fails to detect a charset, so the BOM one is used automagically.
- this will return
UTF-8rather thanutf-8so the related test would need some adjustment.
| if(handler.takesACharset && ((charset == null) || (charset.isEmpty()))) { | ||
| byte[] charsetBuffer = new byte[CHARSET_DETECTION_FALLBACK_BUFFERSIZE]; | ||
| int offset = readIntoBuffer(input, CHARSET_DETECTION_FALLBACK_BUFFERSIZE, charsetBuffer); | ||
| BOMDetection bom = CSSReadFilter.detectCharsetFromBOM(charsetBuffer, CHARSET_DETECTION_FALLBACK_BUFFERSIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I agree with @Bombe here. See my other comment for something that does appear to work.
| byte[] buf = { (byte) 0xef, (byte) 0xbb, (byte) 0xbf, 0x40 }; | ||
| ArrayBucket out = new ArrayBucket(); | ||
| FilterStatus fo = ContentFilter.filter(new ArrayBucket(buf).getInputStream(), out.getOutputStream(), "text/plain", null, null, null); | ||
| assertTrue("utf-8".equals(fo.charset)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use assertThat(actual, equalTo(expected)) or assertEquals(expected, actual) for checking equality - this just yields a non-descriptive AssertionError when it fails instead of showing what the actual value was.
|
#1109 does part of the work of this PR with just a single line change. I now think the correct way to deal with this here would be to set the charset when detecting text/plain. That’s then needed both in pyFreenet and other utils and in fred. Basically always set utf-8, if that can correctly decode the text. |
This avoids very common text encoding problems.