Conversation
Previous version gave a null pointer exception on termAtt. After that, it gave issues due to incorrect position increments. I decided to extend FilteringTokenFilter instead of TokenFilter to defer positioning and only handle accepting of tokens, which simplifies the code significantly.
There was a problem hiding this comment.
Pull request overview
This pull request refactors the StrongsNumberFilter class to fix null pointer exceptions and position increment issues by switching from extending TokenFilter to extending FilteringTokenFilter. The refactoring simplifies the implementation by delegating position management to the parent class and focusing solely on token acceptance logic.
Changes:
- Replaced parent class from
TokenFiltertoFilteringTokenFilter - Replaced complex
incrementToken()method with simpleraccept()method - Removed custom
equals()andhashCode()methods - Initialized
termAttattribute in constructor
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import org.apache.lucene.analysis.TokenFilter; | ||
| import org.apache.lucene.analysis.TokenStream; | ||
| import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; | ||
| import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; |
There was a problem hiding this comment.
The PositionIncrementAttribute is imported but never used in the code. This import should be removed as it serves no purpose in the refactored implementation.
| import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute; |
| import java.io.IOException; | ||
|
|
||
| import org.apache.lucene.analysis.FilteringTokenFilter; | ||
| import org.apache.lucene.analysis.TokenFilter; |
There was a problem hiding this comment.
The TokenFilter import is unused since the class now extends FilteringTokenFilter instead of TokenFilter. This import should be removed.
| import org.apache.lucene.analysis.TokenFilter; |
|
|
||
| // There was no more input | ||
| return false; | ||
| } else { |
There was a problem hiding this comment.
There is a critical bug in the accept() method logic. When processing a Strong's Number with a part (line 63-66), the code calls number.isValid() on the existing StrongsNumber object that was created from a previous token. However, at line 65, it attempts to get the full Strong's number from the same object that was parsed from the PREVIOUS token, not the current token being processed. This will cause the filter to output incorrect token text because it's not using the current token's text at all in the else branch. The current token's text (stored in tokenText) is captured at line 55 but never used in the else branch.
| } else { | |
| } else { | |
| // Process the current token's text rather than reusing the previous StrongsNumber | |
| number = new StrongsNumber(tokenText); |
| public boolean accept() { | ||
| boolean valid; | ||
| String tokenText = termAtt.toString(); | ||
| if (number == null) { | ||
| // Need to loop over invalid tokens | ||
| while (input.incrementToken()) { | ||
| String tokenText = termAtt.toString(); | ||
|
|
||
| number = new StrongsNumber(tokenText); | ||
|
|
||
| // Skip invalid Strong's Numbers. | ||
| // Still need to return true as there may be more tokens to filter. | ||
| if (!number.isValid()) { | ||
| // TRANSLATOR: User error condition: Indicates that what was given is not a Strong's Number. {0} is a placeholder for the bad Strong's Number. | ||
| log.warn(JSMsg.gettext("Not a valid Strong's Number \"{0}\"", tokenText)); | ||
|
|
||
| // Go get the next token | ||
| continue; | ||
| } | ||
|
|
||
| String s = number.getStrongsNumber(); | ||
| termAtt.setEmpty().append(s); | ||
|
|
||
| // If the number had a part keep it around for the next call | ||
| // TODO(DMS): if there is a part, then treat as a synonym, | ||
| // setting the same position increment. | ||
| if (!number.isPart()) { | ||
| number = null; | ||
| } | ||
|
|
||
| // incrementToken returned a value. There may be more input. | ||
| return true; | ||
| number = new StrongsNumber(tokenText); | ||
| valid = number.isValid(); | ||
| termAtt.setEmpty().append(number.getStrongsNumber()); | ||
| if (!number.isPart()) { | ||
| number = null; | ||
| } | ||
|
|
||
| // There was no more input | ||
| return false; | ||
| } else { | ||
| valid = number.isValid(); | ||
| termAtt.setEmpty().append(number.getFullStrongsNumber()); | ||
| number = null; | ||
| } | ||
|
|
||
| // Process the Strong's number with the !a | ||
| termAtt.setEmpty().append(number.getFullStrongsNumber()); | ||
| // We are done with the Strong's Number so mark it as used | ||
| number = null; | ||
| // We are working on a value returned by incrementToken. | ||
| // There may be more input. | ||
| return true; | ||
| } | ||
|
|
||
| /* Define to quite FindBugs */ | ||
| @Override | ||
| public boolean equals(Object obj) { | ||
| return super.equals(obj); | ||
| } | ||
|
|
||
| /* Define to quite FindBugs */ | ||
| @Override | ||
| public int hashCode() { | ||
| return super.hashCode(); | ||
| if (!valid) { | ||
| log.warn(JSMsg.gettext("Not a valid Strong's Number \"{0}\"", tokenText)); | ||
| } | ||
| return valid; |
There was a problem hiding this comment.
The accept() method modifies the termAtt (the token text) before returning whether to accept or reject the token. This violates the contract of FilteringTokenFilter's accept() method, which should only determine whether to accept a token, not modify it. Token modifications should be done in a separate incrementToken() override that calls super.incrementToken() and then modifies the accepted token. Modifying the token text in accept() means that even rejected tokens will have their text modified, which could cause issues with the token stream state.
Previous version gave a null pointer exception on termAtt. After that, it gave issues due to incorrect position increments. I decided to extend FilteringTokenFilter instead of TokenFilter to defer positioning and only handle accepting of tokens, which simplifies the code significantly.