Skip to content

Fix strongs tokenizing filter#20

Open
JJK96 wants to merge 1 commit intodevelopfrom
fix_strongsnumberfilter
Open

Fix strongs tokenizing filter#20
JJK96 wants to merge 1 commit intodevelopfrom
fix_strongsnumberfilter

Conversation

@JJK96
Copy link

@JJK96 JJK96 commented Nov 13, 2025

Previous version gave a null pointer exception on termAtt. After that, it gave issues due to incorrect position increments. I decided to extend FilteringTokenFilter instead of TokenFilter to defer positioning and only handle accepting of tokens, which simplifies the code significantly.

Previous version gave a null pointer exception on termAtt.
After that, it gave issues due to incorrect position increments.
I decided to extend FilteringTokenFilter instead of TokenFilter to
defer positioning and only handle accepting of tokens, which simplifies
the code significantly.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request refactors the StrongsNumberFilter class to fix null pointer exceptions and position increment issues by switching from extending TokenFilter to extending FilteringTokenFilter. The refactoring simplifies the implementation by delegating position management to the parent class and focusing solely on token acceptance logic.

Changes:

  • Replaced parent class from TokenFilter to FilteringTokenFilter
  • Replaced complex incrementToken() method with simpler accept() method
  • Removed custom equals() and hashCode() methods
  • Initialized termAtt attribute in constructor

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PositionIncrementAttribute is imported but never used in the code. This import should be removed as it serves no purpose in the refactored implementation.

Suggested change
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

Copilot uses AI. Check for mistakes.
import java.io.IOException;

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenFilter;
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TokenFilter import is unused since the class now extends FilteringTokenFilter instead of TokenFilter. This import should be removed.

Suggested change
import org.apache.lucene.analysis.TokenFilter;

Copilot uses AI. Check for mistakes.

// There was no more input
return false;
} else {
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a critical bug in the accept() method logic. When processing a Strong's Number with a part (line 63-66), the code calls number.isValid() on the existing StrongsNumber object that was created from a previous token. However, at line 65, it attempts to get the full Strong's number from the same object that was parsed from the PREVIOUS token, not the current token being processed. This will cause the filter to output incorrect token text because it's not using the current token's text at all in the else branch. The current token's text (stored in tokenText) is captured at line 55 but never used in the else branch.

Suggested change
} else {
} else {
// Process the current token's text rather than reusing the previous StrongsNumber
number = new StrongsNumber(tokenText);

Copilot uses AI. Check for mistakes.
Comment on lines +53 to +71
public boolean accept() {
boolean valid;
String tokenText = termAtt.toString();
if (number == null) {
// Need to loop over invalid tokens
while (input.incrementToken()) {
String tokenText = termAtt.toString();

number = new StrongsNumber(tokenText);

// Skip invalid Strong's Numbers.
// Still need to return true as there may be more tokens to filter.
if (!number.isValid()) {
// TRANSLATOR: User error condition: Indicates that what was given is not a Strong's Number. {0} is a placeholder for the bad Strong's Number.
log.warn(JSMsg.gettext("Not a valid Strong's Number \"{0}\"", tokenText));

// Go get the next token
continue;
}

String s = number.getStrongsNumber();
termAtt.setEmpty().append(s);

// If the number had a part keep it around for the next call
// TODO(DMS): if there is a part, then treat as a synonym,
// setting the same position increment.
if (!number.isPart()) {
number = null;
}

// incrementToken returned a value. There may be more input.
return true;
number = new StrongsNumber(tokenText);
valid = number.isValid();
termAtt.setEmpty().append(number.getStrongsNumber());
if (!number.isPart()) {
number = null;
}

// There was no more input
return false;
} else {
valid = number.isValid();
termAtt.setEmpty().append(number.getFullStrongsNumber());
number = null;
}

// Process the Strong's number with the !a
termAtt.setEmpty().append(number.getFullStrongsNumber());
// We are done with the Strong's Number so mark it as used
number = null;
// We are working on a value returned by incrementToken.
// There may be more input.
return true;
}

/* Define to quite FindBugs */
@Override
public boolean equals(Object obj) {
return super.equals(obj);
}

/* Define to quite FindBugs */
@Override
public int hashCode() {
return super.hashCode();
if (!valid) {
log.warn(JSMsg.gettext("Not a valid Strong's Number \"{0}\"", tokenText));
}
return valid;
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accept() method modifies the termAtt (the token text) before returning whether to accept or reject the token. This violates the contract of FilteringTokenFilter's accept() method, which should only determine whether to accept a token, not modify it. Token modifications should be done in a separate incrementToken() override that calls super.incrementToken() and then modifies the accepted token. Modifying the token text in accept() means that even rejected tokens will have their text modified, which could cause issues with the token stream state.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants