Add missing HF functionalities #170

Merged
mergennachin merged 25 commits into meta-pytorch:main from benITo47:@ben/improvements
Feb 11, 2026

Conversation

@benITo47
Contributor

While trying to migrate ReactNative-Executorch from the hf-tokenizers C++ bindings to your implementation, I found several issues, mostly with parsing and applying the tokenizer.json config.

This PR mitigates some of them (a short usage sketch follows the list):

  • The Prepend normalizer is handled
  • A pre_tokenizer explicitly set to null is handled
  • The byte_fallback flag is respected
  • Added a skip_special_tokens flag to the decode function
  • Added a public piece_to_id method
  • Fixed some C++ issues
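
For context, here is a minimal usage sketch of the new surface. The HFTokenizer class name and header follow this repo, but the load/Result accessors are assumptions, not confirmed signatures:

  // Hypothetical usage of the additions above; signatures are assumed.
  #include <pytorch/tokenizers/hf_tokenizer.h>

  #include <iostream>
  #include <vector>

  int main() {
    tokenizers::HFTokenizer tok;
    if (!tok.load("tokenizer.json").ok()) {  // assumed Result-style check
      return 1;
    }

    // New public piece_to_id(): string piece -> token id.
    auto eos_id = tok.piece_to_id("</s>");  // assumed Result<uint64_t>
    if (eos_id.ok()) {
      std::cout << "</s> -> " << eos_id.get() << std::endl;
    }

    // New decode flag: drop special tokens from the output text.
    std::vector<uint64_t> ids = {1, 15043, 2};  // made-up ids for the sketch
    auto text = tok.decode(ids, /*skip_special_tokens=*/true);
    if (text.ok()) {
      std::cout << text.get() << std::endl;
    }
    return 0;
  }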

This commit introduces a Prepend normalizer, similar to Hugging Face's
Rust implementation.
Added a decode function parameter to optionally skip decoding special
tokens, similar to the HF Rust implementation. This change is a no-op
unless the flag is set to true.
This commit introduces a public member function that converts a string
piece to a token id. This function is the inverse of the existing
'id_to_piece'.
Added:
  - Handling of the pre_tokenizer field explicitly set to null
  - Handling of the byte_fallback field, along with the encode logic (sketched below)
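
The byte-fallback logic referenced in the last item follows the usual SentencePiece-style scheme: a piece missing from the vocabulary is encoded as one <0xNN> token per UTF-8 byte. A rough sketch of the idea, not the PR's actual code; the piece_to_id callback here is a stand-in for a vocabulary lookup:

  // Sketch of byte fallback: unknown pieces fall back to per-byte tokens.
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <string>
  #include <vector>

  std::vector<uint64_t> encode_with_byte_fallback(
      const std::string& piece,
      const std::function<int64_t(const std::string&)>& piece_to_id) {
    int64_t id = piece_to_id(piece);
    if (id >= 0) {
      return {static_cast<uint64_t>(id)};  // piece is in the vocabulary
    }
    std::vector<uint64_t> ids;
    for (unsigned char byte : piece) {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "<0x%02X>", byte);
      int64_t byte_id = piece_to_id(buf);  // byte tokens like <0x41> for 'A'
      if (byte_id >= 0) {
        ids.push_back(static_cast<uint64_t>(byte_id));
      }
    }
    return ids;
  }
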
@meta-cla bot added the CLA Signed label on Jan 28, 2026
@benITo47
Contributor Author

cc: @larryliu0820 @mergennachin

@mergennachin
Contributor

mergennachin commented Jan 30, 2026

I don't like that this change is a breaking change. Once we update the tokenizer pin in ExecuTorch, it will break all the call sites.

What are your thoughts on doing something like this?

  class Tokenizer {
   public:
    // New batch API (pure virtual)
    virtual Result<std::string> decode(
        const std::vector<uint64_t>& tokens,
        bool skip_special_tokens = false) const = 0;

    // Old streaming API - non-breaking, delegates to batch
    // Subclasses can override if they need prev_token context
    virtual Result<std::string> decode(uint64_t prev_token, uint64_t token) const {
      (void)prev_token;  // Default: ignore prev_token like BPE does
      return decode(std::vector<uint64_t>{token}, false);
    }
  };

  // SPTokenizer overrides to preserve prev_token behavior
  class SPTokenizer : public Tokenizer {
   public:
    Result<std::string> decode(uint64_t prev_token, uint64_t token) const override {
       ...
    }
  };
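
One note on this pattern: a subclass that declares its own decode overload hides the inherited streaming one unless it adds a using-declaration, and the braced {token} call needs an explicit vector to avoid overload ambiguity. A self-contained toy version of the delegation (std::string stands in for Result<std::string> so it compiles as-is):

  // Toy version of the non-breaking delegation; not the library's classes.
  #include <cstdint>
  #include <iostream>
  #include <string>
  #include <vector>

  class Tokenizer {
   public:
    virtual ~Tokenizer() = default;
    // New batch API (pure virtual).
    virtual std::string decode(const std::vector<uint64_t>& tokens,
                               bool skip_special_tokens = false) const = 0;
    // Old streaming API: forwards to the batch overload.
    virtual std::string decode(uint64_t prev_token, uint64_t token) const {
      (void)prev_token;
      return decode(std::vector<uint64_t>{token}, false);
    }
  };

  class ToyTokenizer : public Tokenizer {
   public:
    using Tokenizer::decode;  // keep the streaming overload visible
    std::string decode(const std::vector<uint64_t>& tokens,
                       bool /*skip_special_tokens*/) const override {
      std::string out;
      for (uint64_t t : tokens) {
        out += "<" + std::to_string(t) + ">";
      }
      return out;
    }
  };

  int main() {
    ToyTokenizer tok;
    std::cout << tok.decode(0, 42) << "\n";  // old call sites still compile
    return 0;
  }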

@benITo47
Contributor Author

benITo47 commented Feb 4, 2026

> I don't like that this change is a breaking change.

I agree. Honestly, at first I only grepped the executorch repository for CPPHFTokenizer and found no issues.
I reverted those changes; batch decoding is now available as an overload only in the hf_tokenizer, similar to your snippet.

Please take a look again :-)

Also, I have a custom test suite that I was using to check whether the tokenizers for the models we provide in rn-executorch work as expected. Would you be interested in getting this suite? It's quite JSON-heavy, and I would need to strip a lot of content so I don't push 500k lines. Let me know.

@mergennachin left a comment

See inline. Also add tests for the skip_special_tokens and piece_to_id logic.

@mergennachin
Contributor

There's a syntax error in the pybind code, according to the unit test failure.

This commit removes the BertProcessor and RobertaProcessor skeleton
classes.
This commit adds test cases for:
- PieceToId logic
- skip_special_tokens logic
- PrependNormalizer
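
A rough idea of what such tests could look like (GTest-style; the artifact path, token ids, and Result accessors are placeholders, not the PR's actual test code):

  // Hypothetical test sketch; names and values are illustrative only.
  #include <gtest/gtest.h>

  #include <cstdint>
  #include <vector>

  #include <pytorch/tokenizers/hf_tokenizer.h>

  TEST(HFTokenizerTest, PieceToIdLooksUpKnownPiece) {
    tokenizers::HFTokenizer tok;
    ASSERT_TRUE(tok.load("test_artifacts/tokenizer.json").ok());
    auto id = tok.piece_to_id("hello");  // assumed Result<uint64_t>
    EXPECT_TRUE(id.ok());
  }

  TEST(HFTokenizerTest, DecodeCanSkipSpecialTokens) {
    tokenizers::HFTokenizer tok;
    ASSERT_TRUE(tok.load("test_artifacts/tokenizer.json").ok());
    std::vector<uint64_t> ids = {1, 15043, 2};  // <s>, "Hello", </s> (made up)
    auto text = tok.decode(ids, /*skip_special_tokens=*/true);
    ASSERT_TRUE(text.ok());
    EXPECT_EQ(text.get(), "Hello");
  }
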
@mergennachin left a comment

Great, @benITo47

I have a few more requests. See inline

  if (id != -1) {
    return static_cast<uint64_t>(id);
  } else {
    TK_LOG(Error, "Piece '%s' not found in vocabulary", text.c_str());

Is it an Error, or could it be downgraded to Info or Debug?

@mergennachin merged commit 3f97c6b into meta-pytorch:main on Feb 11, 2026
6 checks passed
@mergennachin
Contributor

Thank you @benITo47
