Add missing HF functionalities #170

Merged
mergennachin merged 25 commits into meta-pytorch:main from benITo47:@ben/improvements
Feb 11, 2026

Conversation

@benITo47
Contributor

While trying to migrate ReactNative-Executorch from the hf-tokenizers C++ bindings to your implementation, I found several issues, mostly with parsing and applying the tokenizer.json config.

This PR mitigates some of them (a short usage sketch follows the list):

  • The Prepend normalizer is handled
  • A pre_tokenizer explicitly set to null is handled
  • The byte_fallback flag is respected
  • Added a skip_special_tokens flag to the decode function
  • Added a public piece_to_id method
  • Fixed some C++ issues
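
For context, here is a minimal usage sketch of the new surface. The HFTokenizer class name and header follow this repo, but the load/Result accessors are assumptions, not confirmed signatures:

  // Hypothetical usage of the additions above; signatures are assumed.
  #include <pytorch/tokenizers/hf_tokenizer.h>

  #include <iostream>
  #include <vector>

  int main() {
    tokenizers::HFTokenizer tok;
    if (!tok.load("tokenizer.json").ok()) {  // assumed Result-style check
      return 1;
    }

    // New public piece_to_id(): string piece -> token id.
    auto eos_id = tok.piece_to_id("</s>");  // assumed Result<uint64_t>
    if (eos_id.ok()) {
      std::cout << "</s> -> " << eos_id.get() << std::endl;
    }

    // New decode flag: drop special tokens from the output text.
    std::vector<uint64_t> ids = {1, 15043, 2};  // made-up ids for the sketch
    auto text = tok.decode(ids, /*skip_special_tokens=*/true);
    if (text.ok()) {
      std::cout << text.get() << std::endl;
    }
    return 0;
  }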

This commit introduces a Prepend normalizer, similar to Hugging Face's
Rust implementation.
Added a decode function parameter to optionally skip decoding special
tokens, similar to the HF Rust implementation. This change is a no-op
unless the flag is set to true.
This commit introduces a public member function that converts a string
piece to a token id. This function is the inverse of the existing
'id_to_piece'.
Added:
  - Handling of the pre_tokenizer field explicitly set to null
  - Handling of the byte_fallback field, along with the encode logic (sketched below)
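
The byte-fallback logic referenced in the last item follows the usual SentencePiece-style scheme: a piece missing from the vocabulary is encoded as one <0xNN> token per UTF-8 byte. A rough sketch of the idea, not the PR's actual code; the piece_to_id callback here is a stand-in for a vocabulary lookup:

  // Sketch of byte fallback: unknown pieces fall back to per-byte tokens.
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <string>
  #include <vector>

  std::vector<uint64_t> encode_with_byte_fallback(
      const std::string& piece,
      const std::function<int64_t(const std::string&)>& piece_to_id) {
    int64_t id = piece_to_id(piece);
    if (id >= 0) {
      return {static_cast<uint64_t>(id)};  // piece is in the vocabulary
    }
    std::vector<uint64_t> ids;
    for (unsigned char byte : piece) {
      char buf[8];
      std::snprintf(buf, sizeof(buf), "<0x%02X>", byte);
      int64_t byte_id = piece_to_id(buf);  // byte tokens like <0x41> for 'A'
      if (byte_id >= 0) {
        ids.push_back(static_cast<uint64_t>(byte_id));
      }
    }
    return ids;
  }
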
@meta-cla bot added the CLA Signed label on Jan 28, 2026
@benITo47
Contributor Author

cc: @larryliu0820 @mergennachin

@mergennachin
Contributor

mergennachin commented Jan 30, 2026

I don't like that this change is a breaking change. Once we update the tokenizer pin in ExecuTorch, it will break all the call sites.

What are your thoughts on doing something like this?

  class Tokenizer {
   public:
    // New batch API (pure virtual)
    virtual Result<std::string> decode(
        const std::vector<uint64_t>& tokens,
        bool skip_special_tokens = false) const = 0;

    // Old streaming API - non-breaking, delegates to batch
    // Subclasses can override if they need prev_token context
    virtual Result<std::string> decode(uint64_t prev_token, uint64_t token) const {
      (void)prev_token;  // Default: ignore prev_token like BPE does
      return decode(std::vector<uint64_t>{token}, false);
    }
  };

  // SPTokenizer overrides to preserve prev_token behavior
  class SPTokenizer : public Tokenizer {
   public:
    Result<std::string> decode(uint64_t prev_token, uint64_t token) const override {
       ...
    }
  };
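
One note on this pattern: a subclass that declares its own decode overload hides the inherited streaming one unless it adds a using-declaration, and the braced {token} call needs an explicit vector to avoid overload ambiguity. A self-contained toy version of the delegation (std::string stands in for Result<std::string> so it compiles as-is):

  // Toy version of the non-breaking delegation; not the library's classes.
  #include <cstdint>
  #include <iostream>
  #include <string>
  #include <vector>

  class Tokenizer {
   public:
    virtual ~Tokenizer() = default;
    // New batch API (pure virtual).
    virtual std::string decode(const std::vector<uint64_t>& tokens,
                               bool skip_special_tokens = false) const = 0;
    // Old streaming API: forwards to the batch overload.
    virtual std::string decode(uint64_t prev_token, uint64_t token) const {
      (void)prev_token;
      return decode(std::vector<uint64_t>{token}, false);
    }
  };

  class ToyTokenizer : public Tokenizer {
   public:
    using Tokenizer::decode;  // keep the streaming overload visible
    std::string decode(const std::vector<uint64_t>& tokens,
                       bool /*skip_special_tokens*/) const override {
      std::string out;
      for (uint64_t t : tokens) {
        out += "<" + std::to_string(t) + ">";
      }
      return out;
    }
  };

  int main() {
    ToyTokenizer tok;
    std::cout << tok.decode(0, 42) << "\n";  // old call sites still compile
    return 0;
  }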

@benITo47
Contributor Author

benITo47 commented Feb 4, 2026

> I don't like that this change is a breaking change.

I agree. Honestly, at first I only grepped the executorch repository for CPPHFTokenizer and found no issues.
I reverted those changes; batch decoding is now available as an overload only in the hf_tokenizer, similar to your snippet.

Please take a look again :-)

Also, I have a custom test suite that I was using to check whether the tokenizers for the models we provide in rn-executorch work as expected. Would you be interested in getting this suite? It's quite JSON-heavy, and I would need to strip a lot of content so I don't push 500k lines. Let me know.

@mergennachin left a comment

See inline. Also add tests for the skip_special_tokens and piece_to_id logic.

@mergennachin
Contributor

There's a syntax error in the pybind code, according to the unit test failure.

This commit removes the BertProcessor and RobertaProcessor skeleton
classes.
This commit adds test cases for:
- PieceToId logic
- skip_special_tokens logic
- PrependNormalizer
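
A rough idea of what such tests could look like (GTest-style; the artifact path, token ids, and Result accessors are placeholders, not the PR's actual test code):

  // Hypothetical test sketch; names and values are illustrative only.
  #include <gtest/gtest.h>

  #include <cstdint>
  #include <vector>

  #include <pytorch/tokenizers/hf_tokenizer.h>

  TEST(HFTokenizerTest, PieceToIdLooksUpKnownPiece) {
    tokenizers::HFTokenizer tok;
    ASSERT_TRUE(tok.load("test_artifacts/tokenizer.json").ok());
    auto id = tok.piece_to_id("hello");  // assumed Result<uint64_t>
    EXPECT_TRUE(id.ok());
  }

  TEST(HFTokenizerTest, DecodeCanSkipSpecialTokens) {
    tokenizers::HFTokenizer tok;
    ASSERT_TRUE(tok.load("test_artifacts/tokenizer.json").ok());
    std::vector<uint64_t> ids = {1, 15043, 2};  // <s>, "Hello", </s> (made up)
    auto text = tok.decode(ids, /*skip_special_tokens=*/true);
    ASSERT_TRUE(text.ok());
    EXPECT_EQ(text.get(), "Hello");
  }
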
@mergennachin left a comment

Great, @benITo47

I have a few more requests. See inline

  if (id != -1) {
    return static_cast<uint64_t>(id);
  } else {
    TK_LOG(Error, "Piece '%s' not found in vocabulary", text.c_str());

Is it an Error, or could it be downgraded to Info or Debug?

@mergennachin merged commit 3f97c6b into meta-pytorch:main on Feb 11, 2026
6 checks passed
@mergennachin
Contributor

Thank you @benITo47
