Add missing HF functionalities #170
Conversation
This commit introduces prepend normalizer, similiar to hugging faces's rust implementation.
Added decode function parameter, to optionaly skip decoding special tokens. Similiary to the HF Rust implementaiton. This change should be agnostic unless set to true.
This commit introduces public member function that converts string to token id. This function is reverse of already existing 'id_to_piece'
Added: - Handling of pretokenizer field explicitly set to null - Handling of bytefallback field along with encode logic
Postprocessing is now separate, configurable step similiar to normalization, pretokenization or decoding.
|
I don't like that this change is a breaking change. One we update the tokenizer pin in ExecuTorch, it will break all the call sites What is your thought of doing something like this? |
…ar tokens" This reverts commit 08e1b39.
d3b88e6 to
64a5dbe
Compare
I agree. Honestly, at first I only GREPed executorch repository for CPPHFTokenizer, and found no issues. Please take a look again :-) Also, I have a custom test suite, that I was using to test whether tokenizers for models we provide in rn-execeutorch are working as expected. Would you be interested in getting this suite? It's kinda .json heavy and I would need to strip a lot of content so I don't push 500k lines. Let me know. |
mergennachin
left a comment
There was a problem hiding this comment.
See inline, also add tests for skip_special_tokens and piece_to_id logic.
|
There's some syntax error in pybind, according to the unit test failure. |
This commit, removes BertProcessor and RobertaProcessor skeleton classes.
This commit adds test cases for: - PieceToId logic - skip_special_tokens logic - PrependNormalizer
mergennachin
left a comment
There was a problem hiding this comment.
Great, @benITo47
I have a few more requests. See inline
src/llama2c_tokenizer.cpp
Outdated
| if (id != -1) { | ||
| return static_cast<uint64_t>(id); | ||
| } else { | ||
| TK_LOG(Error, "Piece '%s' not found in vocabulary", text.c_str()); |
There was a problem hiding this comment.
Is it an Error or could it be downgrade to Info or Debug?
|
Thank you @benITo47 |
While trying to migrate ReactNative-Executorch from hf-tokenizers cpp bindings to your implementation, I found several issues, mostly with parsing and applying tokeniser.json config.
This PR mitigates some of them: