First of all, thanks for this great package!
I am trying to tokenize OpenAI CLIP text inputs (I am not sure this is even supported) using the tokenizer.json files from Hugging Face models. Although the post-processor (RobertaProcessing) appears to be supported, encoding always fails with a nil pointer panic in the post-processing phase. BERT-style tokenizers work perfectly.
The respective tokenizer configs:
https://huggingface.co/openai/clip-vit-base-patch32/raw/main/tokenizer.json
or
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/raw/main/tokenizer.json
Stack trace:
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x102eb76e8]
goroutine 4 [running]:
testing.tRunner.func1()
/opt/homebrew/Cellar/go/1.21.0/libexec/src/testing/testing.go:1548 +0x528
panic({0x1034dd0a0?, 0x103b04eb0?})
/opt/homebrew/Cellar/go/1.21.0/libexec/src/runtime/panic.go:920 +0x254
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).addSpecialToken(0x14006dc6a80, 0x0)
/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:133 +0x88
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).Process(0x14006dc6a80, 0x1400954e820, 0x0, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:96 +0xc8
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0x14007979200, 0x1400954e820, 0x0, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:612 +0x230
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x14007979200, {0x1034f9660, 0x1400918cec0}, 0x1)
/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:464 +0x560
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingle(0x14007979200, {0x103419fd0, 0x1}, {0x140067bd4bf, 0x1, 0x1})
/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:1085 +0xcc
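If I read the trace correctly, Process receives 0x0 as its third argument and addSpecialToken is then called with 0x0, which suggests (but does not prove) that the nil pair encoding of a single-sequence input gets dereferenced. The sketch below illustrates that failure mode and a nil guard that would avoid it. It is not the package's code: the Encoding type, the wrap and process functions, and the cls/sep ids are all hypothetical stand-ins.

```go
package main

import "fmt"

// Encoding is a hypothetical stand-in for a tokenizer encoding.
// It is NOT the sugarme/tokenizer type.
type Encoding struct {
	IDs []int
}

// wrap surrounds one sequence with the cls and sep ids.
// It returns nil for a nil input instead of dereferencing it.
func wrap(enc *Encoding, cls, sep int) *Encoding {
	if enc == nil {
		return nil
	}
	ids := make([]int, 0, len(enc.IDs)+2)
	ids = append(ids, cls)
	ids = append(ids, enc.IDs...)
	ids = append(ids, sep)
	return &Encoding{IDs: ids}
}

// process mimics a RoBERTa-style layout:
//   single input: <cls> A <sep>
//   pair input:   <cls> A <sep> <sep> B <sep>
// The pair is only touched when it is non-nil.
func process(enc, pair *Encoding, cls, sep int) *Encoding {
	out := wrap(enc, cls, sep)
	if out == nil || pair == nil {
		return out
	}
	out.IDs = append(out.IDs, sep)
	out.IDs = append(out.IDs, pair.IDs...)
	out.IDs = append(out.IDs, sep)
	return out
}

func main() {
	// Single sequence: the pair is nil, as in the failing call above.
	single := process(&Encoding{IDs: []int{10, 11}}, nil, 0, 2)
	fmt.Println(single.IDs)

	// Pair of sequences.
	both := process(&Encoding{IDs: []int{10}}, &Encoding{IDs: []int{20}}, 0, 2)
	fmt.Println(both.IDs)
}
```

With a guard like this, a single-sequence call with a nil pair returns the wrapped first sequence instead of panicking. Whether the real processor needs the same guard depends on what roberta.go:133 actually does.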
Is this intended in any way? If so, could you please consider supporting CLIP tokenizers?
Thanks in advance.