
OpenAI CLIP tokenization support? #34

@kristofmaar

Description


First of all thanks for this great package!

I am trying to tokenize OpenAI CLIP text inputs (which I am not sure is even supported) using the tokenizer.json files of the Hugging Face models. Unfortunately, even though the post-processor (RobertaProcessing) appears to be supported, encoding always fails with a nil pointer panic during post-processing. With BERT-style tokenizers it works perfectly. A minimal sketch of what I am doing follows the config links below.

The respective tokenizer configs:
https://huggingface.co/openai/clip-vit-base-patch32/raw/main/tokenizer.json
or
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/raw/main/tokenizer.json
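Roughly what I am running (a minimal sketch; pretrained.FromFile is shorthand for loading the downloaded tokenizer.json, and the input text is just an example):

package main

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Load the downloaded tokenizer.json (pretrained.FromFile is used here
	// as shorthand for the loading step).
	tk, err := pretrained.FromFile("tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	// Encode a single input with special tokens enabled. With a BERT-style
	// tokenizer.json this works; with the CLIP configs it panics during
	// post-processing (see the stack trace below).
	en, err := tk.EncodeSingle("a photo of a cat", true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(en.Tokens)
}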

Stack trace:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x102eb76e8]
goroutine 4 [running]:
testing.tRunner.func1()
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/testing/testing.go:1548 +0x528
panic({0x1034dd0a0?, 0x103b04eb0?})
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/runtime/panic.go:920 +0x254
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).addSpecialToken(0x14006dc6a80, 0x0)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:133 +0x88
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).Process(0x14006dc6a80, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:96 +0xc8
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0x14007979200, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:612 +0x230
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x14007979200, {0x1034f9660, 0x1400918cec0}, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:464 +0x560
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingle(0x14007979200, {0x103419fd0, 0x1}, {0x140067bd4bf, 0x1, 0x1})
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:1085 +0xcc
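If I am reading the trace right, RobertaProcessing.addSpecialToken at roberta.go:133 receives a nil token (the second argument is 0x0), so it looks like one of the post-processor's special tokens is not populated when the CLIP tokenizer.json is deserialized.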

Is this behaviour intended, i.e. are CLIP tokenizers simply not supported yet? If so, could you please consider adding support for them?

Thanks in advance.
