
OpenAI CLIP tokenization support? #34

@kristofmaar

Description


First of all thanks for this great package!

I am trying to tokenize OpenAI CLIP text inputs (which I am not sure is even supported) using the tokenizer.json files of the Hugging Face models. Unfortunately, even though the post-processor (RobertaProcessing) appears to be supported, encoding always fails with a nil pointer panic during post-processing. With BERT-style tokenizers it works perfectly. A minimal sketch of what I am doing follows the config links below.

The respective tokenizer configs:
https://huggingface.co/openai/clip-vit-base-patch32/raw/main/tokenizer.json
or
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K/raw/main/tokenizer.json
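Roughly what I am running (a minimal sketch; pretrained.FromFile is shorthand for loading the downloaded tokenizer.json, and the input text is just an example):

package main

import (
	"fmt"
	"log"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	// Load the downloaded tokenizer.json (pretrained.FromFile is used here
	// as shorthand for the loading step).
	tk, err := pretrained.FromFile("tokenizer.json")
	if err != nil {
		log.Fatal(err)
	}

	// Encode a single input with special tokens enabled. With a BERT-style
	// tokenizer.json this works; with the CLIP configs it panics during
	// post-processing (see the stack trace below).
	en, err := tk.EncodeSingle("a photo of a cat", true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(en.Tokens)
}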

Stack trace:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x102eb76e8]
goroutine 4 [running]:
testing.tRunner.func1()
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/testing/testing.go:1548 +0x528
panic({0x1034dd0a0?, 0x103b04eb0?})
	/opt/homebrew/Cellar/go/1.21.0/libexec/src/runtime/panic.go:920 +0x254
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).addSpecialToken(0x14006dc6a80, 0x0)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:133 +0x88
github.com/sugarme/tokenizer/processor.(*RobertaProcessing).Process(0x14006dc6a80, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/processor/roberta.go:96 +0xc8
github.com/sugarme/tokenizer.(*Tokenizer).PostProcess(0x14007979200, 0x1400954e820, 0x0, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:612 +0x230
github.com/sugarme/tokenizer.(*Tokenizer).Encode(0x14007979200, {0x1034f9660, 0x1400918cec0}, 0x1)
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:464 +0x560
github.com/sugarme/tokenizer.(*Tokenizer).EncodeSingle(0x14007979200, {0x103419fd0, 0x1}, {0x140067bd4bf, 0x1, 0x1})
	/Users/kristof/go/pkg/mod/github.com/sugarme/tokenizer@v0.2.2/tokenizer.go:1085 +0xcc
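If I am reading the trace right, RobertaProcessing.addSpecialToken at roberta.go:133 receives a nil token (the second argument is 0x0), so it looks like one of the post-processor's special tokens is not populated when the CLIP tokenizer.json is deserialized.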

Is this behaviour intended, i.e. are CLIP tokenizers simply not supported yet? If so, could you please consider adding support for them?

Thanks in advance.
