E1 tokenizer path not handled correctly #20

@Hrovatin

I downloaded the E1 files from HF and loaded them as follows:

from transformers import AutoModelForMaskedLM

AutoModelForMaskedLM.from_pretrained(
    '/DOWNLOAD_PATH/Synthyra_Profluent-E1-600M',
    trust_remote_code=True,
    local_files_only=True
)

However, this re-downloads the tokenizer files, even though I requested local files only.
The problem is that the tokenizer file is looked up relative to the executing module (__file__) rather than the path given to the AutoModel:

fname = os.path.join(os.path.dirname(__file__), "tokenizer.json")

As a result, the file is not found: it is looked up under my code execution directory (see the printed path below), not in the directory containing the model.
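
One possible direction for a fix, sketched under assumptions: check the directory that was passed to from_pretrained first, and only then fall back to the module location. The helper name resolve_tokenizer_file and its parameters are hypothetical and not part of the E1 code; how the model directory would actually reach the tokenizer-loading code is not shown here.

```python
from pathlib import Path
from typing import Optional


def resolve_tokenizer_file(
    model_dir: str,
    module_file: str,
    local_files_only: bool = False,
) -> Optional[Path]:
    """Return a local tokenizer.json, preferring the model directory.

    Hypothetical helper for illustration only.
    model_dir:   the path the user passed to from_pretrained (assumed available).
    module_file: the __file__ of the module doing the lookup.
    """
    candidates = [
        # Preferred: the directory the user explicitly pointed at.
        Path(model_dir) / "tokenizer.json",
        # Behaviour described in this issue: next to the executing module.
        Path(module_file).resolve().parent / "tokenizer.json",
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate

    if local_files_only:
        # Respect the user's request instead of silently downloading.
        raise FileNotFoundError(
            f"tokenizer.json not found in {model_dir} and local_files_only=True"
        )
    # No local copy: the caller may decide to download.
    return None
```

With a lookup like this, a call with local_files_only=True would either use the tokenizer.json stored next to the downloaded model or fail loudly, rather than fetching the file again.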

This is the output of the above command (I added a print statement showing the tokenizer path inferred by the line above):

Compiling flex attention
Fetching 7 files: 100%|██████████| 7/7 [00:00<00:00, 13148.29it/s]
/CWD/.cache/huggingface/modules/transformers_modules/Synthyra_Profluent-E1-600M/tokenizer.json
E1 Tokenizer not found in local directory, downloading from Hugging Face
/CWD/.cache/huggingface/modules/transformers_modules/Synthyra_Profluent-E1-600M/tokenizer.json
E1 Tokenizer not found in local directory, downloading from Hugging Face
/CWD/.cache/huggingface/modules/transformers_modules/Synthyra_Profluent-E1-600M/tokenizer.json
E1 Tokenizer not found in local directory, downloading from Hugging Face
/CWD/.cache/huggingface/modules/transformers_modules/Synthyra_Profluent-E1-600M/tokenizer.json
E1 Tokenizer not found in local directory, downloading from Hugging Face
