Where may I find the Tahoe100M embedding tensor files for "state emb fit --config"? #237

@aplazar1

Hello,
I want to create embeddings for my own data using the Tahoe100M embeddings as a reference. I have been writing a config.yaml based on the defaults in https://github.com/ArcInstitute/state/blob/main/src/state/configs/state-defaults.yaml#L175. I receive an error pointing to the section of the YAML file for embeddings.current.esm2-cellxgene-tahoe. The example YAML from the link above contains:

esm2-cellxgene-tahoe:
  all_embeddings: /large_storage/ctc/ML/data/cell/misc/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt
  ds_emb_mapping: /large_storage/ctc/datasets/updated1_gene_embidx_mapping_tahoe_basecamp_cellxgene.torch
  valid_genes_masks: /large_storage/ctc/datasets/updated1_valid_gene_index_tahoe_basecamp_cellxgene.torch

Where can I find the corresponding tensor files for Tahoe100M? Are there any additional instructions on how to do this? I have looked at the state GitHub README, but it is out of date. Thank you!
Below is the yaml I have been working on:

zeroshot_infer.yaml

The command I run is: state emb fit --conf /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/zeroshot_infer.yaml > /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/sc_atlas_embeddings.h5ad

Also, can you please explain what is expected for the key-value pairs above? Is each value a dictionary, a tensor file, or something else? The error below reports the object type as "dict", so I am confused about what is expected here. Any help would be greatly appreciated. Thank you! :)
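Going only by the example YAML quoted above, each embeddings entry appears to be a mapping whose three values are string paths to files saved on disk (`.pt` / `.torch`), not inline tensors or dictionaries. That reading is an assumption drawn from the excerpt, not a documented contract. A minimal sketch of a pre-flight check under that assumption follows; `check_embedding_entry` and `example_entry` are hypothetical names for illustration and are not part of the state package.

```python
from pathlib import Path

# Keys shown under esm2-cellxgene-tahoe in the state-defaults.yaml excerpt.
REQUIRED_KEYS = ("all_embeddings", "ds_emb_mapping", "valid_genes_masks")


def check_embedding_entry(entry):
    """Return a list of human-readable problems for one embeddings entry."""
    if not isinstance(entry, dict):
        return [f"entry should be a mapping, got {type(entry).__name__}"]
    problems = []
    for key in REQUIRED_KEYS:
        value = entry.get(key)
        if value is None:
            problems.append(f"missing key: {key}")
        elif not isinstance(value, str):
            problems.append(f"{key} should be a path string, got {type(value).__name__}")
        elif not Path(value).is_file():
            problems.append(f"{key} points to a file that does not exist here: {value}")
    return problems


# Hypothetical entry that lacks valid_genes_masks, mirroring the error in this issue.
example_entry = {
    "all_embeddings": "/large_storage/ctc/ML/data/cell/misc/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt",
    "ds_emb_mapping": "/large_storage/ctc/datasets/updated1_gene_embidx_mapping_tahoe_basecamp_cellxgene.torch",
}

for problem in check_embedding_entry(example_entry):
    print(problem)
```

Paths under `/large_storage/...` would also be reported as missing on a machine that does not mount that storage, so a check like this separates "key absent from the config" from "file absent from this machine".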
Error1:
(base) almaplaza-rodriguez@Almas-MacBook-Air Alma_code % state emb fit --conf /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/zeroshot_infer.yaml > /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/sc_atlas_embeddings.h5ad

Traceback (most recent call last):
  File "/Users/almaplaza-rodriguez/.local/bin/state", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/__main__.py", line 103, in main
    run_emb_fit(cfg, args)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/_cli/_emb/_fit.py", line 54, in run_emb_fit
    trainer_main(cfg)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/emb/train/trainer.py", line 39, in main
    train_dataset_sentence_collator = VCIDatasetSentenceCollator(cfg, is_train=True)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/emb/data/loader.py", line 273, in __init__
    gene_mask_file = utils.get_embedding_cfg(self.cfg).valid_genes_masks
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2]) # set env var OC_CAUSE=1 for full trace
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
           ^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
           ^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
            ^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key valid_genes_masks
    full_key: embeddings.esm2-cellxgene-tahoe.valid_genes_masks
    object_type=dict
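
As far as the message itself can be read, `full_key` names the lookup that failed and `object_type=dict` describes the config node that was being searched (a plain mapping), rather than the type the missing value should have. The sketch below illustrates that reading with an ordinary nested dict standing in for the loaded config. `find_missing_segment` and the `/path/to/...` values are hypothetical and are not part of OmegaConf or state.

```python
def find_missing_segment(config, dotted_key):
    """Walk a nested dict along dotted_key.

    Returns (missing_segment, parent_type_name) for the first segment that
    cannot be resolved, or None when the whole key resolves.
    """
    node = config
    for segment in dotted_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return segment, type(node).__name__
        node = node[segment]
    return None


# Hypothetical config with the same shape as the failing one:
# the entry exists, but valid_genes_masks was never set.
config = {
    "embeddings": {
        "esm2-cellxgene-tahoe": {
            "all_embeddings": "/path/to/all_embeddings.pt",
            "ds_emb_mapping": "/path/to/ds_emb_mapping.torch",
        }
    }
}

result = find_missing_segment(config, "embeddings.esm2-cellxgene-tahoe.valid_genes_masks")
print(result)  # ('valid_genes_masks', 'dict')
```

Under this reading, the entry `esm2-cellxgene-tahoe` was found directly under `embeddings`, and only the `valid_genes_masks` key inside it was absent from the config that was loaded.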
