Hello,
I want to create embeddings from my own data, using the Tahoe100M embeddings as a reference. I have been writing a config.yaml based on the defaults at https://github.com/ArcInstitute/state/blob/main/src/state/configs/state-defaults.yaml#L175. I receive an error pointing to the YAML section for embeddings.current.esm2-cellxgene-tahoe. In the example YAML from the link above, that section reads:
esm2-cellxgene-tahoe:
  all_embeddings: /large_storage/ctc/ML/data/cell/misc/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt
  ds_emb_mapping: /large_storage/ctc/datasets/updated1_gene_embidx_mapping_tahoe_basecamp_cellxgene.torch
  valid_genes_masks: /large_storage/ctc/datasets/updated1_valid_gene_index_tahoe_basecamp_cellxgene.torch
Where can I find the corresponding tensor files for Tahoe100M? Are there any additional instructions on how to do this? I have looked at the state GitHub README, but it is out of date. Pretty please and thanks!
Below is the yaml I have been working on:
I use it to run: state emb fit --conf /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/zeroshot_infer.yaml > /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/sc_atlas_embeddings.h5ad
Also, can you please explain what is expected for the three key-value pairs above? Is each value a dictionary, a tensor file, or something else? The error below reports the object type as "dict", and I am very confused about what is expected here. Any help would be greatly appreciated. Please and thanks! :)
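In the example YAML, each of the three values is a plain string holding a filesystem path; the YAML itself does not show whether the file behind a path stores a dict or a tensor. As a minimal sketch of a sanity check, assuming the esm2-cellxgene-tahoe entry has already been parsed into a Python dict (the helper name `check_embedding_paths` is hypothetical and not part of state), the following reports which of the three keys are present and point to files that exist on the local machine:

```python
from pathlib import Path

# Key names copied from the esm2-cellxgene-tahoe entry in the example YAML.
EXPECTED_KEYS = ("all_embeddings", "ds_emb_mapping", "valid_genes_masks")


def check_embedding_paths(entry: dict) -> dict:
    """Hypothetical helper: classify each expected key of one embeddings entry."""
    report = {}
    for key in EXPECTED_KEYS:
        value = entry.get(key)
        if value is None:
            report[key] = "missing key"
        elif not isinstance(value, str):
            report[key] = "not a path string"
        elif not Path(value).is_file():
            report[key] = "file not found"
        else:
            report[key] = "ok"
    return report


if __name__ == "__main__":
    # Paths taken verbatim from the example YAML; they refer to a cluster
    # filesystem, so on a laptop they would be reported as "file not found".
    entry = {
        "all_embeddings": "/large_storage/ctc/ML/data/cell/misc/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt",
        "ds_emb_mapping": "/large_storage/ctc/datasets/updated1_gene_embidx_mapping_tahoe_basecamp_cellxgene.torch",
    }
    for key, status in check_embedding_paths(entry).items():
        print(f"{key}: {status}")
```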
Error1:
(base) almaplaza-rodriguez@Almas-MacBook-Air Alma_code % state emb fit --conf /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/zeroshot_infer.yaml > /Users/almaplaza-rodriguez/Documents/Huang_lab/Alma_code/sc_atlas_embeddings.h5ad
Traceback (most recent call last):
  File "/Users/almaplaza-rodriguez/.local/bin/state", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/__main__.py", line 103, in main
    run_emb_fit(cfg, args)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/_cli/_emb/_fit.py", line 54, in run_emb_fit
    trainer_main(cfg)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/emb/train/trainer.py", line 39, in main
    train_dataset_sentence_collator = VCIDatasetSentenceCollator(cfg, is_train=True)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/state/emb/data/loader.py", line 273, in __init__
    gene_mask_file = utils.get_embedding_cfg(self.cfg).valid_genes_masks
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise(
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/base.py", line 231, in _format_and_raise
    format_and_raise(
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/_utils.py", line 899, in format_and_raise
    _raise(ex, cause)
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/_utils.py", line 797, in _raise
    raise ex.with_traceback(sys.exc_info()[2])  # set env var OC_CAUSE=1 for full trace
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 351, in __getattr__
    return self._get_impl(
           ^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 442, in _get_impl
    node = self._get_child(
           ^^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/basecontainer.py", line 73, in _get_child
    child = self._get_node(
            ^^^^^^^^^^^^^^^
  File "/Users/almaplaza-rodriguez/.local/share/uv/tools/arc-state/lib/python3.11/site-packages/omegaconf/dictconfig.py", line 480, in _get_node
    raise ConfigKeyError(f"Missing key {key!s}")
omegaconf.errors.ConfigAttributeError: Missing key valid_genes_masks
    full_key: embeddings.esm2-cellxgene-tahoe.valid_genes_masks
    object_type=dict
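
Reading the last three lines of the traceback: the lookup that failed was `embeddings.esm2-cellxgene-tahoe.valid_genes_masks`, so the loaded config appears to have no `valid_genes_masks` key under that entry. As far as I understand OmegaConf's error format, `object_type=dict` describes the config node the lookup was made on, not the type the value is expected to have. The sketch below imitates that dotted lookup on a plain nested dict to find the first segment that is absent; `first_missing_segment` is a hypothetical helper, and the nested dict is an assumed shape for a config that lacks the key, not my actual file.

```python
from typing import Optional


def first_missing_segment(cfg: dict, full_key: str) -> Optional[str]:
    """Walk cfg along a dotted key; return the first absent segment, or None."""
    node = cfg
    for segment in full_key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return segment
        node = node[segment]
    return None


if __name__ == "__main__":
    # Assumed config shape: the entry exists but has no valid_genes_masks key.
    # The two paths are copied from the example YAML.
    cfg = {
        "embeddings": {
            "esm2-cellxgene-tahoe": {
                "all_embeddings": "/large_storage/ctc/ML/data/cell/misc/Homo_sapiens.GRCh38.gene_symbol_to_embedding_ESM2.pt",
                "ds_emb_mapping": "/large_storage/ctc/datasets/updated1_gene_embidx_mapping_tahoe_basecamp_cellxgene.torch",
            }
        }
    }
    key = "embeddings.esm2-cellxgene-tahoe.valid_genes_masks"
    print(first_missing_segment(cfg, key))  # prints: valid_genes_masks
```

Splitting on "." is safe for this key because the hyphenated entry name contains no dots; a key with dots inside a segment would need a different approach.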