
feat: support custom projector callback in compute_text_projection #170

Closed

peter-gy wants to merge 1 commit into apple:main from peter-gy:feat/custom-text-projector-callback

Conversation

@peter-gy
Contributor

compute_text_projection currently handles the full text pipeline: text -> embeddings -> UMAP projection. That works well for the built-in litellm and sentence_transformer providers, but it makes it difficult to reuse embeddings that were already computed elsewhere. A common case is storing embeddings in systems like ChromaDB or LanceDB and wanting to explore them with Embedding Atlas.

Today, doing that requires users to either

  • recompute embeddings inside compute_text_projection, or
  • reimplement the UMAP and nearest-neighbor computation themselves.

This PR allows the text_projector arg of compute_text_projection to be a TextProjectorCallback in addition to the built-in string options. The callback can return precomputed embeddings or use any custom embedding implementation. Embedding Atlas still handles the rest of the projection flow, so users can plug in their own embeddings without giving up the existing projection setup and caching behavior.
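For illustration, here is a minimal sketch of what such a callback could look like. The callback signature (texts plus `model`/`batch_size` keyword options, returning one vector per text) is assumed from this PR, and `precomputed` is a stand-in for embeddings fetched from a store like ChromaDB or LanceDB:

```python
import numpy as np

# Stand-in for embeddings computed elsewhere (e.g. ChromaDB, LanceDB).
precomputed = {
    "hello world": np.array([0.1, 0.2, 0.3]),
    "embedding atlas": np.array([0.4, 0.5, 0.6]),
}

def reuse_precomputed(texts, *, model=None, batch_size=None):
    """A TextProjectorCallback sketch that returns already-computed embeddings.

    Instead of re-embedding the texts, look each one up in the external
    store and stack the vectors into a (len(texts), dim) array.
    """
    return np.stack([precomputed[t] for t in texts])

vectors = reuse_precomputed(["hello world", "embedding atlas"])
print(vectors.shape)  # (2, 3)
```

With this PR, passing `text_projector=reuse_precomputed` to compute_text_projection would hand those vectors to the existing UMAP, nearest-neighbor, and caching machinery.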

@donghaoren
Collaborator

Umm, have you tried the compute_vector_projection function? That takes pre-computed vectors as input.

@peter-gy
Contributor Author

Oh wow, I have no idea how I missed that. I've been using compute_text_projection all along and totally ignored compute_vector_projection. Thanks!

I still see value in providing users the flexibility to specify a custom projector callback through this PR, but if that's not something you would like to support for now, please feel free to close this.

@domoritz
Member

What use case would a custom callback have? We probably want to avoid having two ways to do the same thing (https://peps.python.org/pep-0020/).

@peter-gy
Contributor Author

peter-gy commented Mar 12, 2026

> What use case would a custom callback have? We probably want to avoid having two ways to do the same thing (https://peps.python.org/pep-0020/).

I see two main use cases:

  • using embedding models that are not available through sentence_transformers or litellm
  • invoking embedding models through a specific library not supported by Embedding Atlas, such as Pydantic AI, or ChromaDB, which supports instantiating embedding functions from a JSON manifest across Python, TypeScript, and Rust

@domoritz
Member

Ah, I see. So instead of computing everything in compute_vector_projection, you would provide a callback function from some library/service to process a single projection.

@peter-gy
Contributor Author

peter-gy commented Mar 12, 2026

> Ah, I see. So instead of computing everything in compute_vector_projection, you would provide a callback function from some library/service to process a single projection.

Exactly. The TextProjectorCallback also takes batch_size as an argument, so it can compute projections in batches and does not need to process a single projection at a time. This can also be useful when a lab releases a new model (as Google did with Gemini Embedding 2.0 with multimodal inputs) and some features are only fully supported through the vendor's own SDK. In that case, users might want to quickly wrap a google genai client call in a TextProjectorCallback. This is loosely similar to how LLooM supports custom LLM and embedding APIs (https://stanfordhci.github.io/lloom/about/custom-models.html#ex-setup-functions-for-gemini).
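A hedged sketch of that batching pattern, with the vendor SDK call stubbed out as `embed_batch` (both that helper and the callback signature are illustrative assumptions, not part of Embedding Atlas):

```python
import numpy as np

def embed_batch(batch):
    # Stand-in for a vendor SDK call (e.g. a google genai client).
    # Here it just returns a deterministic dummy vector per text,
    # filled with that text's length, so the example is runnable.
    return [np.full(4, float(len(text))) for text in batch]

def batched_projector(texts, *, model=None, batch_size=2):
    """TextProjectorCallback sketch: embed batch_size texts per SDK call."""
    out = []
    for i in range(0, len(texts), batch_size):
        out.extend(embed_batch(texts[i : i + batch_size]))
    return np.stack(out)

vectors = batched_projector(["a", "bb", "ccc"], batch_size=2)
print(vectors.shape)  # (3, 4)
```

Swapping `embed_batch` for a real client call is the only change needed to target a new model through its own SDK.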

@donghaoren
Collaborator

Thanks for the explanation!

It looks like the preprocessing in _projection_for_texts is minimal, and everything is passed almost directly into the TextProjectorCallback. There's a caching mechanism which could be useful, but as the logger.warning added in this PR notes, the cache cannot guarantee that a custom TextProjectorCallback is hashed correctly (I think even with a known function name, the user may still change the function's contents, e.g., in a notebook), so it may return stale cached results. Given that the cache won't be very reliable, and that it's not much more work to write:

df["vector"] = custom_text_projector(df["text"], batch_size=batch_size, model=model)
compute_vector_projection(df, vector="vector", ...)

than

compute_text_projection(df, text="text", text_projector=custom_text_projector)

(unless the above doesn't work and it's tricky to pass the vectors around in the current API). I think it's probably better to not add this option?

btw, I think multimodal inputs are interesting to support; maybe we should just have one compute_projection function that can take multiple modalities (or perhaps even data with mixed modalities).

@peter-gy
Contributor Author

Thanks, that makes sense. Given the caching caveats around arbitrary callbacks, I agree this would probably add more confusion than value right now, so I'm closing the PR.

> btw, I think multimodal inputs are interesting to support; maybe we should just have one compute_projection function that can take multiple modalities (or perhaps even data with mixed modalities).

I'm experimenting with multimodal embeddings at the moment, so I'd be happy to circle back once I have a better sense of what kind of support would actually be useful in Embedding Atlas.

@peter-gy peter-gy closed this Mar 13, 2026
@peter-gy peter-gy deleted the feat/custom-text-projector-callback branch March 13, 2026 07:26