🚀 The feature, motivation and pitch
🤗 Hello! Thank you for your work!
I see model configurations in this repo that work with certain modalities, and that is great.
I have a question, though: what if I have a pretrained encoder for another modality (e.g., audio) and data for training (audio-text pairs and audio-image pairs)?
- How can I train a model that will be able to solve tasks with my new modality?
- In other words, which components should I use to fuse the new modality with the existing ones? Should I implement a new model, or can I use existing components as fusers?
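To make the question concrete, here is a minimal sketch of what I have in mind: embeddings from N frozen pretrained encoders are concatenated and projected into a shared space. All names here (`ConcatFuser`, `fuse`) are illustrative only, not part of this repo's API; the sketch uses NumPy to stay self-contained.

```python
import numpy as np

class ConcatFuser:
    """Hypothetical fuser: concatenate per-modality embeddings,
    then apply one learnable linear projection into a shared space
    (the projection would be trained on the paired data)."""

    def __init__(self, input_dims, fused_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Projection from the concatenated embedding to the fused space.
        self.W = rng.normal(size=(sum(input_dims), fused_dim)) * 0.02
        self.b = np.zeros(fused_dim)

    def fuse(self, embeddings):
        # embeddings: list of (batch, dim_i) arrays, one per encoder.
        concat = np.concatenate(embeddings, axis=1)
        return concat @ self.W + self.b

# Example: pretend the audio and text encoders emit 128-d and 256-d embeddings.
audio_emb = np.ones((4, 128))
text_emb = np.ones((4, 256))
fuser = ConcatFuser(input_dims=[128, 256], fused_dim=64)
fused = fuser.fuse([audio_emb, text_emb])
print(fused.shape)  # (4, 64)
```

Is there an existing component in the repo that plays this fuser role, so I only need to plug in my own encoders?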
Alternatives
No response
Additional context
It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.