Hi,
My goal is to train Sparse Autoencoders (SAEs). If the team could share the training dataset on Hugging Face, it would be possible to train SAEs that learn features from it.
The problem:
- To train SAEs, you need the dataset that was used to train SONAR (i.e., SAEs require the same data distribution), or the dataset that was used to train NLLB. However, the allenai/nllb dataset available on Hugging Face is noisy.
