This project sets up Sesame's CSM-1b speech generation model on RunPod's serverless infrastructure, allowing you to generate high-quality speech from text prompts through an API.
- A Hugging Face account with access to sesame/csm-1b
- A RunPod account with API access
- Docker installed on your local machine
Before building the Docker image, verify that your Hugging Face token has access to the CSM-1b model:
```bash
pip install huggingface_hub
python test_hf_token.py --token YOUR_HF_TOKEN
```

If successful, you'll see confirmation that your token can access the model.
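The bundled `test_hf_token.py` performs this check; a minimal equivalent sketch using `huggingface_hub` is shown below (the function name `check_model_access` is illustrative, not part of the repo):

```python
def check_model_access(token: str, model_id: str = "sesame/csm-1b") -> bool:
    """Return True if the token can read the gated model's metadata.

    The import is deferred so the function fails (returns False) with a
    clear message when huggingface_hub is not installed, rather than at
    module import time.
    """
    try:
        from huggingface_hub import HfApi  # requires `pip install huggingface_hub`

        # model_info raises for missing auth or un-accepted gated-model terms
        HfApi(token=token).model_info(model_id)
        return True
    except Exception as exc:
        print(f"Access check failed: {exc}")
        return False
```

If this returns `False`, visit the model page on Hugging Face and accept the terms before building the image.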
- Clone this repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-directory>
  ```

- Build the Docker image:

  ```bash
  docker build -t csm-1b-runpod .
  ```

- Tag your image:

  ```bash
  docker tag csm-1b-runpod yourusername/csm-1b-runpod:latest
  ```

- Push to Docker Hub:

  ```bash
  docker login
  docker push yourusername/csm-1b-runpod:latest
  ```

- Log in to your RunPod account
- Go to Serverless > Endpoints
- Click "New Endpoint"
- Select "Docker Hub" and enter your image URL: `yourusername/csm-1b-runpod:latest`
- Choose a GPU type (A10 or better recommended)
- Set min/max workers based on your needs
- Add the following environment variable:
  - Key: `HF_TOKEN`
  - Value: `YOUR_HUGGING_FACE_TOKEN`
- Create the endpoint and wait for it to activate
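You can poll the endpoint's status before sending jobs. A minimal sketch, assuming RunPod's standard `/health` route for serverless endpoints (the function name is illustrative):

```python
import requests


def endpoint_health(endpoint_id: str, api_key: str) -> dict:
    """Fetch worker and job counts for a RunPod serverless endpoint.

    Assumes RunPod's documented GET /v2/{endpoint_id}/health route with
    Bearer-token authentication.
    """
    resp = requests.get(
        f"https://api.runpod.ai/v2/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Once the response shows ready workers, the endpoint is active and will accept requests.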
The included client script lets you interact with your serverless endpoint:
```bash
pip install requests pydub
python client.py --text "Hello from Sesame!" --endpoint YOUR_ENDPOINT_ID --api-key YOUR_RUNPOD_API_KEY
```

You can also set environment variables instead of passing arguments:

```bash
export RUNPOD_ENDPOINT_ID=YOUR_ENDPOINT_ID
export RUNPOD_API_KEY=YOUR_RUNPOD_API_KEY
python client.py --text "Hello from Sesame!"
```

Your serverless endpoint accepts the following JSON input:
```json
{
  "input": {
    "text": "Hello, this is a test.",
    "speaker": 0,
    "max_audio_length_ms": 10000,
    "context_audios": ["https://example.com/audio1.wav"],
    "context_texts": ["Previous utterance"],
    "context_speakers": [1]
  }
}
```

The response will include:
```json
{
  "audio_base64": "base64-encoded WAV audio",
  "sample_rate": 24000,
  "duration_seconds": 1.5
}
```

- Model Loading Errors: Ensure your Hugging Face token has accepted the model terms.
- CUDA Out of Memory: Try a larger GPU instance or reduce the max audio length.
- Slow Generation: The initial request will be slower as the model loads into memory.
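As an alternative to `client.py`, the documented request and response schemas can be exercised directly. A minimal sketch, assuming RunPod's synchronous `/runsync` route and its usual envelope that wraps the handler's return value under an `"output"` key:

```python
import base64

import requests


def synthesize(text: str, endpoint_id: str, api_key: str,
               speaker: int = 0, max_audio_length_ms: int = 10000) -> dict:
    """Send a synchronous generation request and return the handler output."""
    resp = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"input": {"text": text, "speaker": speaker,
                        "max_audio_length_ms": max_audio_length_ms}},
        timeout=300,
    )
    resp.raise_for_status()
    # RunPod typically nests the handler's return value under "output"
    return resp.json()["output"]


def save_wav(output: dict, path: str) -> int:
    """Decode the response's base64 audio to a WAV file; return bytes written."""
    data = base64.b64decode(output["audio_base64"])
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Example usage (requires your endpoint ID and RunPod API key):
#   out = synthesize("Hello from Sesame!", "YOUR_ENDPOINT_ID", "YOUR_RUNPOD_API_KEY")
#   save_wav(out, "output.wav")
#   print(out["duration_seconds"], out["sample_rate"])
```

Remember that the first request after a cold start will also pay the model-loading cost noted above.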
This project uses the Apache-2.0 license, in accordance with the CSM model license.