This project sets up Sesame's CSM-1b speech generation model on RunPod's serverless infrastructure, allowing you to generate high-quality speech from text prompts through an API.
- A Hugging Face account with access to sesame/csm-1b
- A RunPod account with API access
- Docker installed on your local machine
Before building the Docker image, verify that your Hugging Face token has access to the CSM-1b model:
```bash
pip install huggingface_hub
python test_hf_token.py --token YOUR_HF_TOKEN
```

If successful, you'll see confirmation that your token can access the model.
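The bundled `test_hf_token.py` performs this check; a minimal equivalent sketch using `huggingface_hub` is shown below (the function name `check_model_access` is illustrative, not part of the repo):

```python
def check_model_access(token: str, model_id: str = "sesame/csm-1b") -> bool:
    """Return True if the token can read the gated model's metadata.

    The import is deferred so the function fails (returns False) with a
    clear message when huggingface_hub is not installed, rather than at
    module import time.
    """
    try:
        from huggingface_hub import HfApi  # requires `pip install huggingface_hub`

        # model_info raises for missing auth or un-accepted gated-model terms
        HfApi(token=token).model_info(model_id)
        return True
    except Exception as exc:
        print(f"Access check failed: {exc}")
        return False
```

If this returns `False`, visit the model page on Hugging Face and accept the terms before building the image.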
- Clone this repository:

  ```bash
  git clone <your-repo-url>
  cd <your-repo-directory>
  ```

- Build the Docker image:

  ```bash
  docker build -t csm-1b-runpod .
  ```

- Tag your image:

  ```bash
  docker tag csm-1b-runpod yourusername/csm-1b-runpod:latest
  ```

- Push to Docker Hub:

  ```bash
  docker login
  docker push yourusername/csm-1b-runpod:latest
  ```

- Log in to your RunPod account
- Go to Serverless > Endpoints
- Click "New Endpoint"
- Select "Docker Hub" and enter your image URL: `yourusername/csm-1b-runpod:latest`
- Choose a GPU type (A10 or better recommended)
- Set min/max workers based on your needs
- Add the following environment variable:
  - Key: `HF_TOKEN`
  - Value: `YOUR_HUGGING_FACE_TOKEN`
- Create the endpoint and wait for it to activate
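You can poll the endpoint's status before sending jobs. A minimal sketch, assuming RunPod's standard `/health` route for serverless endpoints (the function name is illustrative):

```python
import requests


def endpoint_health(endpoint_id: str, api_key: str) -> dict:
    """Fetch worker and job counts for a RunPod serverless endpoint.

    Assumes RunPod's documented GET /v2/{endpoint_id}/health route with
    Bearer-token authentication.
    """
    resp = requests.get(
        f"https://api.runpod.ai/v2/{endpoint_id}/health",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

Once the response shows ready workers, the endpoint is active and will accept requests.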
The included client script lets you interact with your serverless endpoint:
```bash
pip install requests pydub
python client.py --text "Hello from Sesame!" --endpoint YOUR_ENDPOINT_ID --api-key YOUR_RUNPOD_API_KEY
```

You can also set environment variables instead of passing arguments:

```bash
export RUNPOD_ENDPOINT_ID=YOUR_ENDPOINT_ID
export RUNPOD_API_KEY=YOUR_RUNPOD_API_KEY
python client.py --text "Hello from Sesame!"
```

Your serverless endpoint accepts the following JSON input:
```json
{
  "input": {
    "text": "Hello, this is a test.",
    "speaker": 0,
    "max_audio_length_ms": 10000,
    "context_audios": ["https://example.com/audio1.wav"],
    "context_texts": ["Previous utterance"],
    "context_speakers": [1]
  }
}
```

The response will include:
```json
{
  "audio_base64": "base64-encoded WAV audio",
  "sample_rate": 24000,
  "duration_seconds": 1.5
}
```

- Model Loading Errors: Ensure your Hugging Face token has accepted the model terms.
- CUDA Out of Memory: Try a larger GPU instance or reduce the max audio length.
- Slow Generation: The initial request will be slower as the model loads into memory.
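As an alternative to `client.py`, the documented request and response schemas can be exercised directly. A minimal sketch, assuming RunPod's synchronous `/runsync` route and its usual envelope that wraps the handler's return value under an `"output"` key:

```python
import base64

import requests


def synthesize(text: str, endpoint_id: str, api_key: str,
               speaker: int = 0, max_audio_length_ms: int = 10000) -> dict:
    """Send a synchronous generation request and return the handler output."""
    resp = requests.post(
        f"https://api.runpod.ai/v2/{endpoint_id}/runsync",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"input": {"text": text, "speaker": speaker,
                        "max_audio_length_ms": max_audio_length_ms}},
        timeout=300,
    )
    resp.raise_for_status()
    # RunPod typically nests the handler's return value under "output"
    return resp.json()["output"]


def save_wav(output: dict, path: str) -> int:
    """Decode the response's base64 audio to a WAV file; return bytes written."""
    data = base64.b64decode(output["audio_base64"])
    with open(path, "wb") as f:
        f.write(data)
    return len(data)


# Example usage (requires your endpoint ID and RunPod API key):
#   out = synthesize("Hello from Sesame!", "YOUR_ENDPOINT_ID", "YOUR_RUNPOD_API_KEY")
#   save_wav(out, "output.wav")
#   print(out["duration_seconds"], out["sample_rate"])
```

Remember that the first request after a cold start will also pay the model-loading cost noted above.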
This project uses the Apache-2.0 license, in accordance with the CSM model license.