This project demonstrates how to use ReturnZero's Speech-to-Text API with Kotlin using gRPC for streaming audio recognition.
YOUTUBE_ID=''
yt-dlp --audio-quality 0 --audio-format wav --extract-audio https://www.youtube.com/watch\?v\=${YOUTUBE_ID} -o ${YOUTUBE_ID}.wav
- JDK 17 or higher
- Gradle (for local execution)
- Docker (for containerized execution)
- RTZR AI API credentials (client ID and client secret)
- Audio file for testing (WAV, AU, or AIFF format)
├── Dockerfile                              # Docker configuration
├── build.gradle.kts                        # Gradle build configuration
├── settings.gradle.kts                     # Gradle settings
├── run-local.sh                            # Script to run locally with Gradle
├── build-and-run.sh                        # Script to build and run with Docker
├── src/
│   ├── main/
│   │   ├── kotlin/
│   │   │   └── ai/
│   │   │       └── returnzero/
│   │   │           ├── Main.kt             # Main application
│   │   │           ├── ReturnZeroClient.kt # API client
│   │   │           └── FileStreamer.kt     # Audio file streaming utility
│   │   └── proto/
│   │       └── vito-stt-client.proto       # gRPC protocol definition
You can run this application in two ways:
Run directly on your local machine using Gradle.
- Set your API credentials as environment variables:
  export RTZR_CLIENT_ID="your_client_id"
  export RTZR_CLIENT_SECRET="your_client_secret"
- Make the script executable:
  chmod +x run-local.sh
- Run the application with an audio file:
  # Normal mode (play once)
  ./run-local.sh /path/to/your/audio/file.wav
  # Repeat mode (infinite loop)
  ./run-local.sh --repeat /path/to/your/audio/file.wav
Build and run in a Docker container.
- Set your API credentials as environment variables:
  export RTZR_CLIENT_ID="your_client_id"
  export RTZR_CLIENT_SECRET="your_client_secret"
- Make the script executable:
  chmod +x build-and-run.sh
- Run the application with an audio file:
  # Normal mode (play once)
  ./build-and-run.sh /path/to/your/audio/file.wav
  # Repeat mode (infinite loop)
  ./build-and-run.sh --repeat /path/to/your/audio/file.wav
This script will:
- Build a Docker image for the application
- Mount the directory containing your audio file
- Run the container with your API credentials
- Process the audio file and output the transcription results
The application supports the following command line options:
- --repeat: Enable infinite loop mode for audio streaming. The audio file repeats continuously until the application is terminated.
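As an illustration of how the --repeat flag could be separated from the audio file path, here is a minimal sketch. The `CliOptions` class and `parseArgs` function are hypothetical names chosen for this example; the actual parsing in Main.kt may look different.

```kotlin
// Hypothetical holder for the parsed options (illustrative, not taken from Main.kt).
data class CliOptions(val repeat: Boolean, val audioPath: String)

// Splits the arguments into the optional --repeat flag and exactly one file path.
fun parseArgs(args: Array<String>): CliOptions {
    val options = args.filter { it.startsWith("--") }
    val positional = args.filter { !it.startsWith("--") }

    // Reject anything other than the single documented option.
    val unknown = options.filter { it != "--repeat" }
    require(unknown.isEmpty()) { "Unknown option(s): $unknown" }
    require(positional.size == 1) { "Usage: [--repeat] <audio-file>" }

    return CliOptions(repeat = "--repeat" in options, audioPath = positional.single())
}
```

Calling `parseArgs(arrayOf("--repeat", "sample.wav"))` yields an options object with `repeat` set to true, while a missing file path or an unrecognized option raises an `IllegalArgumentException`.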
- The application authenticates with the RTZR STT API to obtain an access token
- It establishes a gRPC connection to the streaming STT service
- The audio file is read and streamed in chunks, simulating real-time audio
- The API returns both interim and final transcription results
- Results are printed to the console as they are received
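The chunked streaming step above can be sketched roughly as follows. This is not the code in FileStreamer.kt: `chunkSizeBytes` and `streamInChunks` are hypothetical helpers, the `send` callback stands in for whatever writes a chunk to the gRPC request stream, and the sketch assumes 16-bit mono PCM and ignores any file header.

```kotlin
import java.io.InputStream

// Number of bytes covering `chunkMillis` of audio at `sampleRate` Hz.
// Assumes mono PCM with `bytesPerSample` bytes per sample (2 for LINEAR16).
fun chunkSizeBytes(sampleRate: Int, chunkMillis: Int, bytesPerSample: Int = 2): Int =
    sampleRate * bytesPerSample * chunkMillis / 1000

// Reads `input` in fixed-size chunks, passes each chunk to `send`, then sleeps
// so that data leaves at roughly the pace a live audio source would produce it.
fun streamInChunks(
    input: InputStream,
    chunkBytes: Int,
    chunkMillis: Long,
    send: (ByteArray) -> Unit,
) {
    require(chunkBytes > 0) { "chunkBytes must be positive" }
    val buffer = ByteArray(chunkBytes)
    while (true) {
        // readNBytes fills the buffer unless the stream ends first; 0 means end of stream.
        val read = input.readNBytes(buffer, 0, chunkBytes)
        if (read <= 0) break
        send(buffer.copyOf(read)) // the last chunk may be shorter than chunkBytes
        Thread.sleep(chunkMillis)
    }
}
```

With the 8000 Hz LINEAR16 settings used in this project, a 100 ms chunk works out to 1600 bytes. Pacing with `Thread.sleep` keeps the sketch simple; a coroutine-based client would more likely use `delay` inside a flow.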
To modify the speech recognition parameters, edit the DecoderConfig in Main.kt:
val config = DecoderConfig.newBuilder()
    .setSampleRate(8000) // Audio sample rate (Hz)
    .setEncoding(DecoderConfig.AudioEncoding.LINEAR16) // Audio encoding
    .setUseItn(true) // Inverse text normalization
    .setUseDisfluencyFilter(false) // Filter disfluencies
    .setUseProfanityFilter(false) // Filter profanity
    .setModelName("sommers_ko") // Language model to use
    .build()

You can enhance recognition for specific keywords by adding them to the configuration:
.addAllKeywords(listOf(
    "keyword1", // Default score: 2.0
    "keyword2:3.5", // Higher score: 3.5 (better recognition)
    "keyword3:-1" // Lower score: -1 (reduced recognition)
))

Notes for keyword boosting:
- Scores must be between -5.0 and 5.0
- Korean keywords must use Korean pronunciation (e.g., "에스티티" instead of "STT")
- Each keyword can be at most 20 characters long, and you can add up to 100 keywords
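Because these limits are easy to violate by accident, it can help to check a keyword list before passing it to the configuration. The following is a minimal sketch of such a check. `validateKeywords` is a hypothetical helper that is not part of this project or the API, and it assumes that Kotlin's `String.length` matches the way the service counts characters.

```kotlin
// Hypothetical helper: checks "word" or "word:score" entries against the limits
// listed above and returns a list of problems (empty when everything is valid).
fun validateKeywords(keywords: List<String>): List<String> {
    val errors = mutableListOf<String>()
    if (keywords.size > 100) {
        errors += "too many keywords: ${keywords.size} (max 100)"
    }
    for (entry in keywords) {
        // Everything before the last ':' is the keyword; without ':' it is the whole entry.
        val word = entry.substringBeforeLast(':')
        if (word.isEmpty() || word.length > 20) {
            errors += "'$entry': keyword must be 1 to 20 characters"
        }
        if (':' in entry) {
            val score = entry.substringAfterLast(':').toDoubleOrNull()
            if (score == null || score !in -5.0..5.0) {
                errors += "'$entry': score must be a number between -5.0 and 5.0"
            }
        }
    }
    return errors
}
```

The three example keywords shown earlier pass this check, while an entry such as "keyword:9" is reported because its score falls outside the allowed range.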
- Authentication Error: Verify your client ID and client secret are correct.
- File Format Error: Ensure your audio file is in one of the supported formats.
- Docker Issues: Make sure Docker is installed and running correctly.
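Authentication errors are often caused by a missing or empty environment variable, so a fail-fast check at startup can make them easier to diagnose. Below is a minimal sketch of such a check. `Credentials` and `loadCredentials` are hypothetical names, and the actual client may read its credentials differently.

```kotlin
// Hypothetical holder; the real client may store credentials differently.
data class Credentials(val clientId: String, val clientSecret: String)

// Reads both credentials from the environment (or from a supplied map, which
// makes the function easy to test) and fails with a clear message if one is missing.
fun loadCredentials(env: Map<String, String> = System.getenv()): Credentials {
    fun required(name: String): String {
        val value = env[name]?.trim().orEmpty()
        require(value.isNotEmpty()) { "$name is not set or is empty" }
        return value
    }
    return Credentials(
        clientId = required("RTZR_CLIENT_ID"),
        clientSecret = required("RTZR_CLIENT_SECRET"),
    )
}
```

Calling `loadCredentials()` before opening any connection turns a missing variable into an immediate `IllegalArgumentException` that names the variable, instead of a later authentication failure.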
This project is licensed under the MIT License - see the LICENSE file for details.