vito-ai/kotlin-sample

ReturnZero Speech-to-Text Kotlin Client

This project demonstrates how to use ReturnZero's Speech-to-Text API with Kotlin using gRPC for streaming audio recognition.

Download Audio Example

YOUTUBE_ID=''
yt-dlp --audio-quality 0 --audio-format wav --extract-audio "https://www.youtube.com/watch?v=${YOUTUBE_ID}" -o "${YOUTUBE_ID}.wav"

Prerequisites

  • JDK 17 or higher
  • Gradle (for local execution)
  • Docker (for containerized execution)
  • RTZR AI API credentials (client ID and client secret)
  • Audio file for testing (WAV, AU, or AIFF format)

Project Structure

├── Dockerfile                         # Docker configuration
├── build.gradle.kts                   # Gradle build configuration
├── settings.gradle.kts                # Gradle settings
├── run-local.sh                       # Script to run locally with Gradle
├── build-and-run.sh                   # Script to build and run with Docker
├── src/
│   ├── main/
│   │   ├── kotlin/
│   │   │   └── ai/
│   │   │       └── returnzero/
│   │   │           ├── Main.kt                # Main application
│   │   │           ├── ReturnZeroClient.kt    # API client
│   │   │           └── FileStreamer.kt        # Audio file streaming utility
│   │   └── proto/
│   │       └── vito-stt-client.proto          # gRPC protocol definition

Running the Application

You can run this application in two ways:

1. Local Execution (run-local.sh)

Run directly on your local machine using Gradle.

  1. Set your API credentials as environment variables:

    export RTZR_CLIENT_ID="your_client_id"
    export RTZR_CLIENT_SECRET="your_client_secret"
  2. Make the script executable:

    chmod +x run-local.sh
  3. Run the application with an audio file:

    # Normal mode (play once)
    ./run-local.sh /path/to/your/audio/file.wav
    
    # Repeat mode (infinite loop)
    ./run-local.sh --repeat /path/to/your/audio/file.wav

2. Docker Execution (build-and-run.sh)

Build and run in a Docker container.

  1. Set your API credentials as environment variables:

    export RTZR_CLIENT_ID="your_client_id"
    export RTZR_CLIENT_SECRET="your_client_secret"
  2. Make the script executable:

    chmod +x build-and-run.sh
  3. Run the application with an audio file:

    # Normal mode (play once)
    ./build-and-run.sh /path/to/your/audio/file.wav
    
    # Repeat mode (infinite loop)
    ./build-and-run.sh --repeat /path/to/your/audio/file.wav

This script will:

  • Build a Docker image for the application
  • Mount the directory containing your audio file
  • Run the container with your API credentials
  • Process the audio file and output the transcription results

Command Line Options

The application supports the following command line options:

  • --repeat: Enable infinite loop mode for audio streaming. The audio file will be repeated continuously until the application is terminated.
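As a rough illustration of how the single flag and the audio path could be separated, here is a minimal sketch. The `CliOptions` class and `parseArgs` function are hypothetical names chosen for this example; the actual parsing in Main.kt may differ.

```kotlin
// Hypothetical holder for the parsed options; not taken from Main.kt.
data class CliOptions(val repeat: Boolean, val audioPath: String)

fun parseArgs(args: Array<String>): CliOptions {
    // The flag may appear before or after the audio path.
    val repeat = "--repeat" in args
    // Anything that is not the flag is treated as a positional argument.
    val positional = args.filter { it != "--repeat" }
    require(positional.size == 1) { "Usage: [--repeat] <audio-file>" }
    return CliOptions(repeat, positional.single())
}
```

Calling `parseArgs(arrayOf("--repeat", "file.wav"))` yields `CliOptions(repeat = true, audioPath = "file.wav")`; a missing path fails with an `IllegalArgumentException` carrying the usage string.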

How It Works

  1. The application authenticates with the RTZR STT API to obtain an access token
  2. It establishes a gRPC connection to the streaming STT service
  3. The audio file is read and streamed in chunks, simulating real-time audio
  4. The API returns both interim and final transcription results
  5. Results are printed to the console as they are received
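Step 3 can be sketched as follows. This is not the code in FileStreamer.kt. `chunkSizeBytes` and `streamInChunks` are hypothetical helpers, the `send` callback stands in for whatever writes to the gRPC request stream, and the example assumes 16-bit mono PCM.

```kotlin
import java.io.InputStream

// Bytes needed to hold `chunkMillis` of PCM audio at `sampleRate` Hz
// (assumes mono audio; 2 bytes per sample corresponds to LINEAR16).
fun chunkSizeBytes(sampleRate: Int, chunkMillis: Int, bytesPerSample: Int = 2): Int =
    sampleRate * bytesPerSample * chunkMillis / 1000

// Reads `input` in fixed-size chunks, hands each chunk to `send`, and sleeps
// `pauseMillis` between chunks to approximate real-time playback.
// Returns the number of chunks sent.
fun streamInChunks(
    input: InputStream,
    chunkSize: Int,
    pauseMillis: Long,
    send: (ByteArray) -> Unit
): Int {
    require(chunkSize > 0) { "chunkSize must be positive" }
    val buffer = ByteArray(chunkSize)
    var chunks = 0
    while (true) {
        // readNBytes fills the buffer unless the stream ends first; 0 means EOF.
        val read = input.readNBytes(buffer, 0, chunkSize)
        if (read == 0) break
        send(buffer.copyOf(read)) // copy so the callback never sees a reused buffer
        chunks++
        if (pauseMillis > 0) Thread.sleep(pauseMillis)
    }
    return chunks
}
```

With an 8000 Hz source and 100 ms chunks, `chunkSizeBytes(8000, 100)` is 1600 bytes, so pausing for roughly 100 ms between sends keeps the upload close to playback speed.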

Customizing

Speech Recognition Parameters

To modify the speech recognition parameters, edit the DecoderConfig in Main.kt:

val config = DecoderConfig.newBuilder()
    .setSampleRate(8000)                       // Audio sample rate (Hz)
    .setEncoding(DecoderConfig.AudioEncoding.LINEAR16)  // Audio encoding
    .setUseItn(true)                          // Inverse text normalization
    .setUseDisfluencyFilter(false)            // Filter disfluencies
    .setUseProfanityFilter(false)             // Filter profanity
    .setModelName("sommers_ko")               // Language model to use
    .build()
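The sample rate passed to the configuration should match the audio file. One way to check this before editing the value is to read the file header with the JDK's `javax.sound.sampled` API, which parses WAV, AU, and AIFF. The sketch below is an illustration only; `AudioInfo` and `inspectAudio` are hypothetical names, not part of the sample.

```kotlin
import java.io.File
import javax.sound.sampled.AudioSystem

// The properties that need to line up with the decoder configuration.
data class AudioInfo(val sampleRate: Int, val bitsPerSample: Int, val channels: Int)

fun inspectAudio(file: File): AudioInfo {
    // Throws UnsupportedAudioFileException if the JDK cannot parse the file.
    val format = AudioSystem.getAudioFileFormat(file).format
    return AudioInfo(
        sampleRate = format.sampleRate.toInt(),
        bitsPerSample = format.sampleSizeInBits,
        channels = format.channels
    )
}
```

If `inspectAudio(File("file.wav")).sampleRate` reports 16000, for example, the value given to `setSampleRate` would need to change from 8000 to 16000, or the file would need to be resampled.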

Keyword Boosting

You can enhance recognition for specific keywords by adding them to the configuration:

.addAllKeywords(listOf(
    "keyword1",          // Default score: 2.0
    "keyword2:3.5",      // Higher score: 3.5 (better recognition)
    "keyword3:-1"        // Lower score: -1 (reduced recognition)
))

Notes for keyword boosting:

  • Scores must be between -5.0 and 5.0
  • Korean keywords must use Korean pronunciation (e.g., "에스티티" instead of "STT")
  • Each keyword can be at most 20 characters long, and you can add up to 100 keywords
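The limits above can be checked locally before a request is sent. The following is a minimal sketch, not part of the sample. `validateKeywords` is a hypothetical helper, and it assumes the 20-character limit counts Kotlin `String.length` units and that the score follows the last colon in the entry.

```kotlin
// Returns human-readable problems; an empty list means every keyword passes.
fun validateKeywords(keywords: List<String>): List<String> {
    val problems = mutableListOf<String>()
    if (keywords.size > 100) {
        problems += "too many keywords: ${keywords.size} (max 100)"
    }
    for (entry in keywords) {
        // "word:score" carries an explicit score; a bare "word" uses the default.
        val word = entry.substringBeforeLast(':')
        val scoreText = if (':' in entry) entry.substringAfterLast(':') else null
        if (word.isEmpty() || word.length > 20) {
            problems += "'$entry': keyword must be 1 to 20 characters"
        }
        if (scoreText != null) {
            val score = scoreText.toDoubleOrNull()
            if (score == null || score !in -5.0..5.0) {
                problems += "'$entry': score must be a number between -5.0 and 5.0"
            }
        }
    }
    return problems
}
```

The three example keywords shown earlier pass this check, while an entry such as `"keyword:9"` is reported because its score falls outside the allowed range.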

Troubleshooting

  • Authentication Error: Verify your client ID and client secret are correct.
  • File Format Error: Ensure your audio file is in one of the supported formats.
  • Docker Issues: Make sure Docker is installed and running correctly.
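A quick way to rule out the first item is to confirm that both environment variables are set before starting the client. The helper below is a hypothetical sketch. It only checks that the values are present and cannot tell whether the credentials are valid.

```kotlin
// Returns the names of required variables that are missing or blank in `env`.
// Defaults to the process environment; a map can be passed in for testing.
fun missingCredentials(env: Map<String, String> = System.getenv()): List<String> =
    listOf("RTZR_CLIENT_ID", "RTZR_CLIENT_SECRET").filter { env[it].isNullOrBlank() }
```

An empty result means both variables are visible to the process. A non-empty result lists the names that still need to be exported in the current shell.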

License

This project is licensed under the MIT License - see the LICENSE file for details.
