Technical Integration: Supertonic TTS + Rubber Band WASM
This document outlines the architecture for integrating a singing voice synthesizer into the Hyphon DAW. The goal is to combine high-latency neural phoneme generation (Supertonic) with low-latency, phase-coherent time-stretching (Rubber Band Library) so that the vocal instrument feels "zero-latency" to the player.
Supertonic TTS:
- Role: Generates the raw "prototypical" utterance of lyrics.
- Engine: ONNX Runtime Web (WebGPU/WASM).
- Output: Flat speech audio (not pitched to score).
- Constraints: Non-autoregressive flow matching. Must run in a Worker or background thread to prevent UI locking.
Rubber Band WASM:
- Role: Forces the speech sample to conform to the MIDI score's pitch and duration.
- Key Feature: Formant preservation (avoids "chipmunk" effect).
- Engine: C++ Library compiled to WebAssembly (WASM).
- Processing: Phase Vocoder with adaptive time-domain transient handling.
- Web Audio API: Processes audio in fixed blocks (render quanta) of 128 frames.
- Rubber Band DSP: Requires variable/large blocks (e.g., 1024+ frames) for frequency resolution.
- Constraint: AudioWorkletProcessor.process() cannot block.
We will implement a Single-Producer Single-Consumer (SPSC) Ring Buffer using SharedArrayBuffer and Atomics.
- Input Ring Buffer: Transfers raw audio from the Decoder/Main Thread -> AudioWorklet.
- Output Ring Buffer: Transfers processed audio from AudioWorklet -> Playback.
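The ring buffer described above could look roughly like the following minimal sketch. The class name SpscRingBuffer, the memory layout (two Int32 indices followed by Float32 samples) and the mono-only storage are assumptions made for illustration, not the project's actual implementation.

```typescript
// Hypothetical SPSC ring buffer for mono Float32 samples.
// Assumed layout: bytes 0..7 hold two Int32 indices, the rest holds samples.
class SpscRingBuffer {
  private readonly indices: Int32Array; // [0] = read index, [1] = write index
  private readonly data: Float32Array;
  private readonly capacity: number;

  constructor(sab: SharedArrayBuffer) {
    this.indices = new Int32Array(sab, 0, 2);
    this.data = new Float32Array(sab, 8);
    this.capacity = this.data.length;
  }

  // One slot is kept empty so that read === write always means "empty".
  static allocate(frames: number): SharedArrayBuffer {
    return new SharedArrayBuffer(8 + (frames + 1) * 4);
  }

  availableRead(): number {
    const r = Atomics.load(this.indices, 0);
    const w = Atomics.load(this.indices, 1);
    return (w - r + this.capacity) % this.capacity;
  }

  availableWrite(): number {
    return this.capacity - 1 - this.availableRead();
  }

  // Producer side only. Returns the number of samples actually written.
  push(src: Float32Array): number {
    const n = Math.min(src.length, this.availableWrite());
    let w = Atomics.load(this.indices, 1);
    for (let i = 0; i < n; i++) {
      this.data[w] = src[i];
      w = (w + 1) % this.capacity;
    }
    Atomics.store(this.indices, 1, w); // publish only after the samples are in place
    return n;
  }

  // Consumer side only. Never blocks; returns the number of samples read.
  pop(dst: Float32Array): number {
    const n = Math.min(dst.length, this.availableRead());
    let r = Atomics.load(this.indices, 0);
    for (let i = 0; i < n; i++) {
      dst[i] = this.data[r];
      r = (r + 1) % this.capacity;
    }
    Atomics.store(this.indices, 0, r); // release the slots back to the producer
    return n;
  }
}
```

Neither push nor pop waits: each transfers as much as currently fits and returns the count, which is what allows pop to be called from inside AudioWorkletProcessor.process(). Note that browsers only expose SharedArrayBuffer to cross-origin isolated pages.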
Data Flow:
- SupertonicService generates raw WAV.
- Audio is decoded and pushed to InputRingBuffer.
- AudioWorklet pulls data into an internal accumulator.
- When accumulator >= 1024 frames, pass to rubberband-wasm.
- Output pushed to OutputRingBuffer.
- AudioWorklet pops 128 frames for the browser output.
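The accumulate-then-process step in this flow can be sketched as a small block adapter. Everything here is illustrative: BlockAdapter and StretchFn are hypothetical names, and the stretch callback merely stands in for the rubberband-wasm call, whose real signature is not specified in this document.

```typescript
const QUANTUM = 128;    // Web Audio render quantum
const DSP_BLOCK = 1024; // block size handed to the stretcher (from the flow above)

// Hypothetical stand-in for the rubberband-wasm processing call.
type StretchFn = (block: Float32Array) => Float32Array;

class BlockAdapter {
  private readonly acc = new Float32Array(DSP_BLOCK);
  private accLen = 0;
  // Plain array FIFO for brevity; a real worklet would preallocate to avoid GC.
  private readonly out: number[] = [];
  private readonly stretch: StretchFn;

  constructor(stretch: StretchFn) {
    this.stretch = stretch;
  }

  // Accumulate input of any length; run the stretcher once per full DSP block.
  feed(input: Float32Array): void {
    let offset = 0;
    while (offset < input.length) {
      const n = Math.min(DSP_BLOCK - this.accLen, input.length - offset);
      this.acc.set(input.subarray(offset, offset + n), this.accLen);
      this.accLen += n;
      offset += n;
      if (this.accLen === DSP_BLOCK) {
        // The callback must not retain `acc`; it is reused for the next block.
        const processed = this.stretch(this.acc);
        for (let i = 0; i < processed.length; i++) this.out.push(processed[i]);
        this.accLen = 0;
      }
    }
  }

  // Fill one output block without blocking; zero-pads on underrun.
  // Returns true when the whole block was filled with processed audio.
  render(dst: Float32Array): boolean {
    const n = Math.min(dst.length, this.out.length);
    for (let i = 0; i < n; i++) dst[i] = this.out[i];
    this.out.splice(0, n);
    dst.fill(0, n);
    return n === dst.length;
  }
}
```

With these sizes the first processed audio appears only after eight render quanta (1024 / 128). Assuming a 48 kHz sample rate, that is about 21 ms of buffering before any latency inside the stretcher itself is counted.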
To prevent "slurred" consonants when stretching words to fit long notes:
- Consonants: Fixed duration (stretch ratio ~1.0).
- Vowels: Elastic duration (absorbs the remaining time).
- Implementation: Use Rubber Band's "Time Map" feature to apply variable stretch ratios across the sample.
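As a sketch of how such a map might be derived, the function below keeps consonants at ratio 1.0 and lets vowels share the remaining note length proportionally. It assumes phoneme boundaries are already known; the Segment and KeyFrame types and the buildTimeMap name are hypothetical. The Rubber Band C++ API accepts source-to-output key frames through RubberBandStretcher::setKeyFrameMap; how rubberband-wasm exposes this is not confirmed here.

```typescript
// Hypothetical alignment data: one entry per phoneme, durations in sample frames.
interface Segment {
  kind: "consonant" | "vowel";
  frames: number;
}

// A key frame maps a source frame position to an output frame position.
interface KeyFrame {
  source: number;
  target: number;
}

function buildTimeMap(segments: Segment[], targetFrames: number): KeyFrame[] {
  let consonantFrames = 0;
  let vowelFrames = 0;
  for (const seg of segments) {
    if (seg.kind === "consonant") consonantFrames += seg.frames;
    else vowelFrames += seg.frames;
  }
  if (vowelFrames === 0) throw new Error("no vowel available to absorb the stretch");

  // Vowels share whatever time the consonants leave over. If the note is
  // shorter than the consonants alone, the ratio clamps to 0 and the result
  // overshoots targetFrames; a real implementation needs a policy for that.
  const vowelRatio = Math.max(0, targetFrames - consonantFrames) / vowelFrames;

  const map: KeyFrame[] = [{ source: 0, target: 0 }];
  let source = 0;
  let target = 0;
  for (const seg of segments) {
    source += seg.frames;
    target += seg.kind === "consonant" ? seg.frames : seg.frames * vowelRatio;
    map.push({ source, target: Math.round(target) });
  }
  return map;
}
```

For example, a 2000-frame consonant, a 4000-frame vowel and a 1000-frame consonant stretched to a 15000-frame note give the vowel a ratio of 3, with key frames (0, 0), (2000, 2000), (6000, 14000) and (7000, 15000).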
Pitch modulation:
- Vibrato: Implemented via LFO (Sine Oscillator) modulating the pitch target in real-time.
- Portamento: Calculated pitch glides updated at control rate (e.g., 10 ms) sent to Rubber Band's setPitch.
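A minimal sketch of the control-rate pitch computation follows. The VoiceParams fields, the linear glide shape and the function names are assumptions for illustration. It also assumes the pitch of the raw speech can be expressed as a (possibly fractional) MIDI note, so that the value handed to the stretcher is a frequency ratio.

```typescript
const CONTROL_INTERVAL_S = 0.010; // 10 ms control rate, as in the example above

// Hypothetical per-note parameters.
interface VoiceParams {
  fromNote: number;     // MIDI note the glide starts from
  toNote: number;       // MIDI note the glide ends on
  glideSeconds: number; // portamento time; 0 means jump immediately
  vibratoHz: number;    // LFO rate
  vibratoCents: number; // LFO depth (peak deviation in cents)
}

// Target pitch as a fractional MIDI note, t seconds after note-on.
function targetNote(p: VoiceParams, t: number): number {
  // Linear glide in semitones, clamped once the destination is reached.
  const progress = p.glideSeconds > 0 ? Math.min(t / p.glideSeconds, 1) : 1;
  const glide = p.fromNote + (p.toNote - p.fromNote) * progress;
  // Sine LFO; 100 cents = 1 semitone.
  const vibrato = (p.vibratoCents / 100) * Math.sin(2 * Math.PI * p.vibratoHz * t);
  return glide + vibrato;
}

// Frequency ratio that moves audio at sourceNote to the target note.
function pitchScale(target: number, sourceNote: number): number {
  return Math.pow(2, (target - sourceNote) / 12);
}

// Pitch ratios for the first `ticks` control updates of a note.
function pitchCurve(p: VoiceParams, sourceNote: number, ticks: number): number[] {
  const curve: number[] = [];
  for (let k = 0; k < ticks; k++) {
    curve.push(pitchScale(targetNote(p, k * CONTROL_INTERVAL_S), sourceNote));
  }
  return curve;
}
```

Each value in the curve would be sent to the stretcher's pitch setter once per control tick. Working in semitones and converting to a ratio last keeps both the glide and the vibrato perceptually even across the pitch range.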