Conversation

@arunasrivastava
Contributor

PR for decoupled streaming logic

src/server.py Outdated
# Constants
DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320 # 20ms at 16kHz
Member

Nit: the 320 (stride) and the 400 (receptive field) don't have to be hardcoded. You can calculate them like this:

import torch
from transformers import Wav2Vec2ForCTC

def calculate_cnn_window(model: Wav2Vec2ForCTC):
    """Derive the feature extractor's receptive field and stride
    (in samples) from its conv layers instead of hardcoding 400/320."""
    receptive_field = 1
    stride = 1
    for conv_layer in model.wav2vec2.feature_extractor.conv_layers:
        assert hasattr(conv_layer, "conv")
        conv = conv_layer.conv
        assert isinstance(conv, torch.nn.Conv1d)
        # Each layer widens the window by (kernel - 1) earlier strides,
        # then multiplies the overall hop by its own stride.
        receptive_field += (conv.kernel_size[0] - 1) * stride
        stride *= conv.stride[0]
    return receptive_field, stride
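
As a sanity check, the same recurrence applied by hand to the standard wav2vec2-base conv config recovers the hardcoded values (the kernel sizes and strides below are an assumption based on the base model; read them off your model's config in practice):

```python
# Same recurrence as calculate_cnn_window, applied to the standard
# wav2vec2-base feature-extractor config (an assumption; check your
# model's conv_kernel / conv_stride), so it runs without loading a model.
kernel_sizes = [10, 3, 3, 3, 3, 2, 2]
conv_strides = [5, 2, 2, 2, 2, 2, 2]

receptive_field = 1
stride = 1
for k, s in zip(kernel_sizes, conv_strides):
    receptive_field += (k - 1) * stride
    stride *= s

print(receptive_field, stride)  # 400 320 -> 25ms window, 20ms hop at 16kHz
```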

Contributor Author

Is there zero performance benefit from hardcoding?

Member

Not in any way that matters. If you compute these constants once on startup and store them in the same module-level variables, you are looking at about 5.7% of a millisecond of extra server startup time. A handful of multiplications like this is so fast that you wouldn't even be able to perceive it if you re-ran the computation on every chunk (no need to do that, though).

Most compiled languages can do some optimizations on compile-time constants (constant folding, division through reciprocal multiplication, etc.), but Python doesn't really do this due to its more interpreted nature. Moreover, even in compiled languages, it is almost never the right move to sacrifice readability and flexibility for the minuscule gain of hardcoding everything. I encourage you to run some experiments and measure the performance yourself.
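
For example, a quick measurement with `timeit` (a minimal sketch; the layer shapes are hardcoded stand-ins so it runs without loading the model):

```python
import timeit

def calc_window():
    # Stand-in for calculate_cnn_window: same arithmetic, no model needed.
    # (kernel, stride) pairs assume the wav2vec2-base conv config.
    receptive_field, stride = 1, 1
    for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]:
        receptive_field += (k - 1) * stride
        stride *= s
    return receptive_field, stride

n = 10_000
total = timeit.timeit(calc_window, number=n)
print(f"{total / n * 1e6:.3f} us per call")
```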

Contributor Author

Cool, I will avoid hardcoded values where readable, fast functions can be used instead.

break

buffer = b"" # Clear the buffer
# Process 20ms chunks
Member

The 20ms is the stride. The CNN actually has a receptive field of 25ms (400 samples). By only ever feeding it 20ms chunks and padding them to fit the 25ms (MIN_LEN_SAMPLES), you never allow it to read all the audio it needs to perform equivalently to running the full model once at the end. You'll want to let it collect some multiple of 320 samples + 80 samples in each chunk, or, if you prefer to allow the last CNN receptive field to be incomplete, you should recompute it when more audio comes in.
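
One way to sketch that buffering (hypothetical names; `take_processable` and the constants are placeholders, not code from this PR): accumulate incoming bytes, hand the model the largest span covering whole receptive fields, and carry the overlapping tail forward for the next call.

```python
RECEPTIVE_FIELD_SAMPLES = 400  # 25ms at 16kHz (hypothetical constants)
STRIDE_SAMPLES = 320           # 20ms at 16kHz

def take_processable(buffer: bytes, bytes_per_sample: int = 2):
    """Split the buffer into a chunk the CNN can fully consume and a
    remainder to keep for the next call. The chunk covers n complete
    receptive fields: (n - 1) * stride + receptive_field samples."""
    n_samples = len(buffer) // bytes_per_sample
    if n_samples < RECEPTIVE_FIELD_SAMPLES:
        return b"", buffer  # not enough audio for even one full window
    n_frames = 1 + (n_samples - RECEPTIVE_FIELD_SAMPLES) // STRIDE_SAMPLES
    used = (n_frames - 1) * STRIDE_SAMPLES + RECEPTIVE_FIELD_SAMPLES
    chunk = buffer[: used * bytes_per_sample]
    # Keep the tail the next window needs: it overlaps the processed
    # chunk by receptive_field - stride (80) samples.
    keep_from = n_frames * STRIDE_SAMPLES
    remainder = buffer[keep_from * bytes_per_sample :]
    return chunk, remainder
```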

DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320 # 20ms at 16kHz
TRANSFORMER_INTERVAL = 30
Member

We'll definitely want this to be adaptive, though it's not an immediate priority. Make sure you write good benchmarking code first so you can measure whether each change is an improvement. Then we can also combine it with the VAD and local agreement optimizations.

Member

Wrote some suggested utils for benchmarking code here. Make sure to add accuracy, average over multiple samples, and create nice figures for a blog post.
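
For the general shape, something like this (a sketch only, not the linked utils; `benchmark` and the metric names are placeholders — accuracy, e.g. WER against reference transcripts, would be collected separately and averaged the same way):

```python
import statistics
import time

def benchmark(fn, samples, repeats: int = 5):
    """Run fn over each sample `repeats` times and report per-call
    latency statistics, averaged over all samples."""
    latencies = []
    for sample in samples:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample)
            latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],
    }
```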
