3 stream to ipa #13
base: main
Conversation
src/server.py
Outdated
# Constants
DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320  # 20ms at 16kHz
Nit: the 320 (stride) and the 400 (receptive field) don't have to be hardcoded. You can calculate them like this:
def calculate_cnn_window(model: Wav2Vec2ForCTC):
    receptive_field = 1
    stride = 1
    for conv_layer in model.wav2vec2.feature_extractor.conv_layers:
        assert hasattr(conv_layer, "conv")
        conv = conv_layer.conv
        assert isinstance(conv, torch.nn.Conv1d)
        receptive_field += (conv.kernel_size[0] - 1) * stride
        stride *= conv.stride[0]
    return receptive_field, stride
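For reference, running that fold by hand reproduces the hardcoded numbers without loading a model. The kernel sizes and strides below are assumed to match the default wav2vec2-base feature extractor; verify them against your checkpoint's `config.conv_kernel` / `config.conv_stride`:

```python
# Conv config assumed for facebook/wav2vec2-base; check your checkpoint.
KERNELS = [10, 3, 3, 3, 3, 2, 2]
STRIDES = [5, 2, 2, 2, 2, 2, 2]

def cnn_window(kernels, strides):
    receptive_field, stride = 1, 1
    for k, s in zip(kernels, strides):
        receptive_field += (k - 1) * stride
        stride *= s
    return receptive_field, stride

print(cnn_window(KERNELS, STRIDES))  # (400, 320): 25ms and 20ms at 16kHz
```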
Is there zero performance benefit from hardcoding?
Not in any way that matters. If you compute these constants once at startup and store them in the same module-level variables, you are looking at roughly 5.7% of a millisecond of extra server startup time. A handful of multiplications like this is so fast that you wouldn't be able to perceive it even if you re-ran the computation on every chunk (no need to do that, though).

Most compiled languages can optimize compile-time constants (constant folding, division through reciprocal multiplication, etc.), but Python largely does not, given its more interpreted nature. Moreover, even in compiled languages it is almost never the right move to sacrifice readability and flexibility for the minuscule gain of hardcoding everything. I encourage you to run some experiments and measure the performance yourself.
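A quick hypothetical micro-benchmark (not from this PR) illustrates the point: once computed at import time, a module-level constant is accessed through exactly the same name lookup as a hardcoded literal, so per-use cost is indistinguishable:

```python
import timeit

HARDCODED = 320
COMPUTED = 16_000 // 50  # computed once at import time: 20ms at 16kHz

# Both expressions are a module-level name lookup plus a multiply;
# the two timings come out essentially identical.
t_hard = timeit.timeit("HARDCODED * 2", globals=globals(), number=1_000_000)
t_comp = timeit.timeit("COMPUTED * 2", globals=globals(), number=1_000_000)
print(f"hardcoded: {t_hard:.3f}s  computed: {t_comp:.3f}s")
```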
Cool, I will avoid hardcoded values wherever a readable, fast function can be used instead.
break
...
buffer = b""  # Clear the buffer
# Process 20ms chunks
The 20ms is the stride. The CNN actually has a receptive field of 25ms (400 samples). By only ever feeding it 20ms chunks and padding them to fit the 25ms (MIN_LEN_SAMPLES), you never let it read all the audio it needs to perform equivalently to running the full model once at the end. You'll want to let it collect some multiple of 320 samples + 80 samples in each chunk. Alternatively, if you prefer to allow the last CNN receptive field to be incomplete, you should recompute it when more audio comes in.
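A minimal sketch of that buffering rule (the helper name is hypothetical; stride 320 and receptive field 400 at 16kHz are taken from the discussion above): only feed k·320 + 80 samples at a time, so every CNN window the model computes is complete:

```python
STRIDE = 320                          # CNN stride: 20ms at 16kHz
RECEPTIVE_FIELD = 400                 # CNN receptive field: 25ms at 16kHz
OVERHANG = RECEPTIVE_FIELD - STRIDE   # the extra 80 samples each feed needs

def usable_samples(buffered: int) -> int:
    """Largest k*STRIDE + OVERHANG that fits in the buffer, so that
    every receptive field the CNN sees is fully populated."""
    if buffered < RECEPTIVE_FIELD:
        return 0  # not even one complete window yet; wait for more audio
    k = (buffered - OVERHANG) // STRIDE
    return k * STRIDE + OVERHANG

print(usable_samples(400), usable_samples(719), usable_samples(720))
# 400 400 720
```

The remaining `buffered - usable_samples(buffered)` samples stay in the buffer and are prepended to the next chunk.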
DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320  # 20ms at 16kHz
TRANSFORMER_INTERVAL = 30
We'll definitely want this to be adaptive. Not an immediate priority. Make sure you write good benchmarking code first so you can measure whether each change is an improvement. Then we can also combine with the VAD and local agreement optimizations
Wrote some suggested utils for benchmarking code here. Make sure to add accuracy metrics, average over multiple samples, and create nice figures for a blog post.
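Something along these lines for the latency side (a hypothetical minimal version; the actual suggested utils are in the link above, and accuracy would be measured separately):

```python
import statistics
import time

def benchmark_latency(fn, *args, warmup=3, runs=20):
    """Call fn repeatedly and report per-call latency in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches (and any lazy initialization) first
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(times_ms), statistics.stdev(times_ms)

mean_ms, stdev_ms = benchmark_latency(lambda: sum(range(1000)))
```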
PR for decoupled streaming logic