Conversation

@arunasrivastava
Contributor

PR for decoupled streaming logic

src/server.py Outdated
# Constants
DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320 # 20ms at 16kHz
Member

Nit: the 320 (stride) and the 400 (receptive field) don't have to be hardcoded. You can calculate them like this:

import torch
from transformers import Wav2Vec2ForCTC

def calculate_cnn_window(model: Wav2Vec2ForCTC):
    """Derive the feature extractor's receptive field and stride
    (in samples) from its conv layers instead of hardcoding 400/320."""
    receptive_field = 1
    stride = 1
    for conv_layer in model.wav2vec2.feature_extractor.conv_layers:
        assert hasattr(conv_layer, "conv")
        conv = conv_layer.conv
        assert isinstance(conv, torch.nn.Conv1d)
        # Each layer widens the window by (kernel - 1) earlier strides,
        # then multiplies the overall hop by its own stride.
        receptive_field += (conv.kernel_size[0] - 1) * stride
        stride *= conv.stride[0]
    return receptive_field, stride
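
As a sanity check, the same recurrence applied by hand to the standard wav2vec2-base conv config recovers the hardcoded values (the kernel sizes and strides below are an assumption based on the base model; read them off your model's config in practice):

```python
# Same recurrence as calculate_cnn_window, applied to the standard
# wav2vec2-base feature-extractor config (an assumption; check your
# model's conv_kernel / conv_stride), so it runs without loading a model.
kernel_sizes = [10, 3, 3, 3, 3, 2, 2]
conv_strides = [5, 2, 2, 2, 2, 2, 2]

receptive_field = 1
stride = 1
for k, s in zip(kernel_sizes, conv_strides):
    receptive_field += (k - 1) * stride
    stride *= s

print(receptive_field, stride)  # 400 320 -> 25ms window, 20ms hop at 16kHz
```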

Contributor Author

Is there zero performance benefit from hardcoding?

Member

Not in any way that matters. If you compute these constants once on startup and store them in the same module-level variables, you are looking at about 5.7% of a millisecond of extra server startup time. A handful of multiplications like this is so fast that you wouldn't even be able to perceive it if you re-ran the computation on every chunk (no need to do that, though).

Most compiled languages can do some optimizations on compile-time constants (constant folding, division through reciprocal multiplication, etc.), but Python doesn't really do this due to its more interpreted nature. Moreover, even in compiled languages, it is almost never the right move to sacrifice readability and flexibility for the minuscule gain of hardcoding everything. I encourage you to run some experiments and measure the performance yourself.
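
For example, a quick measurement with `timeit` (a minimal sketch; the layer shapes are hardcoded stand-ins so it runs without loading the model):

```python
import timeit

def calc_window():
    # Stand-in for calculate_cnn_window: same arithmetic, no model needed.
    # (kernel, stride) pairs assume the wav2vec2-base conv config.
    receptive_field, stride = 1, 1
    for k, s in [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]:
        receptive_field += (k - 1) * stride
        stride *= s
    return receptive_field, stride

n = 10_000
total = timeit.timeit(calc_window, number=n)
print(f"{total / n * 1e6:.3f} us per call")
```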

Contributor Author

Cool, I will avoid hardcoded values where readable, fast functions can be used instead.

break

buffer = b"" # Clear the buffer
# Process 20ms chunks
Member

The 20ms is the stride. The CNN actually has a receptive field of 25ms (400 samples). By only ever feeding it 20ms chunks and padding them to fit the 25ms (MIN_LEN_SAMPLES), you never allow it to read all the audio it needs to perform equivalently to running the full model once at the end. You'll want to let it collect some multiple of 320 samples + 80 samples in each chunk, or, if you prefer to allow the last CNN receptive field to be incomplete, you should recompute it when more audio comes in.
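
One way to sketch that buffering (hypothetical names; `take_processable` and the constants are placeholders, not code from this PR): accumulate incoming bytes, hand the model the largest span covering whole receptive fields, and carry the overlapping tail forward for the next call.

```python
RECEPTIVE_FIELD_SAMPLES = 400  # 25ms at 16kHz (hypothetical constants)
STRIDE_SAMPLES = 320           # 20ms at 16kHz

def take_processable(buffer: bytes, bytes_per_sample: int = 2):
    """Split the buffer into a chunk the CNN can fully consume and a
    remainder to keep for the next call. The chunk covers n complete
    receptive fields: (n - 1) * stride + receptive_field samples."""
    n_samples = len(buffer) // bytes_per_sample
    if n_samples < RECEPTIVE_FIELD_SAMPLES:
        return b"", buffer  # not enough audio for even one full window
    n_frames = 1 + (n_samples - RECEPTIVE_FIELD_SAMPLES) // STRIDE_SAMPLES
    used = (n_frames - 1) * STRIDE_SAMPLES + RECEPTIVE_FIELD_SAMPLES
    chunk = buffer[: used * bytes_per_sample]
    # Keep the tail the next window needs: it overlaps the processed
    # chunk by receptive_field - stride (80) samples.
    keep_from = n_frames * STRIDE_SAMPLES
    remainder = buffer[keep_from * bytes_per_sample :]
    return chunk, remainder
```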

DEBUG = False
NUM_SECONDS_PER_CHUNK = 0.5
CHUNK_SIZE_SAMPLES = 320 # 20ms at 16kHz
TRANSFORMER_INTERVAL = 30
Member

We'll definitely want this to be adaptive, though it's not an immediate priority. Make sure you write good benchmarking code first so you can measure whether each change is an improvement. Then we can also combine it with the VAD and local agreement optimizations.

Member

Wrote some suggested utils for benchmarking code here. Make sure to add accuracy, average over multiple samples, and create nice figures for a blog post.
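
For the general shape, something like this (a sketch only, not the linked utils; `benchmark` and the metric names are placeholders — accuracy, e.g. WER against reference transcripts, would be collected separately and averaged the same way):

```python
import statistics
import time

def benchmark(fn, samples, repeats: int = 5):
    """Run fn over each sample `repeats` times and report per-call
    latency statistics, averaged over all samples."""
    latencies = []
    for sample in samples:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample)
            latencies.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(latencies),
        "p95_s": statistics.quantiles(latencies, n=20)[-1],
    }
```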
