
Conversation

@bnogas bnogas commented Dec 17, 2025

Proposed changes

Summary

While running Flux on a 3/7 MIG partition of our H200 GPUs, we observed that execution was limited to 5 streams per engine instance. To address this, I added an override to the default configuration to allow higher concurrency.

2025-12-16T20:56:51.380385855Z  WARN impeller::charmer::lib: Setting max-streams=5 because we couldn't read GPU RAM max_streams=5 error=the current user does not have permission to perform this operation

Notes

It’s currently unclear how Flux determines available memory. This limitation may be related to differences in the MIG-specific API or another underlying issue, and may require further investigation.
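As a rough way to narrow this down, the sketch below queries GPU memory through NVML both on the parent GPU handle and on the per-MIG-device handles. This is purely diagnostic and assumes the nvidia-ml-py (pynvml) bindings are available in the pod; it is not part of the Deepgram Engine or this chart, and which call Engine uses internally is not confirmed.

```python
# Diagnostic sketch (not part of the Deepgram engine): compare the NVML
# memory query on the parent GPU handle with the per-MIG-device query.
# Assumes the nvidia-ml-py package (import name pynvml) is installed.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Inside a MIG slice this call can fail with NVML_ERROR_NO_PERMISSION,
    # which would match the "Unable to obtain GPU maximum memory" warning.
    try:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print("parent GPU total memory (bytes):", mem.total)
    except pynvml.NVMLError as err:
        print("parent GPU memory query failed:", err)

    # If MIG is enabled, the per-slice memory is reported on the MIG device
    # handles rather than on the parent GPU handle.
    try:
        current_mode, _pending = pynvml.nvmlDeviceGetMigMode(handle)
    except pynvml.NVMLError:
        current_mode = pynvml.NVML_DEVICE_MIG_DISABLE
    if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
        for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
            except pynvml.NVMLError:
                continue  # MIG slot not populated
            mig_mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"MIG device {i} total memory (bytes):", mig_mem.total)
finally:
    pynvml.nvmlShutdown()
```

If the parent-handle query fails with a permission error while the MIG device handles report the slice's memory correctly, that would support the MIG-specific API explanation above.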

Types of changes

What types of changes does your code introduce to the Deepgram self-hosted resources?
Put an x in the boxes that apply

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update or tests (if none of the other choices apply)

Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING doc
  • I have tested my changes in my local self-hosted environment
    • by running this version of the helm chart and stress testing flux
  • I have added necessary documentation (if appropriate)

Further comments

@bnogas bnogas requested review from a team and therealevanhenry as code owners December 17, 2025 21:06
@jkroll-deepgram (Contributor) commented

@bnogas Deepgram doesn't officially support MIG partitions, and it looks like the underlying issue here is that your GPU isn't being detected, so Deepgram is falling back to our low CPU default of 5 streams.

Are you getting better performance out of raising the max_streams?

If you check your Engine logs, do you see a startup line like INFO impeller::config: Using devices: Gpu(0) (indicating the GPU is being used), or is it running on CPU?

@bnogas (Author) commented Jan 26, 2026

@jkroll-deepgram It uses the GPU:
kubectl logs -n core deepgram-engine-bd9768cc-pvrvr | grep -i Gpu

2026-01-26T23:13:47.516123726Z  INFO impeller::config: Using devices: Gpu(0)
2026-01-26T23:13:53.507403716Z  WARN impeller: Unable to obtain GPU maximum memory! err=NoPermission gpu_id=0 gpu_name="NVIDIA H100 80GB HBM3"
2026-01-26T23:13:53.507413208Z  INFO impeller: Setting GPU model cache size based on auto lookup table. gpu_id=Gpu(0) gpu_name="Unknown" gpu_memory_size=0 gpu_cache_size=2

I believe there is a difference in the API call used to get gpu_memory_size when MIG is enabled.

Are you getting better performance out of raising the max_streams?

Yes, we have stress tested up to 100 streams on a single engine with a 3/7 MIG partition of an H100.

The other PR adds support for MIG partitions.

