
Conversation


@zianglih zianglih commented Jan 24, 2026

@HumansAnd

WIP. This PR depends on sgl-project/sglang#17449

@gemini-code-assist
Contributor

Summary of Changes

Hello @zianglih, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new Python utility script that converts Hugging Face models stored in safetensors format with BF16 or FP16 weights into the more memory-efficient MXFP8 format. The conversion reduces the model's memory footprint, which benefits deployment and inference on hardware with native FP8 support. The script identifies which weights can be quantized, applies MXFP8 quantization with UE8M0 scales, and writes out the converted tensors together with the updated metadata.

Highlights

  • New MXFP8 Conversion Script: A new Python script, tools/convert_hf_to_mxfp8.py, has been added to convert Hugging Face safetensors models from BF16/FP16 precision to the more memory-efficient MXFP8 format.
  • Group-wise Quantization with UE8M0 Scales: The script implements group-wise MXFP8 quantization using UE8M0 scales, mirroring the approach used in sglang but keeping the scales in an unswizzled group layout for serialization (a minimal sketch of the scheme follows this list).
  • Selective Weight Quantization: Logic is included to selectively quantize weights, skipping specific layers such as layernorms, embeddings, router, and lm_head to preserve model integrity and performance where high precision is critical.
  • Metadata and Configuration Updates: The conversion process automatically updates the config.json and model.safetensors.index.json files with relevant quantization metadata, including a list of modules that were intentionally not converted.
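
For context, the sketch below illustrates what group-wise MXFP8 quantization with UE8M0 scales generally looks like: values are cast to FP8 E4M3 in groups of 32 along the last dimension, and each group shares a power-of-two scale whose exponent is stored as one biased uint8 (E8M0) per group. This is an independent illustration, not the PR's code; the script itself calls sglang's mxfp8_group_quantize, and details such as the exponent rounding mode may differ. The function and variable names here are hypothetical.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def mxfp8_group_quantize_sketch(w: torch.Tensor, group_size: int = 32):
    """Quantize w to FP8 E4M3 with one power-of-two (UE8M0) scale per group."""
    assert w.shape[-1] % group_size == 0, "last dim must be divisible by the group size"
    g = w.float().view(*w.shape[:-1], -1, group_size)        # [..., n_groups, 32]
    amax = g.abs().amax(dim=-1, keepdim=True).clamp_(min=1e-12)
    # UE8M0 scale: round the per-group scale up to a power of two so that
    # amax / scale fits inside the E4M3 representable range.
    exp = torch.ceil(torch.log2(amax / FP8_MAX))
    q = (g / torch.exp2(exp)).clamp_(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    # Store only the biased exponent, one uint8 per group of 32 values.
    scale_ue8m0 = (exp.squeeze(-1) + 127).clamp_(0, 255).to(torch.uint8)
    return q.view(w.shape), scale_ue8m0

Keeping the scales in this flat [..., n_groups] layout, rather than a kernel-specific swizzled layout, is what the highlight above refers to as the unswizzled group layout used for serialization.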

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new script for converting Hugging Face models to MXFP8 format. The script is well-structured and covers the main steps of conversion, including quantizing weights, handling non-quantized layers, and updating model configuration files. My review includes suggestions to improve robustness by using context managers for file operations, to improve efficiency by choosing more appropriate data structures, and to improve code clarity and maintainability. I've also pointed out a potential issue with dynamic path modification for imports.


config_path = os.path.join(input_path, "config.json")
if os.path.exists(config_path):
    cfg = json.load(open(config_path))

high

Opening files without a with statement can lead to resource leaks if an exception occurs before the file is closed. It's best practice to use a with block to ensure files are always closed correctly. This applies here and on lines 161 and 167.

For example, this line could be rewritten as:

with open(config_path, 'r') as f:
    cfg = json.load(f)
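
The same pattern applies to the JSON writes mentioned above. The exact statements on lines 161 and 167 are not visible in this hunk, so the names below are placeholders, but the shape of the fix would be:

with open(config_path, "w") as f:
    json.dump(cfg, f, indent=2)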

Comment on lines +27 to +32
except ImportError:
    repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
    sglang_path = os.path.join(repo_root, "sglang", "python")
    if sglang_path not in sys.path:
        sys.path.append(sglang_path)
    from sglang.srt.layers.quantization.fp8_utils import mxfp8_group_quantize

medium

Dynamically modifying sys.path can make dependency management fragile and less explicit. It's generally better to rely on standard Python packaging practices. Consider instructing users to install the project in editable mode (e.g., pip install -e .) or to set their PYTHONPATH environment variable. If you must keep this dynamic import, consider adding a comment explaining why it's necessary and how it works.
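
If the dynamic fallback is kept, the suggested explanatory comment could look roughly like the following. The try branch is not shown in this hunk, but presumably imports the same function; the comment text is only an illustration.

try:
    from sglang.srt.layers.quantization.fp8_utils import mxfp8_group_quantize
except ImportError:
    # Fallback for running the script from a source checkout where sglang is
    # not installed: assume the <repo_root>/sglang/python layout and add it to
    # sys.path so the package can be imported directly from source.
    repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
    sglang_path = os.path.join(repo_root, "sglang", "python")
    if sglang_path not in sys.path:
        sys.path.append(sglang_path)
    from sglang.srt.layers.quantization.fp8_utils import mxfp8_group_quantize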

Comment on lines +74 to +75
if k % 32 != 0:
    raise ValueError(f"Last dim {k} must be divisible by 32 for MXFP8.")

medium

This check for divisibility by 32 is already performed in the should_quantize function before this function is called. This makes the if condition and ValueError redundant in the current control flow. To maintain this as a precondition for the function, consider changing it to an assert. This documents the assumption without adding runtime overhead in production builds (if assertions are disabled) and makes the code's intent clearer.

    assert k % 32 == 0, f"Last dim {k} must be divisible by 32 for MXFP8."

Comment on lines 88 to 94
        self.modules_to_not_convert: List[str] = []

    def add_result(self, filename: str, q_weights: Dict[str, torch.Tensor], module_names: List[str]) -> None:
        for key, tensor in q_weights.items():
            self.weight_map[key] = filename
            self.total_size += tensor.numel() * tensor.element_size()
        self.modules_to_not_convert.extend(module_names)

medium

The modules_to_not_convert attribute is a list that can grow large and contain duplicates, which are then removed later using list(set(...)). It's more efficient to use a set from the start to store unique module names. This avoids unnecessary memory usage and makes the intention clearer.

Note: You'll need to import Set from typing.

After this change, you'll also need to update line 155 to quantization_config['modules_to_not_convert'] = sorted(list(result_collector.modules_to_not_convert)) to ensure deterministic output.

        self.modules_to_not_convert: Set[str] = set()

    def add_result(self, filename: str, q_weights: Dict[str, torch.Tensor], module_names: List[str]) -> None:
        for key, tensor in q_weights.items():
            self.weight_map[key] = filename
            self.total_size += tensor.numel() * tensor.element_size()
        self.modules_to_not_convert.update(module_names)
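
Spelled out, the other two changes this comment calls for would be roughly as follows (wherever the quantization config is assembled, around line 155). Note that sorted already returns a list, so the extra list() wrapper is optional.

from typing import Dict, List, Set

quantization_config["modules_to_not_convert"] = sorted(result_collector.modules_to_not_convert)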

@zianglih changed the title from "Add mxfp8 conversion script." to "Implement Blackwell MXFP8 recipe" on Jan 24, 2026