Hi team! I'm trying to replicate the experiments in [1], so that once I have that set up, I can run some of my own to try out some ideas with MX formats. First of all, I have a couple of operational questions:
- Let's say my configuration is MXFP8 (E4M3) for activations and weights, and BF16 for accumulation and elementwise ops. I'm going to try the PTQ approach: finetune a bit in MX, then evaluate my test model (BERT). In that case, if I start from a pre-trained model from Hugging Face, should I keep the model in fp32 and let the library run the BF16 and MX quantization steps, or do I have to (or can I) use bf16 natively, as `transformers` allows in its `TrainingArguments` (see the first sketch below)?
- Same question if I were to evaluate direct-cast: should I start from the pretrained model, finetune in FP32, and then evaluate with MX? (See the second sketch below.)
- Before picking microxcaling as my modelling library, I found that there's some work in pytorch/ao to add MX support (see [2], and the third sketch below). What's your take on that? I get the feeling that, rather than modelling MX with numerical accuracy, it's more inclined toward production support, i.e. running MX on specific hardware (like Blackwell's tensor cores).
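
For the first question, here's roughly the setup I have in mind; a minimal sketch where the `mx_specs` keys follow the microxcaling README, and the exact spec values are my guess for the config above:

```python
from transformers import TrainingArguments
from mx import mx_mapping, finalize_mx_specs

# Intended config: MXFP8 (E4M3) for weights and activations,
# BF16 for elementwise ops and accumulation.
mx_specs = finalize_mx_specs({
    'w_elem_format': 'fp8_e4m3',
    'a_elem_format': 'fp8_e4m3',
    'block_size': 32,
    'bfloat': 16,  # BF16 vector/elementwise ops
})

# Replace torch ops with their MX-simulating counterparts
# before the model is built and finetuned.
mx_mapping.inject_pyt_ops(mx_specs)

# Option A: leave the HF checkpoint in fp32 and let microxcaling
# handle the BF16/MX casts internally.
args_fp32 = TrainingArguments(output_dir='out')

# Option B: native bf16 through transformers.
args_bf16 = TrainingArguments(output_dir='out', bf16=True)
```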
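
And for the direct-cast variant, what I'd do is skip the MX finetuning step entirely; again a sketch using the same assumed spec keys:

```python
from transformers import AutoModelForSequenceClassification
from mx import mx_mapping, finalize_mx_specs

mx_specs = finalize_mx_specs({
    'w_elem_format': 'fp8_e4m3',
    'a_elem_format': 'fp8_e4m3',
    'block_size': 32,
    'bfloat': 16,
})

# Inject before instantiating the model so its ops are the
# MX-simulating versions.
mx_mapping.inject_pyt_ops(mx_specs)

# Direct-cast: load the pretrained (or FP32-finetuned) checkpoint
# and go straight to evaluation, with no finetuning in MX.
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()
```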
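
For context on the pytorch/ao side, the usage I'm looking at is the tensor-level path in the prototype; this is sketched from its README as I read it, so `MXTensor.to_mx` / `to_dtype` and their arguments are my assumption about the current API:

```python
import torch
from torchao.prototype.mx_formats.mx_tensor import MXTensor

x = torch.randn(128, 128, device='cuda', dtype=torch.bfloat16)

# Quantize to MXFP8 (E4M3) with block size 32, then dequantize
# to look at the simulated numerics.
x_mx = MXTensor.to_mx(x, torch.float8_e4m3fn, 32)
x_hp = x_mx.to_dtype(torch.bfloat16)
```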
Thanks! Great work, and the papers are super good reads.
[1] B. D. Rouhani et al., "Microscaling Data Formats for Deep Learning," arXiv:2310.10537, Oct. 2023, doi: 10.48550/arXiv.2310.10537.
[2] https://github.com/pytorch/ao/tree/main/torchao/prototype/mx_formats