
Trying to replicate experiments & torchao comparison #43

@thepalbi

Hi team! I'm trying to replicate the experiments in [1], so that once I have that set up I can run some experiments of my own to try out ideas with MX formats. First of all, a few operational questions:

  1. Let's say my configuration uses MXFP8 (E4M3) for activations and weights, and bf16 for accumulation and elementwise ops. I plan to take the PTQ approach: finetune a bit in MX and then evaluate my test model (BERT). In that case, starting from a pre-trained Hugging Face model, should I keep the model in FP32 and let the library run the bf16 and MX quantization steps, or can (or must) I use bf16 natively, as transformers allows in their `TrainingArguments` (see the screenshot below and the sketch after this list)?
  [Screenshot of the `bf16` option in Hugging Face `TrainingArguments`]
  2. Same question as above if I were to evaluate direct-cast instead: should I start from the pre-trained checkpoint, finetune in FP32, and then evaluate with MX?

  3. Before picking microxcaling as my modelling library, I found that there is some work in pytorch/ao to add MX support (see [2]). What's your take on that? My impression is that, rather than modelling MX with numerical accuracy, it is aimed more at production support, i.e. running MX on specific hardware (like Blackwell's tensor cores).
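
To make question 1 concrete, here is roughly what I had in mind. This is only a sketch based on my reading of the microxcaling README; the exact `mx_specs` keys (`w_elem_format`, `a_elem_format`, `block_size`, `bfloat`) and whether they really map to the configuration I described are part of what I'm asking, and `bert-base-uncased` / `output_dir` are just stand-ins for my actual setup.

```python
from transformers import AutoModelForSequenceClassification, TrainingArguments
from mx import finalize_mx_specs, mx_mapping

# Intended configuration: MXFP8 (E4M3) elements for weights and activations,
# bfloat16 for accumulation / elementwise ops (keys assumed from the README).
mx_specs = finalize_mx_specs({
    "w_elem_format": "fp8_e4m3",
    "a_elem_format": "fp8_e4m3",
    "block_size": 32,
    "bfloat": 16,
})

# Swap torch.nn / torch.nn.functional ops for their MX-simulated versions
# before building the model, so the injected layers pick up the spec.
mx_mapping.inject_pyt_ops(mx_specs)

# Option A (my current plan): load the pre-trained FP32 checkpoint as-is and
# let the library simulate the bf16/MX quantization inside the injected ops.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Option B (the alternative I'm asking about): additionally train natively in
# bf16 via the Hugging Face flag shown in the screenshot above.
training_args = TrainingArguments(output_dir="mx-finetune", bf16=True)
```

For the direct-cast case in question 2, I would do the same injection but only around evaluation, keeping the finetuning itself in FP32.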

Thanks, great work, and the papers are super good reads!

[1] B. D. Rouhani et al., “Microscaling Data Formats for Deep Learning,” arXiv:2310.10537, Oct. 19, 2023. doi: 10.48550/arXiv.2310.10537.

[2] https://github.com/pytorch/ao/tree/main/torchao/prototype/mx_formats
