Hi,
thanks for the amazing work you did with this method.
I am trying to save the quantized model for further analysis, but the checkpoint written via the `save_qmodel_path` parameter is the same size as the original model: the weights are still stored in bf16.
I would like to reproduce the inference speedup and memory savings you report, for example, in Table 9 of the paper.
Could you please explain how to save the model with real (packed) INT4 weights?
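For reference, here is the back-of-envelope size arithmetic behind my expectation (illustrative numbers only, assuming a 7B-parameter model and ignoring the overhead of quantization scales and zero-points): a checkpoint with truly packed INT4 weights should be roughly 4x smaller than the bf16 one, which is not what I observe.

```python
# Rough size estimate: bf16 stores 2 bytes per weight, packed INT4 stores
# 4 bits = 0.5 bytes per weight, so real quantization should shrink the
# checkpoint by about 4x (plus a small overhead for scales/zero-points).
num_params = 7_000_000_000          # e.g. a 7B model (illustrative)
bf16_gib = num_params * 2 / 1024**3    # ~13.0 GiB
int4_gib = num_params * 0.5 / 1024**3  # ~3.3 GiB
print(f"bf16: ~{bf16_gib:.1f} GiB, int4: ~{int4_gib:.1f} GiB")
```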