Question Regarding the FP8 Dequantization Method / Scales

#43 · opened by cuimou

First off, thank you so much for sharing this incredible work with the community! I'm currently experimenting with the FP8-quantized version of your model and have a technical question about the dequantization process.

I've been inspecting the .safetensors file to better understand the quantization scheme. My analysis of the file header confirms that the weight tensors are indeed stored in F8_E4M3 format. However, I'm having difficulty locating the corresponding dequantization scale factors needed to restore the weights to a higher precision format like BF16.
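For context, this is roughly the check I ran against the raw header. The file path is just a placeholder for the actual checkpoint shard; the header layout (an 8-byte little-endian length followed by a JSON block) is the standard safetensors format:

```python
import json
import struct

# Placeholder path; substitute the actual FP8 checkpoint shard.
path = "model.safetensors"

with open(path, "rb") as f:
    # safetensors files start with an 8-byte little-endian header length,
    # followed by a JSON header describing every tensor.
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))

# File-level metadata (if any) lives under the special "__metadata__" key.
print("file metadata:", header.get("__metadata__"))

# Count the dtypes of all tensors to confirm they are F8_E4M3.
dtypes = {}
for name, info in header.items():
    if name == "__metadata__":
        continue
    dtypes[info["dtype"]] = dtypes.get(info["dtype"], 0) + 1
print("dtype counts:", dtypes)
```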

Specifically, I've checked for two common storage methods:

  1. Header Metadata: I parsed the header to see if the scales were stored in the file-level `__metadata__` block (safetensors does not attach metadata to individual tensors), but this field appears to be absent.
  2. Separate Scale Tensor: I also scanned the list of all tensors in the file, looking for a separate tensor stored in a high-precision format (like F32 or F16) that might contain the scales for all the other tensors (see the sketch after this list). However, my inspection shows that all tensors in the file are of the F8_E4M3 dtype.
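To make point 2 concrete, here is a minimal sketch of the scan I described. It assumes a recent PyTorch build with `float8_e4m3fn` support, and the path is again a placeholder:

```python
import torch
from safetensors import safe_open

# Placeholder path; substitute the actual FP8 checkpoint shard.
path = "model.safetensors"

with safe_open(path, framework="pt") as f:
    print("file metadata:", f.metadata())  # None if no __metadata__ block
    for name in f.keys():
        tensor = f.get_tensor(name)
        # Flag anything that could hold scales: a name containing "scale"
        # or any dtype other than float8_e4m3fn.
        if "scale" in name.lower() or tensor.dtype != torch.float8_e4m3fn:
            print(name, tensor.dtype, tuple(tensor.shape))
```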

This leads me to my question: Could you please shed some light on the intended method for dequantizing these FP8 weights?

Is there a novel technique at play, a global scale factor I'm missing, or perhaps an implicit method for deriving the scales? Any information you could provide would be immensely helpful for me and others in the community looking to correctly load and utilize this highly efficient model.
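For what it's worth, if there is a per-tensor (or global) scale that I'm simply failing to locate, my understanding is that dequantization would reduce to something like the sketch below. The `scale` here is purely hypothetical, and the 1.0 example corresponds to a plain upcast, which is all I could do with the file as I currently understand it:

```python
import torch

def dequantize_fp8(weight_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical dequantization: upcast FP8 (E4M3) weights to BF16 and
    multiply by a per-tensor (or per-channel) scale factor."""
    return weight_fp8.to(torch.bfloat16) * scale.to(torch.bfloat16)

# Made-up per-tensor scale of 1.0, i.e. a direct cast, which is effectively
# what I'd have to assume if no scales are stored anywhere in the file.
w_fp8 = torch.randn(4, 4).to(torch.float8_e4m3fn)
w_bf16 = dequantize_fp8(w_fp8, torch.tensor(1.0))
print(w_bf16.dtype, w_bf16.shape)
```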

Thank you again for your time and for your fantastic contribution!

Best regards.
