

Submitted by Style Pass
2024-04-30 21:30:37



The bf16 (brain float) encoding has the same number of exponent bits as float32. That makes conversion relatively straightforward, even in the absence of hardware support. For example, converting brain16 to binary32 simply means shifting the 16-bit value left by 16 bits.
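As a rough sketch of why that conversion is cheap, here is a scalar bf16 → float32 helper in C; the bf16_t type and bf16_to_f32 name are hypothetical illustrations, not the identifiers the PR actually adds to GGML.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical bf16 container: just the upper 16 bits of an IEEE binary32. */
    typedef struct { uint16_t bits; } bf16_t;

    /* bf16 -> fp32: place the stored bits in the high half of a 32-bit word
     * and reinterpret. The sign and all 8 exponent bits are preserved exactly. */
    static inline float bf16_to_f32(bf16_t h) {
        uint32_t u = (uint32_t) h.bits << 16;
        float f;
        memcpy(&f, &u, sizeof f);   /* bit-cast without violating strict aliasing */
        return f;
    }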

The issue is that converting weights from bf16 to fp16 throws away 3 bits of exponent range. As a result, there is currently no way to evaluate models like Mistral at full fidelity in llama.cpp without falling back to f32.
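For context, here are the bit layouts involved (standard format definitions, not something introduced by the PR): bf16 keeps float32's full 8-bit exponent, while fp16 trades 3 of those exponent bits for extra mantissa precision and a much smaller dynamic range.

    fp32: 1 sign | 8 exponent | 23 mantissa   (max normal ~3.4e38)
    bf16: 1 sign | 8 exponent |  7 mantissa   (same dynamic range as fp32)
    fp16: 1 sign | 5 exponent | 10 mantissa   (max normal 65504)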

This change fixes that by adding a bf16 data type to GGML. CPU inference is supported, with optimizations for the AVX2, AVX512F, and AVX512BF16 ISAs. Perplexity on Mistral 7B 0.2 improves (drops) by roughly 0.0024 to 0.0046 compared to using fp16.
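To give a sense of what the AVX512BF16 path can look like, here is a hedged sketch of a bf16 dot-product kernel built on the vdpbf16ps instruction. It is illustrative only (it assumes n is a multiple of 32 and a GCC/Clang build with -mavx512bf16) and is not the kernel the PR actually ships, which also handles tails and provides AVX2/AVX512F fallbacks.

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>

    /* Reinterpret a raw 512-bit load as 32 bf16 lanes. */
    static inline __m512bh load_bf16x32(const uint16_t * p) {
        __m512i raw = _mm512_loadu_si512(p);
        __m512bh out;
        memcpy(&out, &raw, sizeof out);
        return out;
    }

    /* Illustrative bf16 dot product: multiply bf16 pairs with vdpbf16ps and
     * accumulate into fp32 lanes, then reduce to a single float. */
    static float dot_bf16(const uint16_t * a, const uint16_t * b, int64_t n) {
        __m512 acc = _mm512_setzero_ps();
        for (int64_t i = 0; i < n; i += 32) {
            acc = _mm512_dpbf16_ps(acc, load_bf16x32(a + i), load_bf16x32(b + i));
        }
        return _mm512_reduce_add_ps(acc);
    }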

