
Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.

Download pure Q4_0 and (optionally) Q8_0 quantized .gguf files from: https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF
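For example, the pure Q4_0 file can be fetched directly; the exact filename is an assumption based on the repository's naming scheme:

```bash
# Download the pure Q4_0 quantization (filename assumed from the Hugging Face repo)
curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf
```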

In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the output.weights tensor is quantized with Q6_K instead of Q4_0. A pure Q4_0 quantization can be generated from a high-precision (F32, F16, BFLOAT16) .gguf source with the quantize utility from llama.cpp as follows:
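A sketch of that invocation, with placeholder paths; the `--pure` flag tells llama.cpp to skip its mixed k-quant heuristics and quantize every tensor to the target type (in recent llama.cpp builds the binary is named `llama-quantize`):

```bash
# Convert a high-precision .gguf into a pure Q4_0 .gguf
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```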

**Important Note** On GraalVM, the Graal compiler does not yet support the Vector API; run with `-Dllama.VectorAPI=false`, but expect sub-optimal performance. Vanilla OpenJDK 21+, which does support the Vector API, is recommended for now.
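A minimal sketch of such a run, assuming jbang is used as the launcher; the `--model` and `--chat` flags are shown for illustration:

```bash
# On GraalVM: disable the Vector API fast path so inference falls back to scalar code
jbang -Dllama.VectorAPI=false Llama3.java --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf --chat
```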

**Notes** Run on a single CCD, e.g. `taskset -c 0-15 jbang Llama3.java ...`, since inference is constrained by memory bandwidth; a full pinned invocation is sketched below.
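The core range depends on your CPU topology, and the `--model`/`--chat` flags are assumed here for illustration:

```bash
# Pin execution to cores 0-15 (one CCD on many Ryzen parts) to avoid
# cross-CCD memory traffic; adjust the -c range to your topology.
taskset -c 0-15 jbang Llama3.java --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf --chat
```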
