
Submitted by
Style Pass
2024-06-05 17:00:03

99% of the time is spent in Tensor.MatrixVectorMultiply. There are pthreads and CUDA variants, though neither is optimal. The CUDA version spends most of its time uploading tensors, and the pthreads version runs in multiple Lua states spread across cores.

To try the CUDA version, change use_pthreads() to use_cuda() in llama.lua. It depends on libnvrtc for compiling kernels and libcuda for everything else.

I mostly used https://github.com/mukel/llama3.java as a reference. You can find instructions there on how to download "Meta-Llama-3-8B-Instruct-Q4_0.gguf".
