Submitted by
Style Pass
2025-01-04 22:30:06

Llama3.cu is a CUDA-native implementation of the LLaMA3 architecture for causal language modeling. The core principles of the transformer architecture, from the papers "Attention Is All You Need" and "LLaMA: Open and Efficient Foundation Language Models", are implemented as custom CUDA kernels, enabling scalable parallel processing on NVIDIA GPUs.

The models are expected to be downloaded from Hugging Face. The weights are stored as BF16 parameters in .safetensors files and, while being loaded onto the CUDA device, are converted to FP16 through an FP32 intermediate. Hence, a CUDA device with a minimum of 24 GB of VRAM must be used.
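To make the BF16 to FP32 to FP16 path concrete, here is a minimal host-side C++ sketch of the bit manipulation involved. It is not the project's loader: the function names `bf16_to_fp32`, `fp32_to_fp16`, and `bf16_to_fp16` are hypothetical, and on the device the same conversion would more typically go through CUDA intrinsics such as `__bfloat162float` and `__float2half`. The sketch assumes raw 16-bit words as stored in the weight file and round-to-nearest-even when narrowing.

```cpp
#include <cstdint>
#include <cstring>

// BF16 is the upper 16 bits of an IEEE-754 binary32, so widening is exact.
float bf16_to_fp32(std::uint16_t bf16_bits) {
    const std::uint32_t bits = static_cast<std::uint32_t>(bf16_bits) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Narrow FP32 to FP16 bits with round-to-nearest-even (illustrative helper).
std::uint16_t fp32_to_fp16(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const std::uint32_t sign = (bits >> 16) & 0x8000u;
    const std::uint32_t mag = bits & 0x7FFFFFFFu;
    std::uint32_t half;

    if (mag > 0x7F800000u) {            // NaN -> quiet NaN
        half = 0x7E00u;
    } else if (mag >= 0x47800000u) {    // >= 65536 or Inf -> Inf
        half = 0x7C00u;
    } else if (mag >= 0x38800000u) {    // normal FP16 range (>= 2^-14)
        const std::uint32_t rebased = mag - 0x38000000u;  // rebias 127 -> 15
        half = rebased >> 13;                             // keep 10 mantissa bits
        const std::uint32_t rem = rebased & 0x1FFFu;
        // A carry out of the mantissa correctly bumps the exponent (up to Inf).
        if (rem > 0x1000u || (rem == 0x1000u && (half & 1u))) ++half;
    } else if (mag >= 0x33000000u) {    // FP16 subnormal range (>= 2^-25)
        const std::uint32_t shift = 126u - (mag >> 23);   // 14..24
        const std::uint32_t mant = (mag & 0x007FFFFFu) | 0x00800000u;
        half = mant >> shift;
        const std::uint32_t rem = mant & ((1u << shift) - 1u);
        const std::uint32_t halfway = 1u << (shift - 1u);
        if (rem > halfway || (rem == halfway && (half & 1u))) ++half;
    } else {                            // too small: becomes signed zero
        half = 0u;
    }
    return static_cast<std::uint16_t>(sign | half);
}

// The two-step path described above: BF16 -> FP32 -> FP16.
std::uint16_t bf16_to_fp16(std::uint16_t bf16_bits) {
    return fp32_to_fp16(bf16_to_fp32(bf16_bits));
}
```

BF16 carries 7 mantissa bits against FP16's 10, so any BF16 value inside FP16's normal range converts without rounding error. The trade-off is range: BF16 magnitudes of 65536 and above become infinity, and very small ones become subnormal or zero.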

For this inference engine to work, the SafeTensors-formatted file(s) of the Llama3-8b model must be stored in the ./model_weights/ folder. Head to the meta-llama/Llama-3.1-8B-Instruct repo on Hugging Face to get access to the model. Additionally, generate a Hugging Face token so that the next step can successfully download the weight files.

Once the Docker container has started up, run the following command to store the Hugging Face token as an environment variable, replacing <your_token> with the token you generated.
