Many app developers are interested in building on-device experiences that integrate increasingly capable large language models (LLMs). Running these models locally on Apple silicon enables developers to leverage the capabilities of the user's device for cost-effective inference, without sending data to and from third-party servers, which also helps protect user privacy. To do this, the models must be carefully optimized to make effective use of the available system resources, because LLMs often place high demands on both memory and processing power.
This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real-time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how, using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max at a decoding speed of about 33 tokens/s. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.
We take the official definition and trained weights of the Llama-3.1-8B-Instruct model hosted on Hugging Face. We outline the steps to convert the model to the Core ML format using Core ML Tools, optimize it for on-device inference on a Mac, and benchmark its performance. We use a Mac with M1 Max and specifically target the GPU, because models like Llama-3.1-8B-Instruct are usually constrained by memory bandwidth, and the GPU offers the best combination of compute FLOPS and memory bandwidth on the device of interest.
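To make this workflow concrete, below is a minimal sketch of the conversion path: load the Hugging Face weights, trace a simplified forward pass with PyTorch, and convert the traced graph with Core ML Tools while asking the runtime to prefer the GPU. The `LogitsOnly` wrapper, the fixed sequence length of 64, and the chosen deployment target are illustrative assumptions for this sketch; the full recipe used in the rest of this post (KV cache handling, quantization, and other optimizations) involves additional steps.

```python
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

# Load the official Llama-3.1-8B-Instruct weights from Hugging Face.
model_id = "meta-llama/Llama-3.1-8B-Instruct"
torch_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
torch_model.eval()
torch_model.config.use_cache = False  # keep the traced graph cache-free for this sketch

class LogitsOnly(torch.nn.Module):
    # Thin wrapper (illustrative) so the traced graph returns a plain tensor.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids=input_ids).logits

# Trace with a fixed example sequence length (64 is arbitrary here).
example_input = torch.zeros((1, 64), dtype=torch.int64)
traced = torch.jit.trace(LogitsOnly(torch_model), example_input)

# Convert to Core ML; compute_units expresses a preference for GPU execution.
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
    minimum_deployment_target=ct.target.macOS15,  # assumes a recent coremltools release
    compute_units=ct.ComputeUnit.CPU_AND_GPU,
)
mlmodel.save("Llama-3.1-8B-Instruct.mlpackage")
```

The `compute_units` argument is a scheduling preference rather than a hard constraint: `CPU_AND_GPU` asks Core ML to place operations on the GPU where possible, which suits a memory-bandwidth-bound model like this one.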