The TriLM ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with IQ2_TN. So, without further ado, lets look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:
The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
Also here the GPU is faster for PP (but just 5X faster), but the CPU wipes the floor with the GPU for TG, beating it close to 2X using all 8 threads, and 1.5X with just 2 threads!
Now that we have efficient Flash Attention (FA) implementation on the CPU via PR #32, we can compare again performance between the CPU and GPU for this tiny 99M parameter model. We get