


There are many use cases, such as text completion and image generation or classification, that demand on-device inference and cannot rely on communication with large foundation models like Claude or Gemini. This limitation can arise from privacy requirements, a lack of internet connectivity, or the need for fast on-device inference to provide a better user experience.

In some cases, invoking a foundation model for the task would be overkill. Consider a small bot running on an embedded ARM device that only needs to perform image segmentation to traverse its environment, without communicating with a server. Carrying out this segmentation on-device would drastically improve the bot's speed and reliability.

Another concern is cost. Hosting even a small model backed by GPUs for inference is expensive, with prices ranging from $0.50 to $0.70 per GPU-hour, if you can get the hardware at all. CPUs, however, are readily available in large quantities, are much cheaper than GPUs, and can be used for other tasks as well, such as hosting web servers.

With Kaoken, we attempt to apply standard optimization techniques to speed up inference of small models on the CPU and broaden the options available to developers deploying smaller models.
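To make "standard optimization techniques" concrete, here is a minimal sketch of one representative approach: int8 weight quantization for a dot product, which cuts memory traffic (usually the CPU inference bottleneck) roughly 4x versus float32. This is an illustrative example only, not Kaoken's actual implementation; the function names `quantize` and `dot_q8` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize a float weight vector to int8 with a single symmetric scale.
// Returns the scale so the original values can be approximately recovered.
static float quantize(const std::vector<float>& w, std::vector<int8_t>& q) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    float scale = max_abs / 127.0f;
    if (scale == 0.0f) scale = 1.0f;  // avoid division by zero for all-zero rows
    q.resize(w.size());
    for (size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
    return scale;
}

// Dot product of quantized weights with a float activation vector.
// The simple, dependency-free inner loop is friendly to compiler
// auto-vectorization (SSE/AVX on x86, NEON on ARM).
static float dot_q8(const std::vector<int8_t>& q, float scale,
                    const std::vector<float>& x) {
    float acc = 0.0f;
    for (size_t i = 0; i < q.size(); ++i)
        acc += static_cast<float>(q[i]) * x[i];
    return acc * scale;
}

int main() {
    std::vector<float> w = {0.12f, -0.53f, 0.87f, -0.02f};
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};

    std::vector<int8_t> q;
    float scale = quantize(w, q);

    printf("float32 dot: %f\n", 0.12f * 1 + -0.53f * 2 + 0.87f * 3 + -0.02f * 4);
    printf("int8    dot: %f\n", dot_q8(q, scale, x));
    return 0;
}
```

The same idea scales to whole weight matrices: quantized rows are streamed through the cache far faster than their float32 counterparts, and the per-row scale keeps the accuracy loss small for most small-model workloads.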
