
How to 10x throughput when serving Hugging Face models without a GPU


In less than 50 lines of code, you can deploy a Bert-like model from the Hugging Face library and achieve over 100 requests per second with latencies below 100 milliseconds for less than $250 a month.

Simple models and simple inference pipelines are much more likely to generate business value than complex approaches. When it comes to deploying NLP models, nothing is as simple as creating a FastAPI server to make real-time predictions.
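As a rough sketch of that starting point, the snippet below wraps a Hugging Face pipeline in a FastAPI endpoint. The model name and route are illustrative placeholders rather than the exact setup used in this post.

```python
# Minimal sketch: a FastAPI server serving a Hugging Face pipeline.
# Model name and endpoint path are illustrative, not prescribed by the post.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# The pipeline bundles tokenization and model inference into a single call.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: PredictRequest):
    # Returns e.g. [{"label": "POSITIVE", "score": 0.99}]
    return classifier(request.text)
```

Run it with `uvicorn main:app` and you have a real-time prediction service in well under 50 lines.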

While GPU-accelerated inference has its place, this blog post focuses on how to optimize a CPU inference service to achieve sub-100-millisecond latency and over 100 requests per second of throughput. One key advantage of a Python inference service over more complex GPU-accelerated deployment options is that tokenization can be built into the service itself, further reducing deployment complexity.
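One common CPU-serving optimization, sketched below under the assumption of a multi-worker uvicorn deployment, is to cap each worker's PyTorch thread pool so that worker processes do not compete for the same cores; this is a general pattern, not necessarily the exact tuning this post arrives at, and the model name is again a placeholder.

```python
# Sketch: keep per-process intra-op threading small when scaling out with
# several server workers (e.g. `uvicorn main:app --workers 4`).
import torch
from transformers import pipeline

# One intra-op thread per worker; throughput then comes from running
# multiple worker processes rather than from a large thread pool per model.
torch.set_num_threads(1)

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Tokenization and inference still happen in one call inside the service.
print(classifier("The tokenizer and model run together in a single call."))
```

For the short sequences typical of real-time NLP requests, per-process parallelism tends to scale throughput better than letting each worker spawn a full thread pool.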

To achieve good performance for CPU inference, we need to optimize our serving framework. We break the post down into:
