Reducing CPU usage in Machine Learning model inference with ONNX Runtime | Inworld AI Blog

Optimizing machine learning models to achieve greater levels of efficiency continues to be a challenge for many applied machine learning applications. 

At Inworld, powering our AI-driven characters in real-time conversations and interactions with our 20 machine learning models demands high performance and low inference latency. We are also always looking to reduce our hardware utilization, but we need to do so in ways that don't significantly impact our application's latency. After all, no one wants an AI character that pauses too long before answering!

Recently, while developing a new service built around a small ONNX model, we noticed abnormally high CPU usage and decided to investigate. In the end, we were able to reduce CPU usage from 47% to 0.5% without significantly increasing latency.

In this post, I’ll walk you through the problem we discovered and how we fixed it. For the purposes of this post, all tests are written in Python, but similar improvements can be achieved with any language supported by ONNX Runtime.
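To set the stage, here is a minimal sketch of the kind of Python inference test used throughout the post. The model path, input shape, and thread setting are placeholders, not the actual Inworld service; session options like `intra_op_num_threads` are simply the standard ONNX Runtime knobs that govern how much CPU the runtime's thread pool may consume.

```python
import numpy as np
import onnxruntime as ort

# Session options expose the settings that most directly affect CPU usage,
# such as the size of the intra-op thread pool.
options = ort.SessionOptions()
options.intra_op_num_threads = 1  # assumption: a small model rarely needs more threads

# "model.onnx" is a placeholder path for a small exported model.
session = ort.InferenceSession(
    "model.onnx",
    sess_options=options,
    providers=["CPUExecutionProvider"],
)

# Build a dummy input matching the model's first input; the shape here is illustrative.
input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 128).astype(np.float32)

outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```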
