Speedrunning ML Ops

Let's quickly learn how to do ML Ops.

2025-01-07

My background is in building custom DevOps workflows for developing and testing AI accelerators, which I guess is a specific kind of ML Ops. But I've never sat down to make a study of it, so here we are.

In DevOps you want to manage the entire developer workflow, from coding to deployment in production. Doing this means automatically tracking the following:

ML Ops is the same, but it also requires managing the machine learning workflow. Whereas DevOps just manages code and code artifacts, ML Ops manages:

These three things are the elements of any ML Ops system. And they compound with each other, meaning you'll have around 3³ = 27 combinations of changes to track.
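One way to read the compounding: if each of the three elements can independently be at one of three versions, the configurations multiply rather than add. A quick sketch (the version labels are made up for illustration):

```python
# Illustrative: three compounding elements, each with three hypothetical
# versions, yield 3**3 = 27 trackable combinations.
from itertools import product

versions = {
    "code":  ["c1", "c2", "c3"],
    "data":  ["d1", "d2", "d3"],
    "model": ["m1", "m2", "m3"],
}
combinations = list(product(*versions.values()))
print(len(combinations))  # 27
```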

Let's assume we have access to GPUs somewhere, like in a data center or cloud cluster. And assume that the code is tracked with git as is common in developer workflows already. There are still two major sources of variance in the workflow that need to be locked down:

In tutorials and docs you often see datasets being pulled from pytorch or sklearn or huggingface directly. This works for tutorials, but for production systems you need to be sure that the data you train with doesn't change out from under you: the content of the data must be hashed and tracked in some stable storage.
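A minimal sketch of that hashing step, assuming the dataset can be read as bytes (the function name and sample data here are illustrative, not from any particular library):

```python
# Content-address the dataset: record the hash once, verify it before training.
import hashlib

def dataset_fingerprint(data: bytes) -> str:
    """Return a stable SHA-256 hex digest of the dataset contents."""
    return hashlib.sha256(data).hexdigest()

snapshot = b"label,text\n0,hello\n1,world\n"   # stand-in for real training data
recorded = dataset_fingerprint(snapshot)       # store this in stable storage
assert dataset_fingerprint(snapshot) == recorded  # data unchanged since recording
```

Before each training run, recompute the fingerprint of the data you actually loaded and compare it to the recorded one; a mismatch means the dataset drifted.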
