Making the data you need is a differentiator in Gen AI

Submitted by
Style Pass
2024-04-13 18:30:05

The progress we’re seeing in Gen AI is skewed heavily towards what can be done without getting up from the computer. This means optimizing training algorithms and, at best, automated data curation. Most are “takers” – stuck with what data is already available and trying to make the most of it. Companies that are winning, particularly OpenAI, are making the data sets they need. A focus on data, particularly data that you need to leave your desk to get, is what’s going to continue to differentiate Gen AI offerings.

At my last company, we were improving computer vision AI for manufacturing problems. The challenge was a lack of data. You’d think manufacturers would have lots, but it often ended up being millions of pictures of the same thing over and over again. This makes it challenging to build a robust model that handles edge cases. We spent a lot of time and effort trying to algorithmically engineer better models, and made incremental but limited progress. We tried synthetic images, adding data from public data sets, better pre-training, techniques to force models to generalize, and more. Then one day I got a camera and started taking pictures myself. I was interested in “surface defects” like cracks, scratches, discoloration, chips, etc. So I started by going outside and taking pictures of anything that looked like a defect.

Almost immediately, with a few dozen good images collected and labeled, I could build a model that generalized better than anything we’d tried before. I spent a bit more money, bought some random metal parts from the hardware store and some craft supplies, and created, photographed, and labeled a whole set of images of the defects I was interested in. From those I built a set of models that outperformed anything I’d seen and could identify defects in different classes with no input data from the customer. (Without going on too much of a tangent, it’s fun to note that the GPU cost of generating specialized training data was often as much as or more than the cost of going out and building physical examples of what I wanted.)
