
Automated Data Wrangling

2021-05-27 12:00:03

A growing array of techniques apply machine learning directly to the problems of data wrangling. They often start out as open research projects but then become proprietary. How can we build automated data wrangling systems for open data?

We work with a lot of messy public data. In theory it’s already “structured” and published in machine-readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much.

The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of different kinds of messes, and many of them go well beyond what you can easily fix with a few nifty dataframe manipulations. It seems like we should be able to develop higher-level, more general tools for doing automated data wrangling. Given how much time highly skilled people pour into this kind of computational toil, that seems like it would be very worthwhile.
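For a sense of where the easy end of that taxonomy sits, here’s a minimal sketch of one of those “nifty dataframe manipulations”: reshaping a table that has one column per year into one row per observation. The table, the column names, and the `tidy_years` helper are all made up for illustration and don’t come from any real dataset. The harder messes are the ones no single call like this can touch.

```python
import pandas as pd

# Hypothetical wide-format table: one column per report year.
# The plant IDs and values are invented for illustration only.
wide = pd.DataFrame(
    {
        "plant_id": [1, 2],
        "2019": [100.0, 250.0],
        "2020": [110.0, 240.0],
    }
)


def tidy_years(df: pd.DataFrame) -> pd.DataFrame:
    """Reshape year columns into rows: one row per plant per year."""
    long = df.melt(
        id_vars="plant_id",
        var_name="report_year",
        value_name="net_generation_mwh",
    )
    # The year labels started life as column names, so they are strings.
    long["report_year"] = long["report_year"].astype(int)
    return long.sort_values(["plant_id", "report_year"]).reset_index(drop=True)


tidy = tidy_years(wide)
print(tidy)
```

That one `melt` call handles a very common mess, which is exactly why it isn’t the interesting part of the problem.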

There are multiple open source machine learning frameworks with big user communities. Just in Python there’s at least scikit-learn, PyTorch, and Keras + TensorFlow. There’s also a seemingly endless supply of online courses, blog posts, and tutorials about working with these tools. But to do any good, they all need data. Lots and lots of nice, clean data, and people seem way less excited to talk about how you get that data. So much so that a popular O’Reilly book on the topic is entitled “Bad Data Handbook: Cleaning Up the Data So You Can Get Back to Work” as if cleaning up the data isn’t even part of the work! So much so that earlier this year researchers at Google published a paper entitled “Everyone wants to do the model work, not the data work.”

In particular, the Google researchers looked at how data issues propagate downstream in “high stakes AI” applications, where machines are potentially making life-and-death decisions instead of just trying to get you to click on an ad for a useless gadget. And from the horror stories in Cathy O’Neil’s book Weapons of Math Destruction and the recent film Coded Bias, it seems like there are an awful lot of “high stakes AI” applications quietly running in the background affecting all of us, mostly without our knowledge…
