What is Machine Learning Data?
I’ve just finished reading the post Why Machine Learning Projects Fail: Data by @timhuckaby – insightful and informative with a healthy dose of wit (much like the man himself). Tim does a masterful job of illustrating why the data required to train an ML engine is distinct from most any other kind of data – and why it’s often extremely difficult to produce that data. This had a particular resonance with me because it feels like one more example of a larger phenomenon – one that is not unique to machine learning and not even all that new.
SNL's Not Ready for Prime Time Players challenged us with "Is it a dessert topping or a floor wax?" Whatever your answer, most everyone agreed that we don't want one thing to be both. Do we really want to treat ML data like the rest of our data when it is certainly something more?
20+ years ago: Digital assets are more like our other assets than our other digital content. We spent a lot of time explaining that what made paper money special was that it was money – not that it was paper (rip a $10 bill in half, and you don’t have two $5’s). The “asset” property completely overwhelms the “paper” property such that paper money is created, exchanged, stored, etc. in a completely different manner than the rest of our paper. As this “asset” dynamic extended to digital assets (rather than digital files), Digital Asset Management (vs. document management) was born.
10+ years ago: Software code is more like our other IP than our other digital content. We spent a lot of time explaining that what made a programmer’s code special was that it encoded logic and IP – not that it was human-readable content (it’s why copyright protection alone is so inadequate and why governance extends to code derivatives, e.g., runtime binaries). The “code” property completely overwhelms the “content” property such that computer code is created, exchanged, stored, etc. in a completely different manner than the rest of our content. Voila! Source code management, build automation, and DevOps have all grown out of the practical and material requirements for the care and feeding of code over its complete lifecycle.
Today: We are spending a lot of time talking about ML data requirements – but as long as we talk about these requirements as “data” requirements and not what they really are – requirements that go beyond data – we will struggle (and mostly fail) to reliably avoid the failures that Tim so eloquently describes. My sense is that ML data (data that can be used to effectively train an ML engine) is more than data. ML data has another property (analogous to “asset” or “logic”) that will ultimately overwhelm its “data” property such that we will create, exchange, and store ML data in a manner quite different from the way we handle the rest of our data.
What is this new property? Well, that’s a very interesting question (to me at least). ML data is not exactly code in the traditional sense of the word, but it’s closer to code than it is to the conventional datasets used to test code.
I’ve already written about how technical debt can apply to ML data (Machine Learning and Technical Debt), and the “specialness” of ML data was at the core of a policy piece I co-authored (Machine Learning and Medical Devices: Connecting practice to policy, pg. 4).
ML data is a kind of code (it's code-ish) – but its “logic” is not explicitly expressed (or even predetermined) and cannot be statically analyzed. It’s why debugging ML data for bias and other defects is so difficult.
Is this an insurmountable obstacle? I prefer to see it as a massive opportunity.