COVID-19 and Machine Learning Technical Debt
We know to expect massive financial debt thanks to the COVID-19 pandemic – but what about the technical variety?
Back in October, writing about the opportunity to build ML-based services, I wrote:
“Mining, curating and cleaning data that’s already sitting out there waiting to be found is the new gold rush.”
I still stand by that general sentiment, but now I find myself asking a very different set of questions all relating to the relevancy of “pre-COVID-19 data” in a post-COVID-19 world.
Consider the following real-world scenario - I'll connect it to consumer behavior, supply chain management, medical diagnostics, etc. later.
I trained a model using a decade’s worth of data from a yoga studio to predict which of their current students would be most likely to cancel their auto-pay monthly subscription.
When correlations no longer correlate
In my “curation,” I ended up including attendance, financial history, retail purchases, and a few other types of studio connections – the specifics aren’t relevant here but suffice it to say, my model was trained on a web of behaviors that each correlated with the model's ultimate prediction (will or won’t cancel monthly renewal).
While these factors are all relevant, I really have no idea to what extent any of them were causally (not casually) connected to the predicted outcome. This turns out to be pretty important because, in a post-COVID-19 world, I suspect that many of the “correlations” that made my model work will no longer correlate (at least not to the same degree).
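To make "no longer correlate" concrete, here is a minimal sketch that compares how strongly one feature tracks the outcome before and after a behavior shift. The numbers are invented for illustration (they are not the studio's data), and the names `pearson`, `pre_visits`, `post_visits`, etc. are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: monthly in-studio visits per student, and whether
# that student cancelled (1) or kept (0) their subscription.
pre_visits = [1, 2, 3, 8, 10, 12]
pre_cancelled = [1, 1, 1, 0, 0, 0]   # low attendance went with cancelling

post_visits = [1, 2, 3, 8, 10, 12]
post_cancelled = [0, 1, 0, 1, 0, 0]  # assumed: attendance no longer tells us much

pre_r = pearson(pre_visits, pre_cancelled)
post_r = pearson(post_visits, post_cancelled)

print(f"pre-COVID  correlation: {pre_r:+.2f}")
print(f"post-COVID correlation: {post_r:+.2f}")
```

In this made-up case the attendance signal is strongly negative before the shift and close to zero after it, which is the kind of drift a model trained only on the earlier data would not see.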
I am NOT thinking about direct pandemic fallout, e.g. the current “stay at home” regimen (obviously, all in-studio classes are canceled) or even the near-term spike in unemployment (still direct fallout).
I’m thinking about permanent changes in behavior. For example, it turns out that, in order to support this studio, quite a few yoga students have volunteered to keep being billed even though there are no classes. Separately, the studio, in order to support its students in these highly stressful times, has been streaming free classes via Facebook Live (they can still see who attends). SO???
Knowing the data as I do, I can predict with some confidence that this novel (sorry – no pun intended) financial and attendance activity would change my current trained model – but what I can’t say with any confidence is to what degree the model would change.
OK, you would be right to point out that while this is all based on changing behaviors, it’s still just a few net new properties – and this kind of thing (discovering new properties that change a model’s behavior) happens all the time. I think the “COVID-19 effect” is more pernicious.
Will students increase their appreciation of yoga because of the role it played in mitigating today’s profound emotional stressors? This may invalidate the relative weight of prior attendance in my current model.
Will the near-term economic instability translate into a lasting sense of financial insecurity (as it did for those who lived through the Great Depression)? This may invalidate the relative weight that credit card declines, etc. have in my current model.
I think that, at the very least, my pre-COVID-19 model must be retested and verified against our “new world.” With a little luck, my model will still be “good enough,” but what if it isn’t?
How many months/years of “new data” will I need? Or (perhaps more likely) will I need to rethink my “curation,” and perhaps my selection of ML model and/or its configuration, and simply re-implement (and replace) my current model with an improved one that accounts for both a pre- and a post-COVID-19 world?
In other words, I need to reevaluate my model and be prepared to pay off my newly acquired technical debt.
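The "retest and verify" step above can be sketched as follows: score the frozen, already-trained model on records collected after the shift and compare against a "good enough" bar. Everything here is an assumption for illustration: `frozen_model` is a hand-written stand-in rule (not my actual model), the records are invented, and the `GOOD_ENOUGH` threshold is arbitrary.

```python
def frozen_model(student):
    """Stand-in for the pre-COVID model (hypothetical rule): predict
    cancellation when attendance is low or a card was declined."""
    return student["visits_per_month"] < 4 or student["card_declines"] > 0


def accuracy(model, records):
    """Fraction of records where the prediction matches what happened."""
    correct = sum(1 for r in records if model(r) == r["cancelled"])
    return correct / len(records)


GOOD_ENOUGH = 0.75  # hypothetical acceptance bar

# Invented pre-COVID records: the rule mostly holds.
pre_covid = [
    {"visits_per_month": 2, "card_declines": 0, "cancelled": True},
    {"visits_per_month": 10, "card_declines": 0, "cancelled": False},
    {"visits_per_month": 6, "card_declines": 1, "cancelled": True},
    {"visits_per_month": 12, "card_declines": 0, "cancelled": False},
    {"visits_per_month": 8, "card_declines": 0, "cancelled": True},
]

# Invented post-COVID records: supporters keep paying with no in-studio
# visits, and a declined card no longer signals an intent to leave.
post_covid = [
    {"visits_per_month": 0, "card_declines": 0, "cancelled": False},
    {"visits_per_month": 1, "card_declines": 0, "cancelled": False},
    {"visits_per_month": 9, "card_declines": 1, "cancelled": False},
    {"visits_per_month": 11, "card_declines": 0, "cancelled": False},
    {"visits_per_month": 2, "card_declines": 0, "cancelled": True},
]

pre_acc = accuracy(frozen_model, pre_covid)
post_acc = accuracy(frozen_model, post_covid)
needs_rework = post_acc < GOOD_ENOUGH

print(f"pre-COVID accuracy:  {pre_acc:.2f}")
print(f"post-COVID accuracy: {post_acc:.2f}")
print("retrain or redesign" if needs_rework else "still good enough")
```

Accuracy is only one possible yardstick; for a cancellation model, precision and recall on the "will cancel" class may matter more. The point of the sketch is just that the check is cheap to run once post-shift outcomes exist.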
ML technical debt
This is not a new concept. In the paper “Machine Learning: The High-Interest Credit Card of Technical Debt,” Google Research offers an excellent (IMHO) framework for characterizing ML technical debt that includes the section “Dealing with Changes in the External World” – BUT it is the shortest (and last) section of the paper and covers only 3 of the 14 enumerated classes of technical debt.
In my (granted, cursory) review of other papers on ML DevOps and lifecycle management, I haven’t seen this topic get anywhere near the visibility I suspect it now merits.
In fairness, who could have predicted this kind of seismic shift in the “External World,” let alone how far, wide and deep the resulting real-world changes will ultimately reverberate – forever breaking and changing coincident “correlations”?
Consumer behavior – supply chain management – financial risk – medical diagnosis – this will be relevant to a far wider audience than my local yoga studio…