The last year has seen no shortage of unprecedented circumstances. All aspects of our lives, from work to travel to shopping, have changed. During this massive disruption, we have (unfortunately) learned why MLOps – the practice of machine learning (ML) in production and the management of the ML lifecycle – should not be an afterthought but rather a critical element of getting value from AI.
So – what happened?
Figure 1 below shows a simplified example of an AI model in action. First trained on data – past examples of the environment – the model is then put into the real world to make predictions on new inputs, which are implicitly assumed to be sufficiently similar to the training examples. With COVID, many scenarios occurred that were unlike anything in the past data.
Figure 1: AI uses data to train a model. The model is then used to predict answers to new questions …
For example, last year, I noticed that an online retailer’s website had started recommending baking goods to me regardless of what product I was viewing – even though I had never bought any such product from this retailer. A plausible reason is that the AI powering the product recommendations had never seen the kind of rampant purchasing of baking goods that had recently occurred, and was unable to adjust and recommend well-related products given this abrupt sea change in buying patterns. Is this acceptable or unacceptable? It depends…
Most AI will make predictions for any input data that comes in. Since ML is statistical by nature, a range of answers is “acceptable”. However, ML is quite capable of producing very unacceptable answers. The question is: when do we go from the edge of acceptable to entirely unacceptable? How do we detect this, and how do we fix it?
Where does MLOps fit in?
While COVID-19 may have brought such events to many companies at the same time, they are expected events in the life of a production ML service. MLOps is the practice of machine learning in production, covering, among other things, the behavior and diagnostics of production ML and its relationship to other stages of the ML lifecycle – such as training and the original source data.
In our initial breakdowns of MLOps, my team and I at ParallelM called the particular area that the COVID failures highlighted ML Health – the notion of ensuring that production ML operates correctly in the face of unexpected real-world issues. ML Health includes monitoring, managing, and root-causing ML issues in production.
COVID-triggered behavior patterns are causing an ML Health issue called Drift. Many types of AI learn from examples: the AI studies these examples to learn patterns that are codified as models, and the models are then used to make predictions on new data. While this approach is incredibly powerful, its core assumption is that past data contains patterns that remain appropriate for new predictions. Drift occurs when this core assumption breaks down.
So – how can COVID-19 cause drift? For example, restaurant closures have likely changed the grocery purchase patterns of many households, resulting in capacity-forecasting AI applications receiving very different inputs than what was historically the case for this time of year.
This type of problem does not just occur during worldwide pandemics. Simple mistakes can cause this problem too. For example, if your AI takes temperature as input and was trained on Fahrenheit, accidental entries of temperatures in Celsius will generate drift.
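The temperature example above can be sketched in code. The snippet below is a minimal, illustrative drift check (the function name, data values, and the three-standard-deviation threshold are my own assumptions, not from any particular product): it compares the mean of live inputs against statistics from the training data and raises an alert when the shift is large – exactly what a Celsius entry into a Fahrenheit-trained system would trigger. Real monitoring systems would use proper statistical tests and per-feature baselines.

```python
import statistics

def drift_score(train_values, live_values):
    """Return how many training standard deviations the live mean
    has shifted from the training mean (a simple drift signal)."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    live_mu = statistics.mean(live_values)
    return abs(live_mu - mu) / sigma

# Hypothetical temperatures the model was trained on (Fahrenheit).
train_temps = [68.0, 70.5, 72.0, 69.5, 71.0, 73.5, 67.0, 70.0]

# Live inputs accidentally entered in Celsius (roughly 20-23 C).
live_temps = [20.0, 21.5, 22.0, 23.0]

# Alert if the live mean is more than 3 training std devs away.
if drift_score(train_temps, live_temps) > 3.0:
    print("ALERT: input distribution has drifted from training data")
```

Even this toy check would catch the unit mix-up immediately, because Celsius readings sit tens of standard deviations away from the Fahrenheit training distribution.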
Drift can do anything from triggering hidden bugs in your prediction code to degrading prediction quality. Unlike other types of software, which fail outright or generate errors, drift-caused AI prediction failures are silent: your AI will keep making bad predictions, causing downstream applications to behave suboptimally or even creating business or legal risk.
But this will go away when COVID-19 goes away – right?
No. This kind of AI problem is endemic to how AI works. COVID-19 caused a massive business disruption and triggered many instances of Drift, but Drift can occur anytime that a business’ assumptions of the future do not match its history of the past. As we come out of the pandemic, we will be in a third uncharted territory, not like the last year but not exactly like the pre-pandemic world either.
Protecting your business from Drift-related risk
For businesses that rely on AI for anything from product recommendations to supply chain or capacity planning, these kinds of drift can have disastrous fiscal consequences. So what can businesses do?
I am an entrepreneur and technologist in the AI space and the CEO of Pyxeda AI. Previously, I co-founded ParallelM and defined MLOps (Production Machine Learning and Deep