MLOps: An overview
Kumar Sambhav, Research Computing, NUS Information Technology
MLOps is quickly emerging as a critical component of data science projects at the enterprise level. It helps organisations achieve their short- and long-term goals and generate value. MLOps has become a key part of successful data science strategies and has recently attracted a great deal of interest.
The need for machine learning operations, or MLOps, arises from the fact that although the steps required to go from a business model to a machine learning model seem quite straightforward, the actual implementation takes a lot of effort and careful management. Traditional organisations with very few machine learning models to manage may not see its importance, but for those with multiple models growing at different paces, a strategic initiative to manage their AI growth becomes imperative. The reality is that the tooling, infrastructure and technology decisions required for an enterprise machine learning lifecycle are invariably complex.
The main challenges that machine learning projects face when done at scale are as follows:
- Dependencies are difficult to manage, both in terms of data and in terms of technology.
- In a big organisation, not everyone works with the same skill set or has the same experience with programming languages.
- And at the enterprise scale, entire machine learning pipelines have to be tested constantly, both for accuracy and for how well they integrate with the existing environment (a minimal sketch of such an automated check follows this list). However, data scientists are usually not trained in software engineering, and it is also not advisable to let their focus shift towards DevOps; they should be concentrating on core data science.
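As an illustration of the kind of automated test mentioned above, here is a minimal sketch, assuming a scikit-learn model serialised with joblib, a held-out validation CSV and a 0.90 accuracy threshold (all hypothetical), of an accuracy gate that a CI pipeline could run before a model is promoted to production.

```python
# test_model_gate.py -- illustrative accuracy gate for a CI pipeline.
# The artifact path, validation file and threshold are hypothetical.
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # assumed minimum acceptable accuracy


def test_candidate_model_meets_accuracy_threshold():
    # Load the candidate model and a held-out validation set
    model = joblib.load("artifacts/candidate_model.joblib")
    validation = pd.read_csv("data/validation.csv")

    X = validation.drop(columns=["label"])
    y = validation["label"]

    accuracy = accuracy_score(y, model.predict(X))

    # Fail the CI run (and block deployment) if accuracy regresses
    assert accuracy >= ACCURACY_THRESHOLD, (
        f"candidate accuracy {accuracy:.3f} is below the "
        f"required threshold of {ACCURACY_THRESHOLD:.2f}"
    )
```

A test like this lets the pipeline, rather than an individual data scientist, decide whether a candidate model is fit to ship.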
The critical difference between MLOps and DevOps is that DevOps focuses on software code, whereas in MLOps both the software and the data (huge volumes of it) are variables: they are constantly changing and adapting to new inputs. Because MLOps has to manage both code and data, it is a discipline in its own right.
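To make this concrete, below is a minimal sketch, assuming a git-based project and hypothetical file paths, of recording both the code commit and a hash of the training data alongside a model artifact, so that a change in either produces a new, traceable model version.

```python
# Illustrative sketch: tie a model artifact to both its code and data versions.
# File paths and the registry format are assumptions for illustration.
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def file_sha256(path: str) -> str:
    """Hash a file so that any change to the training data is detectable."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_model(model_path: str, data_path: str, registry_path: str) -> dict:
    """Record the model together with its code commit and data hash."""
    entry = {
        "model_artifact": model_path,
        "data_sha256": file_sha256(data_path),
        "code_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(registry_path, "a") as registry:
        registry.write(json.dumps(entry) + "\n")
    return entry
```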
MLOps to Mitigate Risks
Publishing ML models to production without a proper MLOps infrastructure in place is risky for many reasons. Models that have not been properly validated before being pushed to production can lead to catastrophic failures.
The risks are usually higher for larger models and for those that are deployed widely and used outside the organisation. A few checks that managers can perform to mitigate these risks are:
- Check the model for input drift, or in simple terms, whether the incoming data is still a good representation of the problem at hand (a minimal sketch of such a check appears at the end of this section).
- Check the model for concept drift, or in simple terms, whether the model remains relevant given the change in the data.
- Check the performance of the model, that is, whether it is getting the job done within an acceptable time.
- Check that resource utilisation does not increase drastically with a newer model.
The above mitigation tasks are a good starting point. Note that they are not exhaustive; a deeper dive into the machine learning lifecycle is required to ensure a seamless experience for end users in production.
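As a starting point for the first check above, the following is a minimal sketch of an input-drift test that compares the distribution of a single numeric feature observed in production against a reference sample from training time, using SciPy's two-sample Kolmogorov-Smirnov test. The sample values, sizes and significance level are assumptions for illustration; real drift monitoring would cover every feature, including categorical ones.

```python
# Illustrative input-drift check: compare live feature values against the
# distribution seen at training time. The data and alpha level are
# hypothetical; real monitoring would cover every feature, not just one.
import numpy as np
from scipy.stats import ks_2samp

ALPHA = 0.01  # assumed significance level for flagging drift


def detect_input_drift(reference: np.ndarray, incoming: np.ndarray) -> bool:
    """Return True if the incoming values look drawn from a different
    distribution than the reference sample (two-sample KS test)."""
    statistic, p_value = ks_2samp(reference, incoming)
    return p_value < ALPHA


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time data
    live_sample = rng.normal(loc=0.4, scale=1.0, size=1_000)   # shifted live data

    if detect_input_drift(train_sample, live_sample):
        print("Input drift detected: investigate before trusting new predictions.")
    else:
        print("No significant drift detected for this feature.")
```

Concept drift, by contrast, usually has to be monitored by comparing the model's predictions against ground-truth labels as they become available, since the input distribution alone cannot reveal whether the learned relationship still holds.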