Understanding the Azure MLOps Framework

Problem

The DevOps CI/CD paradigm is prevalent across a variety of Azure PaaS services. MLOps, or DevOps for Machine Learning, allows Data Science, IT, and Business teams to accelerate model creation and deployment by applying automated continuous integration and deployment best practices. Machine learning models can be monitored, validated, and governed using well-designed MLOps paradigms. Large, geographically dispersed enterprises are looking for ways to successfully deploy MLOps Frameworks for multiple Data Science teams across their organizations. As they champion MLOps across their businesses, they want to understand Azure MLOps Frameworks and patterns of success.

Solution

Azure Machine Learning, coupled with Azure DevOps, makes the concept of the Azure MLOps Framework a reality. The framework can be provisioned through automated CI/CD Azure DevOps (ADO) pipelines, potentially triggered by ServiceNow. With this, a Data Science team can submit a request via their ServiceNow ticketing system to have the MLOps Framework created for them. Upon creation, they would have immediate access to a variety of environments including Experimentation, Development, Staging / QA, and Production.

Typically, the idea is to split the provisioning into two critical steps: 1) Provisioning of the Experimentation environment and 2) Provisioning of the remaining environments within the framework once experimentation is complete and the team is ready for productionizing the models. The AML Framework would ideally be connected to a centrally managed Data Hub which would facilitate the egress of test and training batch and real-time data. Additionally, the Data Hub would support the ingress of model inference data.


The AML Framework would also be connected to a centrally managed AML Model Management account, which would store and track versioned ML models that are destined for production outside of the AML Framework for added security and backup best practices. All of these AML Framework environments will be interconnected with DevOps CI/CD pipelines and source repositories to allow for seamless and automated deployment of models from initial experimentation to production-ready web service endpoints. In this article, you will learn more about this MLOps Framework by gaining a deeper understanding of its architecture, the personas involved, the various use cases for each environment, and the key features of the framework.

Framework

The Figure below illustrates the MLOps Framework, which includes an Experimentation, Development, Staging / QA, and Production environment. These environments are interconnected through Azure DevOps CI/CD pipelines along with Git source repository integration every step of the way.

Experimentation

  • Develop the feature engineering code/steps required to pre-process the data for our model.
  • Develop the training script that trains the model on the prepared data.
  • Develop the initial inference scripts.
  • Create initial pipelines for the training and scoring of the models.
  • Promote these to dev.
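The experimentation steps above typically culminate in a standalone training script that a pipeline can later run unattended. Below is a minimal, stdlib-only sketch of such a script; the feature-engineering step, the "model" itself, and the argument names are placeholders, not the framework's actual code.

```python
import argparse
import json
import pickle
from pathlib import Path


def engineer_features(rows):
    """Placeholder feature-engineering step: derive a scaled feature."""
    return [{**row, "feature": row["raw_value"] / 100.0} for row in rows]


def train(rows):
    """Placeholder 'model': just the mean of the engineered feature."""
    features = [row["feature"] for row in rows]
    return {"type": "mean-model", "mean": sum(features) / len(features)}


def main(data_path: str, model_out: str) -> dict:
    """Read training data, train, and persist the model artifact."""
    rows = json.loads(Path(data_path).read_text())
    model = train(engineer_features(rows))
    Path(model_out).write_bytes(pickle.dumps(model))
    return model


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--model-out", required=True)
    args = parser.parse_args()
    main(args.data_path, args.model_out)
```

Because the script takes its inputs and outputs as arguments, the same code can run interactively during experimentation and, unchanged, inside the dev training pipeline.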

Development

  • Train model with training pipeline automatically triggered on the promotion to dev.
  • Run inference against the model using the scoring script, automatically triggered on the promotion to dev.
  • ML Engineer / Data Scientist can then improve the code quality of the pipelines if required before running a final training for the model.
  • The model artifact is then registered in dev and can be flagged using metadata tags for promotion into the upper environments along with the scoring scripts.
  • Trigger a promotion to pre-prod, which takes the flagged model(s) from dev along with their scoring scripts and copies them through to pre-prod.
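The metadata-tag flagging in the steps above can be illustrated with a small sketch. The registry here is an in-memory dict standing in for the AML model registry, and the `promote` tag name is illustrative; in a real workspace you would set tags on the registered Model object.

```python
def flag_for_promotion(registry, name, version, target="pre-prod"):
    """Tag a specific model version so the CD pipeline can find it.

    `registry` is a dict keyed by (name, version) simulating the AML
    model registry; the 'promote' tag name is a hypothetical convention.
    """
    model = registry[(name, version)]
    model["tags"]["promote"] = target
    return model


def models_to_promote(registry, target="pre-prod"):
    """Return the (name, version) pairs flagged for the given environment."""
    return [
        key
        for key, model in registry.items()
        if model["tags"].get("promote") == target
    ]
```

The point of the pattern is that the promotion pipeline never hard-codes a model name or version; it simply collects whatever the ML Engineer has flagged.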

QA / Production

  • Deploy the model and scoring scripts to the managed endpoint service (or AKS)
  • The target location for input data should be specified in the inference pipeline.
  • The schedule for inference should be specified in the inference pipeline for each deployed model and scoring script combination
  • Ideally, for batch implementations, there would be the ability to configure messaging that occurs between the Central Data Hub and the deployed models that tells the framework when a new triggering file has landed in the hub and automatically commences the inference process and then writes the data back into the hub.
Personas

There are a few personas that are critical to the MLOps Framework. These personas represent the key players in the MLOps Framework across the various environments.

  • Data Owner: Responsible for the data being used within the framework to create data science solutions.
  • Data Scientist: Responsible for taking the initial data, performing exploratory analysis, building and testing models, and selecting models (in conjunction with the business stakeholders) that should be considered for production.
  • ML Engineer: Responsible for both software development and Data Science and for taking the work of the Data Scientist and getting it into a production-ready form.
  • Data Science Product Owner: Responsible for overall ownership of the project that is being built and has responsibility for the quality of the eventual output and utility of the Data Science product created and put into production.
  • Automation Account: This persona is not a person but an automated account used to reduce manual interaction with the framework accounts once models are promoted beyond the development environment.

The table below illustrates each main persona’s purpose and interactivity across the various environments. As the MLOps process progresses through the higher environments all the way to Production, manual interactivity decreases as CI/CD manages most of the automated deployment steps.

Azure MLOps Framework Features

Cookie Cutter Templates: A Data Scientist will have access to custom AML cookie-cutter templates as part of the ADO Framework provisioning process, which provide standard patterns of success for developing machine learning solutions in Azure using the AML workspace. The template provides features such as:

  • Utilities to create conda environments for training, validate and register models, configure data drift, and call real-time endpoints.
  • A training folder with the ability to submit training runs to training clusters.
  • A scoring section that provides a template for developing scoring/inferencing code.
  • Azure DevOps pipelines to deploy models, evaluate multi-stage models, register batch pipelines, etc.
  • Monitoring scripts and pipeline connectivity scripts.

Model Development: The AML Workspace supports custom model development. A Data Scientist will be able to use code-driven development software such as Visual Studio Code to develop and test their models in addition to and as an alternative to the AML Workspace.

Model Evaluation: Evaluation tests compare the new model with the existing model. Only when the new model is better does it get promoted. Otherwise, the model is not registered and the pipeline is canceled.
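The evaluation gate described above reduces to a simple comparison, sketched below. The single-metric comparison is illustrative; real evaluation steps often compare several metrics.

```python
def evaluation_gate(new_metric, current_metric, higher_is_better=True):
    """Return True if the candidate model should be registered and promoted.

    Compares the new model against the existing production model on one
    metric (e.g. accuracy when higher_is_better, or RMSE when not). If no
    model exists yet, the first candidate passes by default.
    """
    if current_metric is None:  # no existing model to beat
        return True
    if higher_is_better:
        return new_metric > current_metric
    return new_metric < current_metric
```

When the gate returns False, the pipeline skips registration and cancels, exactly as the paragraph above describes.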

Model Registry: In addition to being registered in a central model registry account, models can also be registered locally with the development environment.

Model Operationalization: The process to operationalize the model will be orchestrated and triggered by an Azure DevOps model artifact trigger within the training pipeline.

Model Training & Re-Training: The training Python script is executed on the AML Compute resource to produce a new model file, which is stored in the run history. The AML pipeline within the Development environment orchestrates the process of both training and retraining the model in an asynchronous manner. Retraining can be triggered on a schedule or when new data becomes available by calling the published pipeline REST endpoint. Once new data is passed to the deployed model on the managed endpoint, the AML pipeline within the development environment will initiate the process of re-training the model, versioning the improved model in the central model registry account, and re-deploying the model into the pre-production environment.
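Calling the published pipeline REST endpoint to trigger retraining might look roughly as follows. This sketch only builds the request rather than sending it; the endpoint URL and the `training_data_path` parameter name are hypothetical, while the body shape (an experiment name plus parameter assignments) follows the published-pipeline REST convention.

```python
import json


def build_retraining_request(endpoint_url, experiment_name, new_data_path):
    """Assemble (but do not send) the REST call that triggers a published
    AML retraining pipeline when new data lands.

    The pipeline parameter name 'training_data_path' is illustrative; use
    whatever parameters your published pipeline actually declares.
    """
    body = {
        "ExperimentName": experiment_name,
        "ParameterAssignments": {"training_data_path": new_data_path},
    }
    headers = {
        "Authorization": "Bearer <aad-token>",  # acquired via Azure AD
        "Content-Type": "application/json",
    }
    return {"url": endpoint_url, "headers": headers, "data": json.dumps(body)}
```

A scheduler or data-arrival event handler would then POST this request, e.g. with `requests.post(**build_retraining_request(...))`.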

Model Inference Data: With bi-directional connectivity to the Azure Data Hub, model inference data can be written back to the hub for further analysis, and new real-time and batch training data can be passed to the deployed model on the managed endpoint.

Central AML Model Management: In an effort to back up and retain the most relevant models that are destined for production, a Central AML Model Management account residing outside of the AML Framework within a separate subscription will retain these versioned, production-ready models. Additionally, this central model management account could support experiment tracking and more.

Testing: Code Quality tests ensure that the code conforms to the standards of the team; Unit tests make sure the code works, has adequate code coverage, and is stable; Data tests verify that the data samples conform to the expected schema and distribution. The data tests can be customized for other use cases and run as a separate data sanity pipeline that gets triggered as new data arrives. For example, the data test task can be moved to a data ingestion pipeline so the data is tested earlier.
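The data tests described above can be as simple as a schema check plus a coarse distribution check. A stdlib-only sketch, where the column names and tolerance are illustrative:

```python
def check_schema(rows, expected_columns):
    """Data test: every row must contain exactly the expected columns."""
    expected = set(expected_columns)
    return all(set(row) == expected for row in rows)


def check_distribution(values, expected_mean, tolerance):
    """Data test: the sample mean must stay within `tolerance` of the
    mean observed at training time (a very rough drift check)."""
    mean = sum(values) / len(values)
    return abs(mean - expected_mean) <= tolerance
```

In a data sanity pipeline, a failure of either check would stop the run before any training or scoring happens on malformed data.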

Scalable AML Compute: A Data Scientist will be able to manually create and scale AML Compute as they develop and test models in the experimentation environment. AML Compute is an on-demand cluster of virtual machines with automatic scaling and GPU and CPU node options. Users will be able to specify size and lifecycle rules for the AML Compute.

Code Repository: A Data Scientist will be able to link their experimentation environment to a code repository such as Git, which will be linked to an Azure DevOps CI/CD process. This process will check the code into the artifact repo and trigger a CI Build pipeline, a training pipeline, and a CD Release pipeline, which will ultimately deploy the model to the higher environments (DEV, Staging / QA, PROD) with gated manual approvals along the way.

Access to Data: A Data Scientist will be able to perform exploratory analysis of their data by registering data sets and visualizing the data. They will have access to internet data via online code repositories, package feeds (PyPI, Maven), and more. Secure spoke-to-spoke virtualized connectivity can be established from the Experimentation environment to the Central Data Hub account. With this connection, a Data Scientist will be able to create a Data Store within their local Experimentation account with a virtualized connection to the Data Hub account. Also, a Data Scientist will be able to upload their own data within their local experimentation environment’s Azure Data Lake Storage Gen2 account. From there they can create Data Stores and register Data Sets to begin working with the data.

Azure Machine Learning Workspace
  • A Data scientist uploads the data into their local Data Lake Storage Gen 2 account within the framework.
  • A Data scientist mounts the data into the Compute Instance for data preparation and feature engineering.
  • The data store is registered in the AML workspace.
  • The data set is created using the above data store.

When a training run is scheduled, the dataset is mounted to the training cluster. During training, the data is read from the Data Lake Store.

Feature Engineering: Feature selections and calculations can be achieved through the AML Workspace by applying statistical tests to inputs, given a specified output, to determine which columns are more predictive of the output.

Secure AML Workspace Access: A Data Scientist and ML Engineer will have secure access to the Experimentation AML workspace and associated resources which will be secured inside a virtual network using private endpoints. These services can only be accessed via VPN. Direct access to the internet is blocked. Additionally, the Data Scientist and ML Engineer can securely access the AML workspace and associated resources using Azure Active Directory’s role-based access control (RBAC).

Data Encryption: A Data Scientist and ML Engineer will have secure, encrypted access to data in all datastores via encryption at rest and during transport via TLS.

Resource Locks: A Data Scientist and ML Engineer will have peace of mind that their resources will not be accidentally or intentionally deleted, since ‘Resource Locks’ will be used to protect resources from deletion.

Azure Monitor: SLA endpoint metrics such as throughput and latency along with CPU/GPU utilization can be monitored with out-of-the-box integration. Alerts can be set for threshold breaches. Azure Monitor supports Application Insights and Log Analytics for deeper performance, trend, and issue identification and analysis. Performance can be analyzed by enabling integration with App Insights. Issues and Trends can be identified and analyzed with Log Analytics. Also, development details can be monitored through the AML Workspace UI within each environment.

Managed Endpoint Model Deployment:

  • Online/Real-Time Scoring: With Managed Infrastructure, the provisioning and hosting of computing, management of host OS images, and system failure node recovery are all managed by Azure. Also, blue/green deployments support safe rollouts of new versions of models under the same endpoint, gradually diverting traffic to the new version while validating that there are no errors or disruptions. Secured resources can be accessed by both user-assigned and system-assigned managed identities. Debugging via local Docker environment endpoints is supported. Managed Endpoints are integrated with Azure cost analysis to view costs at the endpoint and deployment levels.
  • Batch Scoring: Batch endpoints support batch inferencing with no-code deployment for models registered with MLFlow. Once the model and compute target are specified, the batch endpoint will be ready for use, and the scoring script and environment will automatically be generated. Managed batch endpoints support AML registered source datasets, internet data sources, and locally stored source datasets. The outputs can be set to any data store. Once a batch inference job is triggered by a batch endpoint, the compute resources are automatically provisioned when the job begins and de-allocated once the job completes. There is also the capability to override compute resource settings such as for instance count, mini-batch size, error threshold, and more for the batch inference jobs to enhance performance and manage cost.
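The per-job resource overrides mentioned above (instance count, mini-batch size, error threshold) can be sketched as a small settings builder. The keys mirror the batch-deployment settings named in the text, but the defaults and any extra keys here are illustrative.

```python
def batch_job_settings(instance_count=1, mini_batch_size=10,
                       error_threshold=-1, **overrides):
    """Assemble per-job overrides for a batch inference run.

    error_threshold = -1 conventionally means 'ignore all item failures';
    the remaining defaults are illustrative, not AML's actual defaults.
    Extra keyword arguments pass through unchanged so callers can tune
    other job settings without editing this helper.
    """
    settings = {
        "instance_count": instance_count,
        "mini_batch_size": mini_batch_size,
        "error_threshold": error_threshold,
    }
    settings.update(overrides)
    return settings
```

Scaling a large nightly scoring job up, for example, becomes a one-line change: `batch_job_settings(instance_count=8, mini_batch_size=100)`.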

Environments

In the following sections, you will learn more about the details, use cases, architecture, and interactivity between each of the environments within the MLOps Framework: Experimentation, Development, Staging / QA, and Production. Additionally, you will learn about Azure DevOps’ CI/CD role within each of these environments.

Experimentation

The first environment of the MLOps Framework is the Experimentation environment, which is designed to have access to a full suite of Data Science tooling and is designed to be highly interactive. In this environment, a Data Scientist will access data for the project, explore this data, build and test models, tune hyperparameters, and select promising models for promotion to development and onwards.

A Data Scientist working in the experimentation environment will need to be able to create AML Datastores in their workspace to allow virtualized access to project data stored within a central Azure Data Hub in order to have a single location within the environment with access to project data which can be used to register AML datasets, visualize the data, and perform exploratory analysis to better understand the project data.

The Data Scientist will need to be able to save working copies of data along with intermediate analysis outputs back to a local storage account within the experimentation environment in order to rapidly iterate analysis without having to run full transformations on the raw data each time changes are made. They will also need to be able to create training and test data sets for supervised learning tasks and write these back into a central Azure Data Hub to preserve work that could be used to train a production model in a persistent environment within the project. Examples of data written back could be tabular data with a target column containing manual labels, images with object labels, and bounding box coordinates.

A Data Scientist working within the experimentation environment will need to be able to formalize feature engineering steps, code quality, and data quality into scripts in order to deploy them as part of both the training and inference pipelines for a model promoted beyond experimentation. This will allow for the formalizing of the experimental code into a deployable training process.

Architecture

The Figure below contains the subset of the Azure MLOps Architecture diagram as it relates to the specific details of the Experimentation environment and its interaction with a central Data Hub, and with the DevOps CI/CD pipeline to publish, commit, and prepare models for the promotion process into the higher environments (DEV, QA & PROD).

A Data Scientist can establish a virtualized connection to a Data Store residing within a central Data Hub in the same production subscription. Additionally, they can register datasets using this Data Store. The Data Scientist will also have access to a local Azure Data Lake Storage gen2 account within their experimentation environment along with access to Internet-sourced data. Once they complete the development and testing of their models, the Data Scientist can commit the code to a linked code repository account which will then trigger a build pipeline using the artifacts. This will initiate the process of deploying the models to the higher environments (DEV, QA, PROD).

Figure: Azure MLOps Architecture details as they relate to the Experimentation environment

Azure DevOps CI/CD – Continuous Integration (CI) Build Pipeline

The CI pipeline is triggered every time code is checked into the Git repository. It publishes an updated AML pipeline after building the code and running a suite of tests. The build pipeline consists of Code Quality, Unit, and Data tests.

A build pipeline on Azure DevOps can be scaled for applications of any size. Build pipelines have a maximum timeout that varies depending on the agent they run on. Builds can run indefinitely on self-hosted (private) agents. On Microsoft-hosted agents, builds can run for up to six hours for public projects; for private projects, the limit is 30 minutes.
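A CI build pipeline of this shape might be declared in `azure-pipelines.yml` roughly as follows. The step commands, script paths, and tool choices (flake8, pytest) are illustrative; the structure (trigger on check-in, then code quality, unit, and data tests before publishing the AML pipeline) follows the description above.

```yaml
trigger:
  branches:
    include: [main]

pool:
  vmImage: ubuntu-latest

steps:
  - script: pip install -r requirements.txt
    displayName: Install dependencies
  - script: flake8 src/
    displayName: Code quality tests
  - script: pytest tests/unit --cov=src
    displayName: Unit tests
  - script: pytest tests/data
    displayName: Data tests
  - script: python src/publish_aml_pipeline.py
    displayName: Publish updated AML pipeline
```

Because every check-in runs the same gauntlet of tests, only code that passes code quality, unit, and data checks ever reaches the published AML pipeline.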

Development

The second environment of the MLOps framework is the Development environment. An ML Engineer working in the Development environment will need to be able to enhance the quality of the code written by the Data Scientists, which has been published from experimentation into the development environment, to make it more efficient and remove unnecessary package references in order to prepare the code for deployment to production. The ML Engineer will need to re-factor and add new unit tests to the code written in experimentation in order to improve test coverage if required and make the code base more loosely coupled, ensuring that the code is production-ready and more easily maintainable.

The ML Engineer and Data Scientist that are working within this development environment will need to be able to run training pipelines in order to train models that can be promoted to production. They will also need to run inference pipelines that have been written using the models generated from the training pipeline against test data in order to produce inference results and validate these before the model gets promoted into the pre-production environment. Being able to trigger the deployment of these pipelines and models into the Staging / QA environment via Azure DevOps (ADO) pull requests will be critical to reduce the time taken to promote code and ensure that it is done in a controlled fashion.

The ML Engineer working in the Development environment will need to be able to register the model(s) developed within the environment within a model registry in order to make it available for deployment for inference in the pre-production and production environments. They will also need to be able to manually trigger retraining of an existing model on revised training data in order to update the model to ensure its accuracy. Additionally, they will need to be able to specify a time-based re-training schedule for models so that it is automatically updated on a schedule in order to maintain the accuracy of models. Versioning will be critical and the ML Engineer will need to be able to specify the exact model version and name that will be promoted to pre-production (for example by the use of a metadata flag in the model registry) in order to ensure that the correct model is promoted from development into pre-production.

Architecture

The Figure below contains the subset of the Azure MLOps Architecture diagram as it relates to the specific details of the Development environment, its interaction with the Data Hub, Model Registry, and with the DevOps CI/CD pipeline promotion process into the remaining upper environments (QA & PROD).

Once the approval process is completed from the Experimentation environment, promising models will be promoted to the Development environment where the ML Models will be trained using managed AML compute, evaluated, and registered both locally and within a Central AML Model Registry account within a production spoke. The primary role which will facilitate the model hardening within this development environment will be the ML Engineer.

Interaction with the Data Hub will be permitted within this environment from the perspective of establishing connectivity to the Data Hub via Data Stores and Datasets. This Development environment will support both manual and automated re-training pipelines. The desired model, which is stored in the AML Model Registry account, can be operationalized via a model artifact trigger of a manually gated approval process through the framework’s project ADO Continuous Deployment (CD) release pipeline. This will deploy the trained and evaluated model into the Staging / QA environment, from which it will progress through further processing and pipeline promotions in the remaining upper environments (Staging / QA and PROD).

Figure: Azure MLOps Architecture details as they relate to the Development environment

Azure DevOps CI/CD – Training Pipeline

Since the purpose of this Development environment is to support model training, evaluation, and registry, the process to operationalize the model will be orchestrated and triggered by an Azure DevOps model artifact trigger within the training pipeline. This pipeline will include a manual gated approval process which will then promote the model into the higher environments (Pre-Prod and Prod) through a Continuous Deployment (CD) release pipeline. It is within the Staging / QA and Prod environments where the model will be packaged into images, scored, and tested. This model inference data will then be saved to the Data Hub.

Staging / QA

The purpose of the Staging / QA environment is to host and test models that are under consideration for promotion to production deployments. The intent is to run these models in realistic settings so that their performance under production conditions can be assessed. The pre-prod environment assumes that code has been developed, tested, and checked into the development branch that has then been promoted to the Test environment through manual DevOps gating procedures. This environment is designed to be entirely automated with little to no manual user interaction.

An automation role within the Staging / QA environment will be able to deploy the inference pipeline designed in the development environment to perform feature engineering on incoming inference data and push this data to a model hosted in Azure Container Instances, Azure Kubernetes Service (AKS), or a Managed Endpoint. This process will generate inference results that can be written back to the Data Hub and made available for consumption by end-users. Through an automated process controlled by code merges and pull requests into the pre-production environment, an ML Engineer will be able to deploy models trained in the development environment and stored within the model registry, along with any pre-processing pipelines, so that they can perform inference on new batch and streaming data landing in the Data Hub.

Additionally, automated inference batch job pipelines will be triggered so that when new data lands in the Data Hub, the pipelines will be triggered and inference will run on the new data. Data Scientists and Product Owners will have the capability of manually approving models promoted into production to ensure that only models that are manually reviewed and approved are promoted into the production environment. Data Scientists will be able to view the output inference results from models hosted in the pre-production environment and compare them against the output of models hosted in the production environment in order to validate that the new model results are an improvement over the existing model before it is promoted to production.

Architecture

The Figure below contains the subset of the Azure MLOps Architecture diagram as it relates to the specific details of the Staging / QA environment and its interaction with the Data Hub and with the DevOps CI/CD pipeline to publish, commit, and prepare models for the promotion process into the production environment.

This environment is intended to be fully automated through the Azure DevOps continuous deployment (CD) release pipeline, which will be triggered by a manual approval gate from the development environment. The pre-production-ready model will be packaged into a Docker image and deployed as a web service on a Managed Endpoint, an Azure PaaS offering comparable to AKS, with the difference being that the infrastructure of a Managed Endpoint is fully managed by Azure and also supports blue/green deployments. Additional tests can be written within this environment.

Additionally, through the AML Workspace within this environment, batch scoring can be achieved through managed batch endpoints, and data drift can also be monitored in this workspace. Once the model is deployed to a managed endpoint, model inference data can be written back to the Azure Data Hub. The connectivity to the Azure Data Hub will be bi-directional, enabling both batch and real-time data to be sent to the managed endpoint scoring service. This will then trigger a re-training pipeline within the development environment to ensure that the most accurate and up-to-date model is stored in the central model registry account and re-deployed in the pre-production environment. Finally, Azure Monitor can be used within this environment to track metrics such as throughput, latency, CPU/GPU utilization, performance, cost, issues, and trends.

Figure: Azure MLOps Architecture details as they relate to the Staging / QA environment

Azure DevOps CI/CD – Continuous Deployment (CD) Release Pipeline

The release pipeline is intended to operationalize the scoring image and then safely promote it across higher environments and is subdivided into the Staging / QA and Production environments. This section will cover specifics of the release pipeline in relation to the Staging / QA environment.

  • Model Artifact Trigger. Release pipelines are triggered each time a new artifact becomes available. When a new model is registered and versioned within AML Model Management, it is treated as a release artifact and a pipeline is triggered for each new model which is registered.
  • Scoring Image Creation. The registered model is packaged together with a scoring script that will need to be created along with Python dependencies into an operationalized Docker image.
  • Container Instance, Managed Endpoint, or Azure Kubernetes Service (AKS) Deployment. For testing purposes, a non-production deployment of the scoring image can be created using Azure Container Instances, AKS, or a Managed Endpoint.
  • Web Service Testing. API tests ensure that the image is successfully deployed.
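The web service test in the last step can be a simple smoke test: POST a known input to the scoring URI and validate the response. Below is a sketch of just the validation logic (the HTTP call is left to the caller); the `predictions` response key is a hypothetical payload schema, not a guaranteed AML response shape.

```python
def smoke_test_response(status_code, body, expected_keys=("predictions",)):
    """Check that a scoring endpoint answered a test request sensibly:
    HTTP 200 and a JSON body containing the expected keys.

    `body` is the parsed JSON response; `expected_keys` describes the
    payload schema the deployment is supposed to return.
    """
    if status_code != 200:
        return False
    return all(key in body for key in expected_keys)
```

In the release pipeline, a failing smoke test would fail the deployment stage, preventing a broken scoring image from being promoted further.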

Production

The final environment of the MLOps Framework is the Production environment, which is very similar to the Staging / QA environment from a functionality perspective, with the exception that this environment supports and hosts production-ready model endpoints that are available for consumption. This environment will host models in production and write the production-level inference data back to the Azure Data Hub for consumption by the business.

Architecture

Quite similar to the Staging / QA environment, the architecture of the Production environment will also support the deployment of scoring images as a web service at scale on either AKS or Managed Endpoints. This environment will also support testing of the web service to ensure that the image is successfully deployed. Models that are being productionized will also be written back to the Central AML Model Management account within this stage of the process.

Figure: Azure MLOps Architecture details as they relate to the Production environment

Azure DevOps CI/CD – Continuous Deployment (CD) Release Pipeline

Similar to the Staging Environment, the release pipeline within the production environment will operationalize the scoring image and then make it available for consumption via a web service endpoint.

Resources: mssqltips.com

Ruben Harutyunyan
