Scaling Data Science development

Hugo Luchessi · Published in CodeX · May 25, 2021

Around ten years ago, Software Development was far less productive than it is today. Code versioning (when present at all) was handled with file locks or merging techniques that often caused code loss. Testing (again, when done) was executed manually. Deployment meant zipping all the code and copying it to the server via SSH, FTP, or even physically with a flash drive. Horizontal scaling meant buying new bare-metal machines and putting them in a refrigerated room (or under some developer’s desk).

With the rise of the DevOps mindset, most companies began to develop higher-level abstractions over the infrastructure underlying their software and to implement processes around those concepts. Development methodologies, Infrastructure as Code, Continuous Integration, and Continuous Delivery are now everyday concepts that emerged in this period and boosted software development’s productivity to where it is today.

Machine Learning demand has grown exponentially in recent years, and Data Scientists are now facing the same productivity dilemma Software Developers faced years ago. Exploring, training, and deploying fast may be the most appealing idea at first, but it can lead to a ton of trouble later. The term “technical debt”, commonly used in software development, can and should be applied to Data Science as well.

Data Science development process

CRISP-DM process diagram

CRISP-DM is a good template for understanding the Data Science development process. It outlines a cyclic process, from understanding the business problem to evaluating a model (or, in a more general sense, an algorithm). Data Scientists will probably run through this cycle multiple times and make deployments along the way.

Data Scientists spend most of their time exploring, researching and testing hypotheses. This is where most of the code is built, usually in notebooks or personal scripts. The problem starts when these scripts and notebooks turn out to be the actual code of the production model.

This practice is not harmful in itself, but it usually leads to code quality problems and technical debt, and deployment ends up being done manually. For the first delivery this may cause no harm, but in the long run maintaining and evolving the model becomes a major issue.

Companies’ MLOps teams (or areas) are normally focused only on model deployment and serving. But there is a whole lot of trouble that Data Scientists have to deal with before a model is ready to be deployed. Like DevOps, MLOps is concerned with defining concepts and building a platform that makes not only deployment, but the entire development process, more productive.

The control process

We all know that Software Development processes are far from perfect, but they are light years ahead of Data Science development processes, so, for the sake of argument, they will be used as the baseline.

Delivering code continuously is a complex process and demands layers of standardization, plus tools to automate and monitor those standards. Although it differs from company to company, Software Development usually has three stages:

  • The development stage, where the developer writes the code to achieve a business goal.
  • The integration stage, which consists of building and testing the code to check whether it is ready to go to production.
  • The delivery stage, where the main goal is to deliver the new code to a production environment. This is also where we want to constantly monitor application health and business metrics.

Although it may appear simple, there are complex challenges to be dealt with throughout this process.

Collaboration and versioning

In a company, we must ensure that all developers are able to collaborate on the same application in the same codebase. This is almost a given nowadays: most companies already use some sort of code collaboration tool like Git or SVN.

Environment standardization

External libraries and other environment dependencies have often been the cause of problems in production. Using containers and package managers, we can replicate environments and deliver the exact same settings from the developer’s machine to production.

Continuous Integration

Continuously ensuring code quality is a must-have for companies that deliver code every day (or even more than once per day). Once the code is in a centralized repository, it is much easier to implement a tool that automates building and testing and, depending on the output, prevents bad code from being deployed to production. The output here is a validated package ready to be deployed.
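As a minimal sketch of what that gate looks like in practice (the module, function, and test names here are made up for illustration), the CI server simply runs the project’s test suite and stops the pipeline when it fails:

```python
# features.py -- a hypothetical piece of application code
def normalize(values):
    """Scale a list of numbers to the 0-1 range."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# test_features.py -- executed by the CI server with `pytest`;
# a non-zero exit code blocks the deployment of the package
from features import normalize

def test_normalize_bounds():
    result = normalize([2, 4, 6])
    assert min(result) == 0.0
    assert max(result) == 1.0
```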

Continuous deployment

Once the code has been tested and validated by the integration process, the package (a Docker image, for instance) must be deployed to the production infrastructure. There are many techniques for deploying this code, but that is not the point here.

Monitoring

With the code deployed, we must continuously ensure it is not broken by monitoring health and business metrics. This is also a very complex topic, but for this article, knowing it exists is enough.

Development productivity

With Continuous Integration/Deployment, Software Development teams started building pipelines to build, test, and deploy applications automatically. This practice took productivity and quality to another level.

The same process can be applied to Data Science but unfortunately we can’t use the exact same abstractions, because there are slightly different problems to be tackled.

For starters, as stated before, Data Scientists spend most of their time exploring and testing new algorithms. Setting up a safe environment to experiment in is a complex task and may take too much time.

Besides code, models also depend on data as part of their development. This means that, aside from the code, you will use data (often referred to as training data) to build the model. This makes it very hard to track model evolution: if you train your model with the same code but different data, you may get different metrics. Code versioning only controls one variable of this equation.
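A minimal sketch of the idea: even when the code lives in Git, the training data needs its own identifier (here just a content hash, recorded next to the metrics) so two training runs can be compared fairly. The file names, commit hash, and metric below are assumptions for illustration only:

```python
import hashlib
import json
from pathlib import Path

def fingerprint(path: str) -> str:
    """Content hash of the training data, so 'which data produced this model?' has an answer."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

# Hypothetical values: in a real setup the code version would come from Git
# and the metric from the evaluation step of the training pipeline.
run_record = {
    "code_version": "1a2b3c4",                 # e.g. the Git commit hash
    "data_version": fingerprint("train.csv"),  # hash of the exact training file used
    "accuracy": 0.91,
}
Path("run_record.json").write_text(json.dumps(run_record, indent=2))
```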

Also, training Data Science models is a resource-heavy process that often requires specific hardware (large amounts of memory, CPU, or GPU) and may even require an entire cluster. And this is very model-specific: each model has different requirements.

With that in mind, we can see that the task at hand needs slightly different abstractions from the Software Development process, but we can use the same principles. Major companies have developed their own abstractions using the CRISP-DM model as a template. But, to keep things simple, I’ll define the stages of Data Science development as:

Exploration

Understanding business needs, testing hypotheses, and understanding the existing datasets and data quality. If there are any changes to be made to the available data, this is the time to trigger them. This is also where requirements should be set; any data preparation needed should be specified in those requirements.

For this stage, we need to create an exploration environment where Data Scientists can experiment without harming other models or data. Many tools, like Databricks and Jupyter Notebooks, deliver this kind of environment and give Data Scientists a great deal of autonomy to explore and experiment.

Development

The exploration code may lead to a possible implementation of a model. But this code is usually written for that exploration task and may not be optimized. Also, without versioning both code and data, there is no way to measure and evaluate model quality over time.

There are tools, like DVC, that help create a fully versioned training pipeline, and they can be combined with pipeline orchestrators like Airflow or Flyte to distribute pipeline execution and ensure scalability.

With those tools we ensure the reproducibility of a training pipeline, which means that, any time you run a given version of a pipeline, the result will be the same, or at least very close, given the non-deterministic nature of some data science algorithms.
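Part of that reproducibility is simply pinning the sources of randomness. A minimal sketch, assuming NumPy and scikit-learn (libraries I am choosing for the example, not ones the article prescribes), fixes every seed the training run depends on:

```python
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 42  # one fixed seed for every source of randomness in the run

random.seed(SEED)
np.random.seed(SEED)

# Toy data just for illustration; a real pipeline would load a versioned dataset.
X = np.random.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)
model = RandomForestClassifier(n_estimators=50, random_state=SEED)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))  # same seeds, same data -> same result
```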

Evaluation

In software development, we must write both the code that does what needs to be done and the tests for it. In Data Science, we need to write code (and gather data) to train and test the model. But there is no deterministic way to ensure that one model is better than another.

In Software Development, the main reason to have CI running tests is to prevent bugs from reaching the production environment. That said, if there is no deterministic way to test a model, how can we prevent a bad model from reaching production?

There are articles more focused on this matter, and it is not the point of this one. To keep it short: within the training pipeline, you gather quality metrics from predictions on a controlled dataset. These metrics must also be versioned along with the code and data. Then you compare them to the metrics of the version currently in production. The model is a good fit for production if its metrics are better than the last one’s, and who defines what “better” means is the Data Scientist writing the code.
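In practice this comparison can be a small gate inside the pipeline. A sketch, assuming the production metrics were stored as JSON by a previous run and that higher accuracy is what the Data Scientist defined as “better” (file name and metric are my assumptions):

```python
import json
from pathlib import Path

def should_promote(candidate_metrics: dict, production_path: str = "production_metrics.json") -> bool:
    """Return True when the candidate model beats the version currently in production."""
    prod_file = Path(production_path)
    if not prod_file.exists():
        return True  # no model in production yet, so anything is an improvement
    production_metrics = json.loads(prod_file.read_text())
    return candidate_metrics["accuracy"] > production_metrics["accuracy"]

# Hypothetical metrics produced by the training pipeline for the new model version.
candidate = {"accuracy": 0.93}
if should_promote(candidate):
    print("Candidate beats production; promote it.")
else:
    print("Candidate is not better; keep the current production model.")
```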

Model Serving and Monitoring

When your model reaches a better result than the one in production, it is time to take this new model to production. There are many complexities surrounding the model deployment process; I’ll just scratch the surface here. There are roughly three ways to deploy a model:

Embedded model

This is the simplest way to serve a model. You just put the serialized model file alongside the application code and load it when the application needs it.
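A minimal sketch of the embedded approach, assuming a scikit-learn-style model serialized with pickle (the file name and feature layout are made up):

```python
import pickle
from pathlib import Path

# The serialized model ships together with the application code,
# e.g. checked into the repository or copied into the build artifact.
MODEL_PATH = Path(__file__).parent / "model.pkl"

with MODEL_PATH.open("rb") as f:
    model = pickle.load(f)  # loaded once, at application start-up

def handle_request(features: list[float]) -> int:
    """Application code calls the model directly, in-process."""
    return int(model.predict([features])[0])
```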

Advantages

  • Easy to implement

Disadvantages

  • There is no way to serve another version of the model without deploying the application as well
  • The responsibility to monitor the health of the model leaks into the application, making it harder to monitor
  • If this model is used in more than one application, this embedded binary is duplicated throughout all those applications, making it even harder to deploy new versions

Serving as Data

This approach pre-scores your entire dataset and saves the results in a key-value store. It is only viable if your dataset is not growing fast, because new inputs will have no prediction until they reach the key-value store.
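A sketch of pre-scoring, using a plain dictionary as a stand-in for the key-value store (in a real setup this would be something like Redis or DynamoDB, which are my assumptions, not something the article prescribes):

```python
# Batch job, run at deployment (CD) time: score every known entity up front.
def batch_score(model, customers: dict[str, list[float]]) -> dict[str, float]:
    """Pre-compute one score per customer id and return the key-value mapping."""
    return {customer_id: float(model.predict([features])[0])
            for customer_id, features in customers.items()}

# Serving side: a prediction is just a key lookup, roughly O(1).
def get_score(store: dict[str, float], customer_id: str):
    return store.get(customer_id)  # None for inputs that were never scored

# Trivial "model" so the sketch is self-contained and runnable.
class MeanModel:
    def predict(self, rows):
        return [sum(r) / len(r) for r in rows]

store = batch_score(MeanModel(), {"cust-1": [0.2, 0.4], "cust-2": [0.9, 0.7]})
print(get_score(store, "cust-1"))    # pre-computed score
print(get_score(store, "cust-999"))  # None: a totally new input has no answer
```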

Advantages

  • Easy to monitor, as the scores are generated at deployment (CD) time
  • Easy to deploy improvements without breaking interface
  • Prediction performance tends to be O(1), depending on the storage

Disadvantages

  • Because the model is served as data, there is tight coupling between data and application, making it hard to deploy new versions of the model
  • Only works when you have a limited set of inputs to predict; when the input is totally new, you will not be able to return a result

Wrapped in a service

This approach wraps the model in a facade layer. This layer exposes the same interface as the model and can be used by every application that needs it. All health metrics are encapsulated in this facade application.
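A sketch of such a facade using Flask (Flask, the route, and the payload shape are my choices for the example, not something the article prescribes):

```python
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # same serialized model, now behind a service
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]
    # A real facade would also record health, latency, and business metrics here.
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(port=8080)
```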

Advantages

  • Enables real time monitoring
  • Loose coupling with applications, and it is easier to release new versions (any API versioning scheme applies here)
  • Easy to deploy improvements without breaking interface

Disadvantages

  • Hard to implement service applications
  • Depending on the model, it may lead to performance issues

As you can see, choosing the type of deployment really requires analyzing the scenario; each scenario has a deployment approach that fits it best. There are also tools to help you deploy a model, like AWS SageMaker and MLflow, which can also be used within Databricks.
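For instance, with MLflow the training pipeline can log the model and its metrics so a specific, versioned artifact is what gets served later. A minimal sketch, assuming scikit-learn and MLflow’s default local tracking (the dataset and metric here are purely illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data just to make the sketch self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.log_metric("accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # versioned artifact that can later be served
```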

MLOps

All those tools are good at what they are supposed to do; MLOps comes in when you connect all of those tools and concepts to build a fast-paced, productive, and reliable Data Science development process.
