How CI/CD is different for data science

Agile programming is the most widely used methodology that lets development teams release their software into production frequently, so they can gather feedback and refine the underlying requirements. For agile to work in practice, however, teams need processes that build the revised application and release it into production automatically. These processes are generally known as continuous integration/continuous deployment, or CI/CD. By regularly involving the actual users and iteratively incorporating their feedback, CI/CD lets software teams build complex applications without running the risk of missing the initial requirements.

Data science faces similar challenges. Although the risk of data science teams missing the initial requirements is less of a threat right now (this will change in the coming decade), the difficulty of automatically deploying data science into production brings many data science projects to a grinding halt. First, IT too often needs to be involved to put anything into the production system. Second, validation is typically an unspecified, manual task (if it exists at all). And third, reliably updating a production data science process is often so difficult that it is treated as an entirely new project.

What can data science learn from software development? Let's first look at the main aspects of CI/CD in software development, then dive deeper into where things are similar and where data scientists need to take a different turn.

CI/CD in software development

Repeatable production processes for software development have been around for a while, and continuous integration/continuous deployment is the de facto standard today. Large-scale software development usually follows a highly modular approach. Teams work on parts of the code base and test those modules independently (usually using highly automated test cases for those modules).

During the continuous integration phase of CI/CD, the different parts of the code base are plugged together and, again automatically, tested in their entirety. This integration job is ideally done frequently (hence "continuous") so that side effects that do not affect an individual module but break the overall application can be found instantly. In an ideal scenario, with complete test coverage, we can be sure that problems caused by a change in any of our modules are caught almost instantaneously. In reality, no test setup is complete, and the full integration tests might run only once each night. But we can try to get close.

The second part of CI/CD, continuous deployment, refers to moving the newly built application into production. Updating tens of thousands of desktop applications every minute is hardly feasible (and the deployment processes are more complicated). But for server-based applications, with increasingly available cloud-based tools, we can roll out changes and complete updates much more frequently; we can also revert quickly if we end up rolling out something buggy. The deployed application then needs to be monitored continuously for possible failures, but that tends to be less of an issue if the testing was done well.
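The roll-out-and-revert idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the Server class is an in-memory stand-in for a real deployment target, and is_healthy stands in for whatever monitoring check a team actually uses. It is not the API of any particular cloud tool.

```python
from typing import Callable, List


class Server:
    """Hypothetical in-memory stand-in for a server-based deployment target."""

    def __init__(self, initial_version: str) -> None:
        self.history: List[str] = [initial_version]

    @property
    def live_version(self) -> str:
        return self.history[-1]

    def release(self, version: str) -> None:
        self.history.append(version)

    def rollback(self) -> None:
        # Never drop the last known version.
        if len(self.history) > 1:
            self.history.pop()


def deploy_with_rollback(
    server: Server, version: str, is_healthy: Callable[[str], bool]
) -> bool:
    """Release a version; revert to the previous one if the health check fails."""
    server.release(version)
    if is_healthy(server.live_version):
        return True
    server.rollback()
    return False


server = Server("1.0")
good = deploy_with_rollback(server, "1.1", lambda v: True)
bad = deploy_with_rollback(server, "1.2", lambda v: v != "1.2")  # simulated buggy release
print(good, bad, server.live_version)
```

After the simulated buggy release, the server is back on the last healthy version, which mirrors the quick revert described above.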

Copyright © 2021 IDG Communications, Inc.
