This final article in my series on reducing the time to value of your projects (see part 1, part 2 and part 3) takes a less implementation-led approach and instead focusses on the best practices of developing code. Instead of detailing what and how to code explicitly, I want to talk about how you should approach the development of projects in general, which underpins everything that has been covered previously.
Introduction
Being a data scientist involves bringing together lots of different disciplines and applying them to drive value for a business. The most commonly prized skill of a data scientist is the technical ability to produce a trained model ready to go live. This covers a wide range of required knowledge such as exploratory data analysis, feature engineering, data transformations, feature selection, hyperparameter tuning, model training and model evaluation. Learning these steps alone is a significant undertaking, especially in the constantly evolving world of Large Language Models and Generative AI. Data scientists could devote all their learning to becoming technical powerhouses, knowing the inner workings of the most advanced models.
While being technically proficient is important, there are other skills that should be developed if you want to be a truly great data scientist. Chief amongst these is being a good software developer. Being able to write robust, flexible and scalable code is just as important, if not more so, than knowing all the latest techniques and models. Lacking these software skills will allow bad practices to creep into your work and you will end up with code that may not be suitable for production. Embracing software development principles will give you a structured way of ensuring your code is high quality and will speed up the overall project development process.
This article will serve as a brief introduction to topics that multiple books have been written about. As such, I do not expect this to be a comprehensive breakdown of everything in software development; instead I want it to be merely a starting point in your journey towards writing clean code that helps drive value for your business.
Set Up Your DevOps Platform Properly
All data scientists are taught to use Git as part of their education to carry out tasks such as cloning repositories, creating branches, and pulling / pushing changes. These tend to be backed by platforms such as GitHub or GitLab, and data scientists are often content to use these purely as a place to store code remotely. However, they have significantly more to offer as fully fledged DevOps platforms, and using them as such will greatly improve your coding experience.
Assigning Roles To Team Members In Your Repository
Many people will want or need to access your project repository for different purposes. As a matter of security, it is good practice to limit how each person can interact with it. The roles that people can take typically fall into categories such as:
- Analyst: Only needs to be able to read the repository
- Developer: Needs to be able to read and write to the repository
- Maintainer: Needs to be able to edit repository settings
In a data science team, the more senior members of staff on the project should typically be maintainers and the more junior members developers. This becomes important when deciding who can merge changes into production.
Managing Branches
When developing a project with Git, you will make extensive use of branches to add features and develop functionality. Branches can be split into different categories such as:
- main/master: Used for official production releases
- development: Used to bring together features and functionality
- features: Used when developing new code and functionality
- bugfixes: Used for minor fixes
The main and development branches are special as they are permanent and represent the work that is closest to production. As such special care must be taken with these, namely:
- Ensure they cannot be deleted
- Ensure they cannot be pushed to directly
- They can only be updated via merge requests
- Limit who can merge changes into them
We can and should protect these branches to enforce the above. This is normally the job of project maintainers.
When deciding merge strategies for adding to development / main we need to consider:
- Who is allowed to trigger and approve these merges (specific roles / people?)
- How many approvals are required before a merge is accepted?
- What checks does a branch need to pass to be accepted?
In general, we may have less strict controls for updating development than for updating main, but it is important to have a consistent strategy in place.
When dealing with feature branches you need to consider:
- What will the branch be called?
- What is the structure to the commit messages?
What is important is to agree as a team the guidelines for naming branches. Some examples could be to name them after a ticket, to have a common list of prefixes to start a branch with or to add a suffix at the end to easily identify the owner. For the commit messages, you may want to use a 3rd party library such as Commitizen to enforce standardisation across the team.
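As an illustration only, a team convention might look something like the following, where the ticket numbers and prefixes are purely hypothetical and the last two lines follow the Conventional Commits style that tools such as Commitizen can enforce:
feature/DS-123-add-outlier-clipping
bugfix/DS-204-fix-date-parsing
feat(preprocessing): add outlier clipping to the training pipeline
fix(io): handle missing dates when parsing raw extracts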
Maintain a Consistent Development Environment
Taking a step back, developing code will require you to:
- Have access to the programming language's software development kit
- Install 3rd party libraries to develop your solution
Even at this point care must be taken. It is all too common to run into the scenario where solutions that work locally fail when another team member tries to run them. This is caused by inconsistent development environments where:
- Different versions of the programming language are installed
- Different versions of the 3rd party libraries are installed
Having everyone develop within the same environment, one that replicates the production conditions, means there will be no compatibility issues between developers, the solution will work in production, and there is no need for ad-hoc installation of libraries. Some recommendations are:
- Use a requirements.txt / pyproject.toml at a minimum (an example is sketched below). No pip installing libraries on the fly!
- Look into using docker / containerisation to have fully shippable environments
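As a minimal sketch of the first recommendation, a pinned requirements.txt might look like the following (the packages and versions are purely illustrative):
# requirements.txt - pin exact versions so every environment resolves identically
pandas==2.1.4
scikit-learn==1.3.2
matplotlib==3.8.2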
Without these standardisations in place, there is no guarantee that your solution will work when deployed into production.
Readme.md
A readme is the first thing seen when you open a project on your DevOps platform. It gives you an opportunity to provide a high-level summary of your project and informs your audience how to interact with it. Some important sections to put in a readme are:
- Project title, description and setup to get people onboarded
- How to run / use the project so people can use any core functionality and interpret the results
- Contributors / point of contact for people to follow up with
A readme doesn’t need to be extensive documentation of everything relevant to a project, merely a quick start guide. More detailed background, experimental results etc. can be hosted somewhere else, such as an internal wiki like Confluence.
Test, Test And Test Some More!
Anyone can write code, but not everyone can write correct and maintainable code. Ensuring that your code is bug free is critical, and every precaution should be taken to mitigate the risk of defects. The simplest way to do this is to write tests for whatever code you develop. There are different varieties of tests you can write, such as:
- Unit tests: Test individual components
- Integration tests: Test how the individual components work together
- Regression tests: Test that any new changes haven’t broken existing functionality
Writing a good unit test relies on a well-written function. Functions should try to adhere to principles such as Do One Thing (DOT) or Don’t Repeat Yourself (DRY) to ensure that you can write clear tests. In general you should test to:
- Show the function working
- Show the function failing
- Trigger any exceptions raised within the function
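As a minimal sketch of these three cases, suppose we have a simple preprocessing helper (the function and tests here are purely illustrative); a pytest-style test module might look like:
import pytest


def scale_to_unit_range(values: list[float]) -> list[float]:
    """Scale a list of numbers so they fall between 0 and 1."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_scale_to_unit_range_works():
    # Show the function working on a typical input
    assert scale_to_unit_range([0.0, 5.0, 10.0]) == [0.0, 0.5, 1.0]


def test_scale_to_unit_range_handles_constant_input():
    # Show the function not failing on an awkward edge case
    assert scale_to_unit_range([3.0, 3.0]) == [0.0, 0.0]


def test_scale_to_unit_range_raises_on_empty_input():
    # Trigger the exception raised within the function
    with pytest.raises(ValueError):
        scale_to_unit_range([])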
Another important aspect to consider is how much of your code is tested, i.e. the test coverage. While achieving 100% coverage is the idealised scenario, in practice you may have to settle for less, which is okay. This is common when you are coming into an existing project where standards haven’t been properly maintained. The important thing is to start with a coverage baseline and then try to increase it over time as your solution matures. This will involve some technical debt work to get the tests written.
pytest --cov=src/ --cov-fail-under=20 --cov-report term --cov-report xml:coverage.xml --junitxml=report.xml tests
This example pytest invocation both runs the tests and checks that a minimum level of coverage has been attained.
Code Reviews
The single most important part of writing code is having it reviewed and approved by another developer. Having code looked at ensures:
- The code produced answers the original question
- The code meets the required standards
- The code uses an appropriate implementation
Code reviewing data science projects may involve extra steps due to their experimental nature. While this is far from an exhaustive list, some general checks are:
- Does the code run?
- Is it tested sufficiently?
- Are appropriate programming paradigms and data structures used?
- Is the code readable?
- Is the code maintainable and extensible?
def bad_function(keys, values, specific_key, new_value):
    for i, key in enumerate(keys):
        if key == specific_key:
            values[i] = new_value
    return keys, values
The above code snippet highlights a variety of bad habits, such as using parallel lists instead of a dictionary and having no type hints or docstrings; a cleaner rewrite is sketched after the checklist below. From a data science perspective you will additionally want to check:
- Are notebooks used sparingly and commented appropriately?
- Has the analysis been communicated sufficiently (e.g. graphs labelled, dataframes described etc.)?
- Has care been taken when producing models (no data leakage, only using features available at inference etc.)?
- Are any artefacts produced and are they stored appropriately?
- Are experiments carried out to a high standard, e.g. set out with a research question, tracked and documented?
- Are there clear next steps from this work?
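As a rough sketch of the cleaner rewrite mentioned above (the exact signature is illustrative), the same behaviour can be expressed with a dictionary, type hints and a docstring:
def update_value(data: dict[str, float], key: str, new_value: float) -> dict[str, float]:
    """Return a copy of data with key set to new_value.

    Raises:
        KeyError: if key is not present in data.
    """
    if key not in data:
        raise KeyError(f"{key!r} not found")
    updated = dict(data)  # avoid mutating the caller's dictionary
    updated[key] = new_value
    return updated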
There will come a time when you move off the project onto other things, and someone else will take over. When writing code you should always ask yourself:
How easy would it be for someone to understand what I have written and be comfortable with maintaining or extending functionality?
Use CICD To Automate The Mundane
As projects grow in size, both in people and code, having checks and standards becomes more and more important. This is typically done through code reviews and can involve tasks like checking:
- Implementation
- Testing
- Test Coverage
- Code Style Standardisation
We additionally want to check security concerns such as exposed API keys / credentials or code that is vulnerable to malicious attack. Having to manually check all of these for each code review can quickly become time consuming and could also lead to checks being overlooked. A lot of these checks can be covered by 3rd party libraries such as:
- Black, Flake8 and isort
- Pytest
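For illustration, running these checks by hand before every review might look something like the following (assuming your code lives in src/ and your tests in tests/):
black src/ tests/
isort src/ tests/
flake8 src/ tests/
pytest tests/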
While this alleviates some of the reviewer's work, there is still the problem of having to run these libraries yourself. What would be better is the ability to automate these checks and others so that you no longer have to. This can allow code reviews to be more focussed on the solution and implementation. This is exactly where Continuous Integration / Continuous Deployment (CICD) comes to the rescue.
There are a variety of CICD tools available (GitLab Pipelines, GitHub Actions, Jenkins, Travis etc.) that allow the automation of tasks. We could go further and automate tasks such as building environments and even training / deploying models. While CICD can encompass the whole software development process, I hope I have motivated some useful examples for its use in improving data science projects.
Conclusion
This article concludes a series where I have focussed on how we can reduce the time to value for data science projects by being more rigorous in our code development and experimentation strategies. This final article has covered a wide range of topics related to software development and how they can be applied within a data science context to improve your coding experience. The key areas focussed on were leveraging DevOps platforms to their full potential, maintaining a consistent development environment, the importance of readmes, testing and code reviews, and automating the mundane through CICD. All of these will ensure that you develop software that is robust enough to support your data science projects and provide value to your business as quickly as possible.