Python is the programming language of choice for running analysis, building models and running machine learning services in production for the Data Science team at FreeAgent. A key reason we chose Python is the great ecosystem of packages available: NumPy, pandas, SciPy and scikit-learn; deep learning frameworks like TensorFlow; and more specialised packages for specific tasks, like Click for developing CLIs.
This wealth of options is a great strength of Python but also leads to difficulty. Managing dependencies in Python can be tricky, especially when you have multiple environments to work with (local development, cloud-based jobs and production deployments).
Our setup at FreeAgent
In the second half of last year, we made a concerted effort to rationalise how we manage Python. The goal was to make it easy to get up and running with Python, whether in general or for any of our Data Science projects, and to ensure that our environments were reproducible across different members of the team. The toolchain that got us to this point consists of three parts:
The first two tools allow us to ensure that the install of Poetry is effectively isolated from the system Python and that we can easily manage which version of Python we use. While these two tools are important, the real champion of our dependency management system is Poetry itself.
Running poetry new <package-name> creates a folder with the structure of a simple Python package, including a pyproject.toml file.
Poetry uses the pyproject.toml file to define the dependencies of your project, and you can manually add dependencies (with version constraints) under the [tool.poetry.dependencies] section. Alternatively, you can specify the dependencies interactively with poetry add <package> (with version constraints if you want).
The poetry add command gives you the first glimpse of the true power of Poetry. Rather than immediately installing the requested package and its dependencies, Poetry first runs its own dependency resolver. The resolver looks for a version of the package that satisfies the requirements without conflicting with the packages already in the project. If it cannot find such a version, it gives a helpful error message that should let you sort out the problem.
By way of an example, let's suppose I have a simple project that has NumPy version 1.15.4 and Matplotlib version 3.3.4 installed. Now, there is some functionality in the latest SciPy release (1.6.1 at the time of writing) that I really want to use. I run poetry add scipy~1.6 and get the following message:
    SolverProblemError

    Because no versions of scipy match >1.6,<1.6.1 || >1.6.1,<1.7
     and scipy (1.6.0) depends on numpy (>=1.16.5), scipy (>=1.6,<1.6.1 || >1.6.1,<1.7) requires numpy (>=1.16.5).
    And because scipy (1.6.1) depends on numpy (>=1.16.5), scipy (>=1.6,<1.7) requires numpy (>=1.16.5).
    So, because example-env depends on both numpy (~1.15) and scipy (~1.6), version solving failed.
This immediately tells me what went wrong and how I can fix it: I need to update the NumPy version constraint. I can do this with poetry add numpy~1.20 to get up to date, and then check that the tests still pass. After that, poetry add scipy~1.6 works without a hitch!
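To make the conflict concrete, here is a toy sketch of the arithmetic behind that error message. It is not Poetry's resolver: the function names are hypothetical, it only understands plain numeric versions such as 1.16.5, and it only checks one tilde constraint against one minimum version. It relies on the fact that a tilde constraint with a major and minor version allows patch releases only, so ~1.15 means >=1.15,<1.16.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn '1.16.5' into (1, 16, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def tilde_range(constraint: str) -> tuple[tuple[int, ...], tuple[int, ...]]:
    """Expand '~1.15' into its (inclusive lower, exclusive upper) bounds."""
    lower = parse(constraint.lstrip("~"))
    if len(lower) == 1:
        # '~1' allows any 1.x release: >=1,<2
        upper = (lower[0] + 1,)
    else:
        # '~1.15' and '~1.15.4' allow patch releases only: <1.16
        upper = (lower[0], lower[1] + 1)
    return lower, upper


def can_satisfy(tilde: str, minimum: str) -> bool:
    """Can any version allowed by `tilde` also be >= `minimum`?"""
    _, upper = tilde_range(tilde)
    return parse(minimum) < upper


if __name__ == "__main__":
    # SciPy 1.6 needs numpy >=1.16.5, but the project pins numpy ~1.15.
    print(can_satisfy("~1.15", "1.16.5"))  # False
    # After relaxing the project constraint to numpy ~1.20 the two overlap.
    print(can_satisfy("~1.20", "1.16.5"))  # True
```

The first check fails because every version allowed by ~1.15 is below 1.16, which is exactly the contradiction the solver reports.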
Part of the poetry add process is writing the exact versions chosen by the resolver to a lockfile, poetry.lock. This file should be checked into version control, as it ensures that the same versions of packages are installed across environments. Someone checking out the project for the first time just needs to run poetry install and exactly the same versions of all the packages will be installed. This is the key to why Poetry is so good across multiple environments.
A corollary of the dependency resolver and lockfile is that Poetry is a great way to keep transitive dependencies in check. Because the lockfile pins every package, including the dependencies of your dependencies, a new upstream release cannot change what gets installed until you choose to update. Poetry also has a command to help you understand the dependency graph of your project, so you can see why each package is installed: poetry show --tree. Here is an example output from the simple project with NumPy, SciPy and Matplotlib installed:
    ❯ poetry show --tree
    matplotlib 3.3.4 Python plotting package
    ├── cycler >=0.10
    │   └── six *
    ├── kiwisolver >=1.0.1
    ├── numpy >=1.15
    ├── pillow >=6.2.0
    ├── pyparsing >=2.0.3,<2.0.4 || >2.0.4,<2.1.2 || >2.1.2,<2.1.6 || >2.1.6
    └── python-dateutil >=2.1
        └── six >=1.5
    numpy 1.20.1 NumPy is the fundamental package for array computing with Python.
    scipy 1.6.1 SciPy: Scientific Library for Python
    └── numpy >=1.16.5
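Reading the tree answers the question "why is this package installed?". As a small sketch of that reasoning, the code below copies the edges from the output above into a dictionary by hand and lists every chain from a top-level package down to a given dependency. The names DEPENDS_ON, TOP_LEVEL and paths_to are hypothetical, nothing here calls Poetry, and the walk assumes the graph has no cycles.

```python
# Edges transcribed by hand from the `poetry show --tree` output above.
DEPENDS_ON = {
    "matplotlib": ["cycler", "kiwisolver", "numpy", "pillow",
                   "pyparsing", "python-dateutil"],
    "cycler": ["six"],
    "python-dateutil": ["six"],
    "scipy": ["numpy"],
    "numpy": [],
}
# The packages listed at the left margin of the tree output.
TOP_LEVEL = ["matplotlib", "numpy", "scipy"]


def paths_to(target: str) -> list[list[str]]:
    """Return every chain from a top-level package down to `target`."""
    found: list[list[str]] = []

    def walk(package: str, trail: list[str]) -> None:
        trail = trail + [package]
        if package == target:
            found.append(trail)
            return
        for child in DEPENDS_ON.get(package, []):
            walk(child, trail)

    for root in TOP_LEVEL:
        walk(root, [])
    return found


if __name__ == "__main__":
    for chain in paths_to("six"):
        print(" -> ".join(chain))
```

For six this prints two chains, both starting at matplotlib, which shows it is only present because Matplotlib's own dependencies need it.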
So, why Poetry?
We want to ensure that our Python environments are consistent across different data scientists’ local development environments, our cloud-based jobs and any production environments that we have. While this is achievable in many different ways, Poetry makes it simple. Since rolling it out, we have used Poetry in our Bank Transaction Classification work and across multiple ad hoc investigations. The response from the team has been positive – it has never been easier to move between different projects!