Managing Python dependencies across multiple Data Science projects with Poetry

Posted on March 5, 2021


Python is the programming language of choice for running analysis, building models and running machine learning services in production for the Data Science team at FreeAgent. A key reason we chose Python is the great ecosystem of packages available: NumPy, pandas, SciPy and scikit-learn, deep learning frameworks like TensorFlow and more bespoke options for specific tasks like Click for developing CLIs.

This wealth of options is a great strength of Python but also leads to difficulty. Managing dependencies in Python can be tricky, especially when you have multiple environments to work with (local development, cloud-based jobs and production deployments).

Our setup at FreeAgent

In the second half of last year, we made a concerted effort to rationalise how we manage Python. The goal was to make it easy to get up and running with Python, either in general or for any of our Data Science projects, and to ensure that every member of the team works in the same environment. The toolchain that got us to this point consists of three parts:

The first two tools allow us to ensure that the install of Poetry is effectively isolated from the system Python and that we can easily manage which version of Python we use. While these two tools are important, the real champion of our dependency management system is Poetry itself.

Using Poetry

Right away, Poetry offers a very intuitive interface similar to Bundler and Yarn, which are used elsewhere in FreeAgent to manage Ruby and JavaScript dependencies. You can create a new project that uses Poetry with poetry new <package-name>, which creates a folder with the structure of a simple Python package, including a pyproject.toml.

Poetry uses the pyproject.toml to define the dependencies of your project and you can manually add dependencies (with version constraints) under the [tool.poetry.dependencies] section. Alternatively, you can specify the dependencies interactively with poetry add <package> (with version constraints if you want).

Using poetry add gives you the first glimpse of the true power of Poetry. Rather than immediately installing the requested package and its dependencies, Poetry first runs its own dependency resolver. The resolver looks for a version of the package that satisfies the requested constraint and is compatible with the constraints already declared in the project. If it cannot find one, it gives a helpful error message that should let you sort out the problem.

By way of an example, let’s suppose I have a simple project that has NumPy version 1.15.4 and Matplotlib version 3.3.4 installed. Now, there is some functionality in the latest SciPy release (1.6.1 at the time of writing) that I really want to use. I run poetry add scipy~1.6 and get the following message:

SolverProblemError
  Because no versions of scipy match >1.6,<1.6.1 || >1.6.1,<1.7
   and scipy (1.6.0) depends on numpy (>=1.16.5), scipy (>=1.6,<1.6.1 || >1.6.1,<1.7) requires numpy (>=1.16.5).
  And because scipy (1.6.1) depends on numpy (>=1.16.5), scipy (>=1.6,<1.7) requires numpy (>=1.16.5).
  So, because example-env depends on both numpy (~1.15) and scipy (~1.6), version solving failed.

This immediately tells me what went wrong and how I can fix it: I need to update the NumPy version constraint. I can do this with poetry add numpy~1.20 to get up to date, and then check that the tests still pass. After that, poetry add scipy~1.6 works without a hitch!
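
To make the conflict in the error message concrete, here is a minimal Python sketch of how a tilde constraint expands into a version range. It is an illustration only and not Poetry's own implementation: the helper names (parse_version, tilde_range, tilde_allows) are made up for this example, and it handles only plain numeric versions, not pre-releases or other specifiers.

```python
from typing import Tuple

Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn '1.16.5' or '1.6' into a three-part tuple, padding with zeros."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def tilde_range(text: str) -> Tuple[Version, Version]:
    """Expand a tilde constraint such as '~1.15' into (lower, upper).

    The lower bound is inclusive and the upper bound is exclusive.
    """
    spec = text.lstrip("~")
    lower = parse_version(spec)
    if spec.count(".") == 0:
        upper = (lower[0] + 1, 0, 0)         # ~1 allows >=1.0.0,<2.0.0
    else:
        upper = (lower[0], lower[1] + 1, 0)  # ~1.15 allows >=1.15.0,<1.16.0
    return lower, upper


def tilde_allows(constraint: str, version: str) -> bool:
    """Check whether a plain numeric version falls inside a tilde range."""
    lower, upper = tilde_range(constraint)
    return lower <= parse_version(version) < upper


numpy_lower, numpy_upper = tilde_range("~1.15")
scipy_needs_numpy = parse_version("1.16.5")  # taken from the error message

# True when every version allowed by ~1.15 is older than what SciPy 1.6 needs.
conflict = numpy_upper <= scipy_needs_numpy

print(numpy_lower, numpy_upper)
print(conflict)
print(tilde_allows("~1.20", "1.20.1"))
```

The constraint ~1.15 stops just short of 1.16.0, while SciPy 1.6 needs NumPy 1.16.5 or later, so no single NumPy version can satisfy both. Relaxing the NumPy constraint to ~1.20 removes the overlap problem.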

Part of the poetry add process is writing the resolved package versions to a lockfile, poetry.lock. This file should be checked into version control, as it ensures that the same versions of packages are installed across environments. Someone checking out the project for the first time just needs to run poetry install and the same packages, at the same versions, will be installed. This is the key to why Poetry works so well across multiple environments.
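
As a rough illustration of what "the same versions across environments" means in practice, the Python sketch below compares a set of pinned versions with what an environment actually contains. It does not read poetry.lock and is not part of Poetry: the function names, the pinned dictionary and the "other machine" data are all made up for the example, with version numbers borrowed from the example project in this post.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple


def current_environment(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Look up the installed version of each package, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_drift(
    pinned: Dict[str, str], installed: Dict[str, Optional[str]]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (pinned, installed)} for every package that differs."""
    drift: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, wanted in pinned.items():
        actual = installed.get(name)
        if actual != wanted:
            drift[name] = (wanted, actual)
    return drift


# Hypothetical pins for the example project.
pinned = {"numpy": "1.20.1", "scipy": "1.6.1", "matplotlib": "3.3.4"}

# A made-up environment that was set up by hand rather than from the lockfile.
other_machine = {"numpy": "1.19.5", "scipy": "1.6.1"}

drift = find_drift(pinned, other_machine)
print(drift)
```

To check a real environment, find_drift(pinned, current_environment(pinned)) would compare the pins against the packages installed for the running interpreter. An environment created with poetry install from a committed lockfile should report no drift.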

A corollary of the dependency resolver and lockfile is that Poetry is a great way to keep transitive dependencies in check. If you install the dependencies through Poetry, then an upstream change to a dependency of your dependency cannot cause unexpected issues, because its version is pinned in the lockfile too. Poetry also has a command that displays the dependency graph of your project so you can see why each package is installed: poetry show --tree. Here is an example output from the simple project with NumPy, SciPy and Matplotlib installed:

❯ poetry show --tree
matplotlib 3.3.4 Python plotting package
├── cycler >=0.10
│   └── six *
├── kiwisolver >=1.0.1
├── numpy >=1.15
├── pillow >=6.2.0
├── pyparsing >=2.0.3,<2.0.4 || >2.0.4,<2.1.2 || >2.1.2,<2.1.6 || >2.1.6
└── python-dateutil >=2.1
    └── six >=1.5
numpy 1.20.1 NumPy is the fundamental package for array computing with Python.
scipy 1.6.1 SciPy: Scientific Library for Python
└── numpy >=1.16.5
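
The question this tree answers, "why is this package installed?", can be sketched in a few lines of Python. The example below hard-codes the requirements shown in the output above and walks them from each top-level package. It is only an illustration of the idea, not how Poetry builds the tree; the names requires, top_level and paths_to are made up, and the walk assumes the graph has no cycles.

```python
from typing import Dict, List

# Direct requirements copied from the `poetry show --tree` output above.
requires: Dict[str, List[str]] = {
    "matplotlib": [
        "cycler",
        "kiwisolver",
        "numpy",
        "pillow",
        "pyparsing",
        "python-dateutil",
    ],
    "cycler": ["six"],
    "python-dateutil": ["six"],
    "scipy": ["numpy"],
    "numpy": [],
}

# The packages listed at the top level of the tree.
top_level = ["matplotlib", "numpy", "scipy"]


def paths_to(target: str) -> List[List[str]]:
    """Return every chain of requirements from a top-level package to target."""
    found: List[List[str]] = []

    def walk(package: str, path: List[str]) -> None:
        path = path + [package]
        if package == target:
            found.append(path)
            return
        for child in requires.get(package, []):
            walk(child, path)

    for root in top_level:
        walk(root, [])
    return found


for chain in paths_to("six"):
    print(" -> ".join(chain))
```

Here six is never requested directly: it arrives through Matplotlib, once via cycler and once via python-dateutil.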

So, why Poetry?

We want to ensure that our Python environments are consistent across different data scientists’ local development environments, our cloud-based jobs and any production environments that we have. While this is achievable in many different ways, Poetry makes it simple. Since rolling it out, we have used Poetry in our Bank Transaction Classification work and across multiple ad hoc investigations. The response from the team has been positive – it has never been easier to move between different projects!
