In full flow: moving from Jenkins to Actions – Part 1

Posted by Daniel Holz on June 10, 2022

At FreeAgent a recent project to move our Continuous Integration/Continuous Delivery (CI/CD) workflows from Jenkins to GitHub Actions has brought some real benefits.

In this post we’ll cover the background of our CI/CD pipelines, why we wanted to change how they run, and how we decided on GitHub Actions. In the next post we’ll cover how we handled the migration, and how we solved the challenges we encountered.

But let’s start with the outcomes of the project:

  • Our slowest deployments got nine minutes faster, as our p95 dropped from 25 to 16 minutes
  • 90% reduction of our monthly infrastructure spend for running tests
  • More reliable test runners, with far less instance interruption and no errors caused by local state
  • Freed up the team to work on other important tasks

We achieved this by migrating all our CI/CD builds from Jenkins pipelines to GitHub Actions workflows, and by using AWS EC2 Fleets to launch fast, ephemeral EC2 instances to run our builds.

Why we decided to make the move

To put the project in context, the main FreeAgent application is a Ruby on Rails monolith with 58,000+ tests (unit, integration and feature specs). We run all the tests every time a branch is pushed, and again whenever a PR is merged and deployed. Running them on a single machine would be far too slow, so we heavily parallelise and optimise the test suite, bringing its run time down from around nine hours to typically three to four minutes.
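As an illustrative sketch (not FreeAgent's actual scheduler), the core idea behind parallelising a suite like this is to deterministically partition spec files across N workers, so each worker knows its slice without coordination. A naive scheme assigns file i to worker i % N; real setups usually balance shards by recorded test timings instead:

```ruby
# Hypothetical partitioning scheme: worker `worker_index` of `worker_count`
# runs every file whose position (in sorted order) matches its index mod N.
# Sorting first makes the assignment stable across all workers.
def specs_for_worker(all_specs, worker_index, worker_count)
  all_specs.sort
           .each_with_index
           .select { |_, i| i % worker_count == worker_index }
           .map { |spec, _| spec }
end

specs = %w[a_spec.rb b_spec.rb c_spec.rb d_spec.rb e_spec.rb]
puts specs_for_worker(specs, 0, 2).inspect  # => ["a_spec.rb", "c_spec.rb", "e_spec.rb"]
```

In practice a timing-aware splitter gets shards much closer to equal wall-clock time, which is what makes a nine-hour suite finish in minutes.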

Outgrowing Jenkins

We’d been running the test suite with Jenkins pipelines for several years. It originally ran on dedicated, on-premises hosts in our data centre. In 2020, the Developer Platform team migrated Jenkins to AWS, as part of FreeAgent’s decision to move our infrastructure from co-located data centres to AWS. Once we’d finished, we had some time to stop and consider if Jenkins was still the best choice for us.

Jenkins did a lot of stuff we really liked:

  • Running builds when changes were made to code
  • Summarising test reports so that trends, like test count and timing changes, were easy to spot over time
  • Keeping developers informed on the status of their builds via Slack messages
  • Distributing jobs over multiple agents
  • Keeping dependencies cached on agents for quick reuse, saving loads of time in builds that would be spent on downloading and installing

But Jenkins also did some things a lot less well:

  • Recovering from the failure of a single agent
    • Our test parallelisation made this worse: other agents would pick up the work, but the build’s overall status would still be FAILURE
    • Recovery involved manually re-running the entire pipeline, and re-running all the tests
  • Persistent state on agents
    • Sometimes a dependency would be poisoned (wrong version or partial installation). Jenkins would schedule many jobs to run on it, and they would all fail
    • We would identify poisoned agents by reading build logs manually, and remove them from Jenkins manually
  • Single point of failure
    • A single Jenkins master instance was responsible for running builds for all repositories
    • When it hit resource limits, all builds would stop running for all projects, and would require immediate urgent intervention from the Developer Platform team
  • Massive pain to test and deploy newer versions of Jenkins
    • It was simple to smoke test a new version on a staging server
    • But still extremely hard to test realistically without performing full builds at the same time as the existing server
  • Different technology stack to the company’s main competency (Ruby)
    • Our team had to be experts on optimising Java VM memory options and Groovy scripting

In addition, our implementation also caused some further complications:

  • Managing multiple workers
    • We used the Amazon EC2 Jenkins plugin to integrate Jenkins and AWS EC2 Spot instances (namely c5.4xlarge). Instances would occasionally ‘escape’ and continue running outside of Jenkins’ oversight
    • To mitigate, we ran a periodic clean-up job to detect and shut down the escaped instances
  • Restrictions on the EC2 instance types we could use
    • The Amazon EC2 Jenkins plugin dictated which instance types our agents could use, via its dependency on a specific version of the AWS SDK. The c5.4xlarge type was the most suitable, and to keep costs down we ran our agents on Spot instances
    • Retail sales events would cause dramatic increases in the use of On Demand c5.4xlarges. Our Jenkins agents would unexpectedly disappear as their instances were terminated when they were reallocated from the Spot to the On Demand market
    • We needed to reconfigure Jenkins agents to switch to using On Demand instances. This involved a cost increase and we would need to remember to switch back in January
  • Difficult to track changes to system settings
    • Lots of options (for example, AMI IDs for our EC2 agents) had to be manually configured, without a code review/sign-off
    • We’d record a list of manual changes in our company wiki in case we needed to roll back
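The clean-up job mentioned above boils down to a simple reconciliation: compare the EC2 instances tagged as agents against the agents Jenkins actually knows about, and terminate the difference. A minimal sketch, with hypothetical tag names and wiring (this is not FreeAgent's actual job):

```ruby
# Pure selection logic: an instance has "escaped" if it is tagged as a
# Jenkins agent but Jenkins no longer has it registered.
require "set"

def escaped_instances(tagged_instance_ids, jenkins_agent_ids)
  known = Set.new(jenkins_agent_ids)
  tagged_instance_ids.reject { |id| known.include?(id) }
end

# Wiring sketch (requires the aws-sdk-ec2 gem and credentials; the
# "role=jenkins-agent" tag is an assumption for illustration):
#
#   ec2 = Aws::EC2::Client.new
#   tagged = ec2.describe_instances(
#     filters: [{ name: "tag:role", values: ["jenkins-agent"] },
#               { name: "instance-state-name", values: ["running"] }]
#   ).reservations.flat_map(&:instances).map(&:instance_id)
#   to_kill = escaped_instances(tagged, jenkins_registered_agent_ids)
#   ec2.terminate_instances(instance_ids: to_kill) unless to_kill.empty?

puts escaped_instances(%w[i-aaa i-bbb i-ccc], %w[i-aaa i-ccc]).inspect
```

Run on a schedule, this keeps stray Spot instances from quietly accruing cost outside Jenkins’ oversight.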

These disadvantages generated a lot of boring, day-to-day issues. Many valuable hours were spent firefighting the issues, rather than building and improving the development platform for our engineering teams.

Most of these issues arose because we ran an older version of Jenkins and its plugins, and upgrading the system would make many of them disappear. So, we considered what would make us more confident in our Jenkins upgrades, and decided the best plan was to take large maintenance windows for upgrades and testing. We would freeze all repositories, preventing jobs from running and deployments from happening, until we had upgraded and verified that the most important jobs were working as expected.

If we only had to do this once, then we’d have scheduled the window and done it as a one-off. However, as we wanted to keep Jenkins and its plugins up to date in the future, and as they frequently release new versions, we would have to schedule a window every few months. That would mean blocking developers from committing and deploying. 

Why did we choose GitHub Actions?

What we really wanted was infrastructure that the Developer Platform team wouldn’t be responsible for: a pool of workers kept reasonably up to date, and a way to schedule jobs onto them.

We’d previously used GitHub Actions to run our static code checks on our main application. It worked just fine, so we considered what it would take to run the entire FreeAgent test suite on Actions. There were a few other considerations:

  • We were already paying for GitHub and using it for Gem hosting
  • GitHub’s documentation made it easier for developers to learn how to write Action workflows in YAML than Jenkins Pipelines in Groovy
  • The GitHub Actions Marketplace was a reasonable alternative to Jenkins Plugins because:
    • We found it easier to read and understand the source for an Action than for a Plugin
    • There is a large and growing ecosystem of Actions from authors solving their own build and test challenges
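To give a flavour of what developers were writing, here is a minimal, hypothetical Actions workflow that fans a test suite out over parallel runners with a matrix strategy (the labels, shard counts and commands are illustrative, not our real configuration):

```yaml
name: Test suite
on: [push]

jobs:
  tests:
    runs-on: self-hosted   # e.g. ephemeral EC2-backed runners
    strategy:
      fail-fast: false     # one failed shard doesn't cancel the others
      matrix:
        shard: [0, 1, 2, 3]
    steps:
      - uses: actions/checkout@v3
      - name: Run this shard of the suite
        run: bin/run-tests --shard ${{ matrix.shard }} --total 4
```

Compared with a Groovy pipeline, the whole fan-out, retry-one-shard-only behaviour is a few lines of declarative YAML.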

We decided to go ahead with GitHub Actions, and try to find blockers that would prevent us from moving over from Jenkins. We liked how easy it was to set up a proof of concept, and our existing build metrics meant we could be reasonably confident about how much it would cost.

In the next post we’ll share how we broke up the migration work, the challenges we encountered and how we solved them, and how we achieved the reduction in costs we outlined earlier.

Daniel Holz is a Software Engineer on the Developer Platform team at FreeAgent, the UK online accounting software made specifically for freelancers, small business owners and their accountants.
