Shaving yaks – problem solving in Dev Platform

Posted on June 19, 2019

Although I usually work in Support Engineering here at FreeAgent, I was recently given the opportunity to spend a six-week cycle working in the Dev Platform team. The technical aspect of the Support Engineer role is what drives me; I love to take a problem, dig into the source code and figure out how to solve it. The work in Dev Platform promised to be even more technical so I jumped at the chance to gain some experience in this team and further my knowledge and skills.

Before this, I didn’t know in any great detail what Dev Platform do on a day-to-day basis. I knew that they’re responsible for looking after Jenkins, which we use for Continuous Integration and Delivery. I also knew that they’re your first port of call if you have any problems with your local development environment and they will help solve any related issues a FreeAgent engineer might come across. In some sense, Dev Platform are like FreeAgent’s internal technical support team. Over the next six weeks I was to learn a great deal about our testing infrastructure, integration environments, releasing gems and the Unix command line, among many other things!

A typical Jenkins pipeline, showing all the stages of the build.

Shaving yaks

I spent my first couple of weeks pairing with Alyssa Ross, an engineer on the team. We were on maintenance duty, responding to any queries from the engineers and solving any problems they might have. The first maintenance task to arrive at our door was apparently simple, but on our journey to complete it we ran into a seemingly endless number of additional issues that needed to be solved first. It was a classic example of Yak Shaving.

And so the story begins…

A deployment of the FreeAgent application had failed due to a network timeout when sending a Slack notification. A failed notification shouldn’t fail a deployment, so we wanted to rescue and log the error instead of letting it abort the deploy. The error was happening in Parachute, an internal gem we use for managing deployments.

An example of a Slack notification sent by Parachute, which is where this story began.
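Parachute's code isn't shown in this post, so what follows is only a minimal sketch of the "rescue and log" idea, not the real implementation. The method name `notify_safely` and the log message are made up for illustration. It relies on the fact that Ruby's `Net::OpenTimeout` and `Net::ReadTimeout` are subclasses of `Timeout::Error`.

```ruby
require "logger"
require "timeout"

# Hypothetical helper: runs the block that sends the notification and
# treats a timeout as a logged warning rather than a fatal error.
# Returns true if the block completed, false if it timed out.
def notify_safely(logger)
  yield
  true
rescue Timeout::Error => e
  # Net::OpenTimeout and Net::ReadTimeout both inherit from Timeout::Error,
  # so network timeouts raised by Net::HTTP are caught here too.
  logger.warn("Slack notification failed: #{e.class}: #{e.message}")
  false
end

# Usage: the deployment carries on whether or not the notification got through.
logger = Logger.new($stderr)
notify_safely(logger) { raise Timeout::Error, "execution expired" }
```

Only timeouts are rescued here, so any other kind of error still surfaces; whether that is the right boundary depends on how much a missed notification matters.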

After cloning the Parachute repo and running the tests, we encountered a failing test caused by a shell command that, although available on the Jenkins server (running on SmartOS), wasn’t available in the local macOS environment. We got a fix for this ready to go, only to find that the tests were now failing because we’d used the ensure keyword in a way that the version of Ruby Parachute was running (2.3.1) didn’t support. We needed to upgrade Parachute to the latest Ruby before continuing.
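The post doesn't show the offending code, so the example below is an assumption about what it might have looked like. `ensure` itself is available in Ruby 2.3, but Ruby 2.5 began allowing `rescue`, `else` and `ensure` directly inside `do ... end` blocks without a wrapping `begin`. On 2.3.1 the second form is a syntax error. Both method names are invented for illustration.

```ruby
# Works on old and new Rubies: ensure lives inside an explicit begin/end.
def cleanup_with_begin(log)
  [1, 2].each do |n|
    begin
      log << "work #{n}"
    ensure
      log << "cleanup #{n}"
    end
  end
  log
end

# Requires Ruby 2.5 or later: ensure attached directly to the do/end block.
def cleanup_without_begin(log)
  [1, 2].each do |n|
    log << "work #{n}"
  ensure
    log << "cleanup #{n}"
  end
  log
end
```

On a recent Ruby the two methods behave identically, which is why a change like this can pass locally and then fail under an older interpreter.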

Updating the Ruby version was simple enough, but in order to test the change we needed to release a “pre” version of the gem to our internal gem server. Releasing a gem used to be a lengthy manual process but we’ve more recently configured Jenkins to automate it for us. However, this automated process relied on the repository name matching the name of the .gemspec file, which for Parachute it didn’t. The next task, therefore, was to improve this behaviour. This involved making a change to the FreeAgent Jenkins pipeline repository, where we keep a large collection of shared methods that the organisation’s repositories load from the Jenkinsfile in each project root. This was particularly interesting for me as it introduced me to a new language, Groovy, in which the pipeline code is written.

The Jenkinsfile for Parachute.
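The pipeline change itself was written in Groovy and isn't reproduced in this post. Purely to illustrate the idea, and in Ruby rather than Groovy, here is one way to locate a gem's .gemspec without assuming it shares the repository's name. `find_gemspec` is a hypothetical helper, not part of the FreeAgent pipeline.

```ruby
# Hypothetical helper: look for the gemspec on disk instead of deriving
# its filename from the repository name.
def find_gemspec(repo_dir)
  candidates = Dir.glob(File.join(repo_dir, "*.gemspec"))

  # Fail loudly if the answer is ambiguous or missing, rather than
  # silently releasing the wrong thing.
  unless candidates.size == 1
    raise ArgumentError,
          "expected exactly one .gemspec in #{repo_dir}, found #{candidates.size}"
  end

  candidates.first
end
```

Insisting on exactly one match is a design choice: it trades a little flexibility for a clear error when a repository is laid out unexpectedly.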

This done, we could now attempt to release the gem with the new version of Ruby. However, we found it was now failing due to our old version of RuboCop not recognising that use of ensure either. Next task: update Parachute to use the most recent version of RuboCop. This was not as simple as you might think, since there were a load of deprecated rules in rubocop.yml which needed to be removed or updated before it would run without errors. I experimented for a few hours with mry, a tool that attempts to automate this process, but couldn’t get it to work, so I decided to cut my losses and edit the file manually. It seemed there was a bug in mry, which would be nice to fix, but that would have to wait for another day.
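To give a flavour of what that manual edit involves, here is a small sketch of the two kinds of change: renaming a rule and dropping one that no longer exists. The cop names `Style/OldCopName`, `Style/NewCopName` and `Style/RemovedCopName` are placeholders, not real RuboCop cops, and this is neither how mry works nor what was done to Parachute's actual config.

```ruby
require "yaml"

# Placeholder mappings: in a real migration these would come from the
# RuboCop changelog for the versions being skipped.
RENAMED_COPS = { "Style/OldCopName" => "Style/NewCopName" }.freeze
REMOVED_COPS = ["Style/RemovedCopName"].freeze

# Returns a new config hash with removed cops dropped and renamed cops
# moved to their new keys; everything else is copied across unchanged.
def migrate_rubocop_config(config)
  config.each_with_object({}) do |(cop, settings), migrated|
    next if REMOVED_COPS.include?(cop)

    migrated[RENAMED_COPS.fetch(cop, cop)] = settings
  end
end

old_config = YAML.safe_load(<<~YAML)
  Style/OldCopName:
    Enabled: false
  Style/RemovedCopName:
    Enabled: true
  Metrics/MethodLength:
    Max: 20
YAML

new_config = migrate_rubocop_config(old_config)
puts new_config.to_yaml
```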

With the failing test sorted out I could now (finally!) start work on fixing the original problem, which, if you remember, was a Slack notification timeout. I created a new branch, wrote some code along with an automated test to describe the new behaviour and had a pull request ready for review in the space of just half an hour. In completing this seemingly simple task we had also fixed a flaky test, upgraded Parachute’s Ruby and RuboCop versions and improved the gem release automation in the FreeAgent Jenkins pipeline.

The actual fix for the original problem was pretty straightforward.

Being thorough and following best practices

In fixing one small problem we uncovered and solved a series of other problems. Although it wasn’t absolutely necessary to fix these other problems to achieve the end goal, we diligently followed best practices and diagnosed and fixed them properly as we found them, improving everything we touched and leaving the world in a better state. It was a deeply satisfying and rewarding process.

We are currently hiring for Dev Platform. If you have an interest in Continuous Integration, “Yak Shaving” and solving technical problems, check out our careers page.