A 12-step guide to AWS cost optimisation

Posted on June 16, 2022

This article outlines the pragmatic approach we’ve followed here at FreeAgent to increase our cost efficiency over our first 18 months of using AWS.

Using this approach, we’ve already cut our AWS spend by 50%, and we estimate we can save another 30% a year by implementing further efficiencies. Here are 12 things we’ve learned along the way.

Our strategy

1. Don’t optimise for cost too early

We fully migrated from colocation facilities to the cloud on 1st December 2020. You can read more about that here.

When we designed our initial architecture and migration, it was just that – initial. We didn’t plan for the most cost-effective solution from the outset, instead prioritising security, reliability and performance.

Why? Complexity and data

Firstly, cost optimisation can add additional complexity and risk to something that’s already complex and fraught with risk. Minimise risk and complexity; they are enemies of success.

Secondly, data – or lack thereof. If you are starting with a new service, migrating a service or making large architectural changes, it can be difficult to predict which cost efficiency efforts will have the greatest impact when you have zero data on how the workload actually performs in practice.

Making decisions with data and real world usage metrics is far easier than trying to predict the usage and how that will materialise on an actual AWS bill.

I’m afraid you can’t cite the often misquoted “premature optimisation is the root of all evil” as a reason to avoid forecasting costs for your workload, or to skip the sound design decisions needed to ensure you can meet budget constraints.

Instead it’s a nudge not to get caught optimising the solution by adding architectural complexity, or expending engineering effort where it may not have the biggest impact on costs. Without data you might just spend time focusing on the wrong part of the puzzle. For example, you might focus on upgrading to an RDS Aurora version that supports slightly cheaper Graviton-based instances, when in reality your largest cost comes from Aurora IOPS, or from RDS snapshots (for which AWS has yet to implement cheaper cold storage…).

2. Have meaningful measures

Ensure you track spend and can differentiate growth from changes in cost efficiency. We don’t use anything fancy for this. We make heavy use of multiple AWS accounts to segregate teams and workloads which also helps with budget management at a high level. We also make use of, yes, spreadsheets.

Define KPIs to measure your effectiveness – the specific KPI might be different for each workload. Some examples we use are below, followed by a sketch of how one might be calculated:

  • Cost per user
  • Cost per CI run
  • Cost per engineer
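
As a rough illustration of the first KPI, the sketch below pulls a month of spend from the Cost Explorer API and divides it by an active user count. It’s a minimal example rather than our actual tooling – the dates and user count are placeholders.

```python
import boto3

# A rough sketch: pull one month of spend from the Cost Explorer API and
# divide it by an active-user count to get a "cost per user" figure.
# The dates and the user count are placeholders.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)

monthly_cost = float(response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])
active_users = 100_000  # placeholder: take this from your own product metrics

print(f"Cost per user: ${monthly_cost / active_users:.4f}")
```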

3. Process not project

Cost optimisation is a process, not a project to be completed. AWS constantly releases new services and features. Aim to implement cost optimisation as part of your ‘business as usual’ operations. 

Don’t do this centrally – ensure anyone responsible for a workload has a process for reviewing cost efficiency and identifying potential actions. Workload owners know the most about their specific workload and are the most likely to be able to spot efficiencies and implement them safely. 

Keep abreast of changes, monitor what’s new at AWS and keep an eye out for what you can take advantage of.

4. Create a prioritised list and review your bill

This is simple: start with the most expensive workloads and services, where any effort expended is likely to yield the largest savings. Work down the list systematically, looking for opportunities.

We maintain a spreadsheet of “cost saving opportunities” and consider the engineering investment, other business priorities and impact before deciding which to tackle and when.
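
A minimal sketch of how such a list might be seeded, grouping last month’s spend by service with the Cost Explorer API (an illustration rather than our actual process):

```python
import boto3

# A rough sketch: group one month of spend by service and sort it, as a
# starting point for a prioritised list of cost saving opportunities.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = response["ResultsByTime"][0]["Groups"]
by_cost = sorted(
    groups,
    key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
    reverse=True,
)

for group in by_cost[:10]:  # the ten most expensive services
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{group['Keys'][0]}: ${amount:,.2f}")
```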

Practical steps

5. Right sizing

It’s important to monitor your resources to ensure they are sized correctly for your workload. Anywhere you select an instance type or size needs review: RDS, Redshift, EC2 instances, ECS task sizes. AWS usually provides the CloudWatch metrics you need to decide whether an instance is right-sized.
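
For example, a minimal sketch that pulls CPU utilisation from CloudWatch to flag a potential downsizing candidate. The instance ID and thresholds are placeholders, and a real decision should also consider memory, IOPS and network.

```python
import boto3
from datetime import datetime, timedelta

# A rough sketch: fetch two weeks of hourly CPU utilisation for an EC2
# instance and flag it if both the average and the peak are low.
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

datapoints = stats["Datapoints"]
if datapoints:
    avg_cpu = sum(d["Average"] for d in datapoints) / len(datapoints)
    peak_cpu = max(d["Maximum"] for d in datapoints)
    if avg_cpu < 20 and peak_cpu < 50:
        print(f"Downsizing candidate: avg {avg_cpu:.1f}%, peak {peak_cpu:.1f}%")
```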

As part of right sizing, check whether you are on the latest instance generation – later generations of instances usually provide greater cost efficiency.

In our case, we also performed some application optimisation where it made sense, which allowed us to downsize instances. For example, our RDS Aurora cluster was underutilised the majority of the time, with a few large, inefficient tasks causing resource spikes. We were able to identify the potential saving and then justify the effort of making application changes with a solid cost/benefit analysis.

While EC2 instances and other compute are the obvious candidates, don’t forget about EBS volumes. Are they sized correctly? Are you using the latest generation, for example gp3 rather than gp2?

6. Turn it off

This is AWS after all. Are there workloads or parts of workloads that can be turned off? Have they become defunct over time, or have you leaked EBS volumes?
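
As one example, a minimal sketch for spotting leaked EBS volumes – volumes left in the ‘available’ state after the instances they were attached to have gone:

```python
import boto3

# A rough sketch: list unattached ("available") EBS volumes, a common source
# of leaked spend after instances are terminated.
ec2 = boto3.client("ec2", region_name="eu-west-1")

pages = ec2.get_paginator("describe_volumes").paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
)

for page in pages:
    for volume in page["Volumes"]:
        print(f"{volume['VolumeId']}: {volume['Size']} GiB, created {volume['CreateTime']}")
```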

Do those staging and integration environments need to run 24×7? We turn on/off integration environments as required, for example. This isn’t a huge saving for us when looking down the bill, but in our case it was low effort to implement. As with everything, this is workload dependent. 

7. EC2 Reserved Instances/Savings Plans

If you’ve completed right sizing and have a workload that allows you to commit to a minimum amount of compute for at least one year, look to make use of Reserved Instances – or now EC2 Savings Plans – which give you AWS discounts in exchange for committing to a minimum spend over one or three years. It’s not just EC2 – many services offer RIs, RDS and ElastiCache, for example.

8. S3

We’ve called this out specifically because S3 is such a common service. It’s easy to start using the service, which is cheap at low volumes, and see the cost creep up over time without realising there are cost-saving options. We make use of S3 lifecycle rules and storage classes, ensuring data that’s no longer needed is removed and using more cost-effective storage classes where we can, such as Standard-IA and Glacier.
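
A minimal sketch of a lifecycle configuration along those lines – the bucket name, prefix and timings are placeholders rather than our actual rules:

```python
import boto3

# A rough sketch: transition objects to Standard-IA after 30 days, to Glacier
# after 90 days, and delete them after a year.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```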

Keep an eye out for AWS Data Transfer Costs – these can add up for transferring large amounts of data out of AWS, or making many small read/write requests.

9. Savings Plans

It’s a good idea to check the AWS docs carefully, possibly twice, because while Reserved Instances, EC2 Savings Plans and Savings Plans are similar, they are subtly different.

After we had made use of Reserved Instances where we could, we implemented Savings Plans. This allowed us to cover AWS ECS Fargate and AWS Lambda with discounts for committed spend. As a bonus, Savings Plans apply to all types of compute so they are easier to make use of over a longer duration.
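
Before committing, Cost Explorer can suggest a commitment level based on recent usage. A minimal sketch – worth checking the response shape against the current boto3 documentation before relying on specific fields:

```python
import boto3

# A rough sketch: ask Cost Explorer for a Compute Savings Plans purchase
# recommendation based on the last 30 days of usage.
ce = boto3.client("ce", region_name="us-east-1")

rec = ce.get_savings_plans_purchase_recommendation(
    SavingsPlansType="COMPUTE_SP",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
    LookbackPeriodInDays="THIRTY_DAYS",
)

# Print the summary (hourly commitment, estimated savings, etc.) for review.
print(rec["SavingsPlansPurchaseRecommendation"].get("SavingsPlansPurchaseRecommendationSummary"))
```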

Some services – CloudFront and SageMaker, for example – have their own savings plans, so look out for these.

10. EC2/Fargate Spot

We make heavy use of EC2 Spot for certain workloads – providing compute for automated testing for example. EC2 Spot can provide huge savings but can require more effort to implement.

You have to be OK with having your workload interrupted. You can make interruptions less frequent by careful use of the options AWS provide, but by definition spot instances can and will be interrupted and will potentially have zero capacity available in some cases.

The savings can be substantial enough that it’s often worth some ‘out of the box’ thinking to see if you can make your application resistant to spot interruption and capacity fluctuations. 

For our CI workload, we’re using EC2 Fleet, with multiple instance types, multiple AZs and a capacity-optimised selection, which is proving much more reliable than our initial naive implementation of spot instances.
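
A minimal sketch of an EC2 Fleet request along these lines – the launch template name, subnets and instance types are placeholders rather than our actual configuration:

```python
import boto3

# A rough sketch: an 'instant' EC2 Fleet request spread over several instance
# types and AZs, using the capacity-optimized Spot allocation strategy.
ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.create_fleet(
    Type="instant",
    SpotOptions={"AllocationStrategy": "capacity-optimized"},
    TargetCapacitySpecification={
        "TotalTargetCapacity": 4,
        "DefaultTargetCapacityType": "spot",
    },
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "ci-runner",  # placeholder
                "Version": "$Latest",
            },
            "Overrides": [
                {"InstanceType": "c5.4xlarge", "SubnetId": "subnet-aaaa1111"},
                {"InstanceType": "c5a.4xlarge", "SubnetId": "subnet-bbbb2222"},
                {"InstanceType": "m5.4xlarge", "SubnetId": "subnet-cccc3333"},
            ],
        }
    ],
)

for launched in response.get("Instances", []):
    print(launched["InstanceIds"])
```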

11. Traffic and NAT Gateways

Traffic is harder to optimise for, and something to consider once you see it becoming a larger line item on your bill. Careful choices about how clusters are deployed, cross-AZ traffic and cross-region traffic all come into play. We’re fortunate that our workloads are not traffic heavy, so this hasn’t been something we’ve had to optimise heavily.

NAT Gateway costs are something to look out for, though – we managed to save a not insubstantial amount by implementing VPC Endpoints for AWS Elastic Container Registry, so container downloads didn’t traverse the NAT Gateway and incur extra charges.

Look out for traffic that uses NAT gateways, and if it’s traffic destined for AWS services, take a look at the cost/benefit of implementing AWS VPC Endpoints for those services.
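
A minimal sketch of creating the relevant endpoints for ECR – the VPC, subnet, security group and route table IDs are placeholders:

```python
import boto3

# A rough sketch: interface endpoints so ECR traffic stays inside the VPC
# rather than traversing the NAT Gateway.
ec2 = boto3.client("ec2", region_name="eu-west-1")

for service in ("com.amazonaws.eu-west-1.ecr.api", "com.amazonaws.eu-west-1.ecr.dkr"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )

# Image layers are served from S3, so a gateway endpoint (which is free) is
# needed as well.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```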

12. Autoscaling

This is often quoted but is something we have yet to implement for our production workload. The reason is we simply had lower-hanging fruit to pick in our cost optimisation process. But it’s definitely something that we’ll be looking towards soon.

In summary

There are no magic bullets, and no third-party wonder tools. However, what has worked for us is being able to measure progress (KPIs) and have access to people who have a thorough understanding of the workload in question.

Take a systematic approach to work through the workloads and services from most to least expensive. For each one, apply the basic options that AWS makes available to increase cost efficiency.

Don’t implement everything for every workload: assess impact versus effort, and don’t forget the opportunity cost of embarking on a large cost optimisation effort.

In full flow: moving from Jenkins to Actions – Part 1

Posted on June 10, 2022

At FreeAgent a recent project to move our Continuous Integration/Continuous Delivery (CI/CD) workflows from Jenkins to GitHub Actions has brought some real benefits.

In this post we’ll cover the background of our CI/CD pipelines, why we wanted to change how they run, and how we decided on GitHub Actions. In the next post we’ll cover how we handled the migration, and how we solved the challenges we encountered.

But let’s start with the outcomes of the project:

  • Our slowest deployments got nine minutes faster, as our p95 dropped from 25 to 16 minutes
  • A 90% reduction in our monthly infrastructure spend for running tests
  • More reliable test runners, with far less instance interruption and no errors caused by local state
  • Freed up the team to work on other important tasks

We achieved this by migrating all our CI/CD builds from Jenkins pipelines to GitHub Actions workflows, and using AWS EC2 Fleets to launch fast, ephemeral EC2 instances to run our builds.

Why we decided to make the move

To put the project in context, the main FreeAgent application is a Ruby on Rails monolith with 58,000+ tests (unit, integration and feature specs). We run all tests every time a branch is pushed, and again whenever a PR is merged and deployed. This would be way too slow to run on a single machine, so we heavily parallelise and optimise the running of the test suite. This takes the test suite time from around nine hours to typically around three to four minutes.
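
As a rough illustration of the general technique – not our actual test-splitting logic – the sketch below deterministically partitions test files across a pool of parallel workers:

```python
import hashlib

# A rough sketch: assign each test file to one of N workers by hashing its
# path, so every worker gets a stable, roughly even share of the suite.
def bucket_for(path: str, total_workers: int) -> int:
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return int(digest, 16) % total_workers

def files_for_worker(test_files, worker_index: int, total_workers: int):
    return [f for f in test_files if bucket_for(f, total_workers) == worker_index]

# Example: worker 3 of 40 picks its share of the spec files.
all_specs = ["spec/models/invoice_spec.rb", "spec/features/banking_spec.rb"]
print(files_for_worker(all_specs, worker_index=3, total_workers=40))
```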

Outgrowing Jenkins

We’d been running the test suite with Jenkins pipelines for several years. It originally ran on dedicated, on-premises hosts in our data centre. In 2020, the Developer Platform team migrated Jenkins to AWS, as part of FreeAgent’s decision to move our infrastructure from co-located data centres to AWS. Once we’d finished, we had some time to stop and consider if Jenkins was still the best choice for us.

Jenkins did a lot of stuff we really liked:

  • Running builds when changes were made to code
  • Summarising test reports so that trends, like test count and timing changes, were easy to spot over time
  • Keeping developers informed on the status of their builds via Slack messages
  • Distributing jobs over multiple agents
  • Keeping dependencies cached on agents for quick reuse, saving loads of time in builds that would be spent on downloading and installing

But Jenkins also did some things a lot less well:

  • Recovering from the failure of a single agent
    • This was made worse by our test parallelisation: other agents would pick up the work, but the build’s overall status would still be FAILURE
    • Recovery involved manually re-running the entire pipeline, and re-running all the tests
  • Persistent state on agents
    • Sometimes a dependency on an agent would be poisoned (wrong version or a partial installation). Jenkins would keep scheduling jobs on that agent, and they would all fail
    • We would identify poisoned agents by reading build logs manually, and remove them from Jenkins manually
  • Single point of failure
    • A single Jenkins master instance was responsible for running builds for all repositories
    • When it hit resource limits, all builds would stop running for all projects, and would require immediate urgent intervention from the Developer Platform team
  • Massive pain to test and deploy newer versions of Jenkins
    • Simple to smoke test new version on a staging server
    • Still extremely hard to test without actually performing full builds at the same time as the existing server
  • Different technology stack to the company’s main competency (Ruby)
    • Our team had to be experts on optimising Java VM memory options and Groovy scripting

In addition, our implementation also caused some further complications:

  • Managing multiple workers
    • We used the Amazon EC2 Jenkins plugin to integrate Jenkins and AWS EC2 Spot instances (namely c5.4xlarge). Instances would occasionally ‘escape’ and continue running outside of Jenkins’ oversight
    • To mitigate, we ran a periodic clean-up job to detect and shut down the escaped instances (a rough sketch of such a job follows this list)
  • Restrictions on the EC2 instance types we could use
    • Amazon EC2 Jenkins plugin dictated which instance types our agent could use, via its dependency on a specific version of the AWS SDK. The c5.4xlarge type was the most suitable for our agents. To keep costs down, we ran our agents on Spot instances
    • Retail sales events would cause dramatic increases in the use of On Demand c5.4xlarges. Our Jenkins agents would unexpectedly disappear as their instances were terminated when they were reallocated from the Spot to the On Demand market
    • We needed to reconfigure Jenkins agents to switch to using On Demand instances. This involved a cost increase and we would need to remember to switch back in January
  • Difficult to track changes to system settings
    • Lots of options (for example, AMI IDs for our EC2 agents) had to be manually configured, without a code review/sign-off
    • We’d record a list of manual changes in our company wiki in case we needed to roll back
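
A rough sketch of the kind of clean-up job described above – the tag name, age threshold and Jenkins lookup are placeholders rather than our actual implementation:

```python
import boto3
from datetime import datetime, timedelta, timezone

def jenkins_known_instance_ids() -> set:
    """Placeholder: ask the Jenkins API which agent instances it is managing."""
    return set()

# A rough sketch: find long-running instances tagged as Jenkins agents and
# terminate any that the Jenkins controller no longer knows about.
ec2 = boto3.client("ec2", region_name="eu-west-1")
cutoff = datetime.now(timezone.utc) - timedelta(hours=6)
known = jenkins_known_instance_ids()

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Role", "Values": ["jenkins-agent"]},  # placeholder tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

escaped = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
    if instance["LaunchTime"] < cutoff and instance["InstanceId"] not in known
]

if escaped:
    ec2.terminate_instances(InstanceIds=escaped)
```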

These disadvantages generated a lot of boring, day-to-day issues. Many valuable hours were spent firefighting the issues, rather than building and improving the development platform for our engineering teams.

Most of these issues arose because we ran an older version of Jenkins and its plugins, and upgrading the system would see a lot of the issues disappear. So, we considered what would make us more confident in our Jenkins upgrades, and decided the best plan was to take large maintenance windows for upgrades and testing. We would freeze all repositories, preventing jobs from running and deployments from happening, until we had upgraded and verified that the most important jobs were working as expected.

If we only had to do this once, then we’d have scheduled the window and done it as a one-off. However, as we wanted to keep Jenkins and its plugins up to date in the future, and as they frequently release new versions, we would have to schedule a window every few months. That would mean blocking developers from committing and deploying. 

Why did we choose GitHub Actions?

What we really wanted was infrastructure that provided a pool of workers, kept them reasonably up to date and gave us a way to schedule jobs on them – all without the Developer Platform team being responsible for it.

We’d previously used GitHub Actions to run our static code checks on our main application. It worked just fine and we considered what we needed to do to run the entire FreeAgent test suite on Actions. There were a couple of other considerations:

  • We were already paying for GitHub and using it for Gem hosting
  • GitHub’s documentation made it easier for developers to learn how to write Action workflows in YAML than Jenkins Pipelines in Groovy
  • GitHub Actions Marketplace was a reasonable alternative to Jenkins Plugins because:
    • We found it easier to read and understand the source for an Action than a Plugin
    • There is a large and growing ecosystem of Actions from authors solving their own challenges with their builds and test runs

We decided to go ahead with GitHub Actions, and try to find blockers that would prevent us from moving over from Jenkins. We liked how easy it was to set up a proof of concept, and our existing build metrics meant we could be reasonably confident about how much it would cost.

In the next post we’ll share how we broke up the migration work, the challenges we encountered and how we solved them, and how we achieved the reduction in costs we outlined earlier.

Daniel Holz is a Software Engineer on the Developer Platform team at FreeAgent, the UK online accounting software made specifically for freelancers, small business owners and their accountants.

How we structure our data teams at FreeAgent

Posted on June 3, 2022

Since joining FreeAgent back in April I’ve been both impressed by and interested in how the Data organisation is structured. I’ve come from an enterprise world where you have lots of Data Engineers, a team of dedicated Data Architects and a separate Business Intelligence org. A few things that immediately struck me at FreeAgent were:

  • No one has the title ‘Data Engineer’
  • Data Analytics are part of the Engineering org
  • No one has the title ‘Data/Platform/Solutions Architect’

What I want to talk about in this post is why, for an organisation like FreeAgent, these are all great features of a Data team. One quick disclaimer is that this post represents the current state at the time of writing (June 2022). Things may well change in the future!

The organisation

To set the scene, it’s worth introducing our current Engineering organisation. Product & Engineering at FreeAgent is split into Product Engineering and Platform Engineering. The Data teams sit within Platform Engineering, with Data Science and Analytics grouped together as well as a separate Data Platform team, which sits within the Architecture group. A rough diagram of this structure is shown below.

This post will focus on the three Data teams: Data Science, Analytics and Data Platform.

Engineering

Given the absence of Data Engineers at FreeAgent, you might wonder who does the Data Engineering. Who extracts data, transforms it and loads it where it needs to be? For us, this work is shared between the Data Science, Analytics and Data Platform teams, as I’ll describe.

There are several data sources we want to ingest, the most important being the FreeAgent app. Ingesting these data is owned by Data Platform, with raw files landing in our S3 Data Lake. From here, a Glue Crawler populates the Data Catalog with information about the data in S3, making it queryable with Athena.
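
As a rough illustration – the database, table and bucket names are placeholders – querying the lake through Athena looks something like this:

```python
import boto3

# A rough sketch: once the Glue Crawler has populated the Data Catalog, the
# S3 Data Lake can be queried with standard SQL via Athena.
athena = boto3.client("athena", region_name="eu-west-1")

query = athena.start_query_execution(
    QueryString="SELECT event_date, COUNT(*) FROM app_events GROUP BY event_date",
    QueryExecutionContext={"Database": "data_lake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)

print("Started Athena query:", query["QueryExecutionId"])
```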

Our Analytics team uses Matillion to access the Data Lake and transform raw data into a more analytics-ready form in Redshift. We then use Looker to visualise these data in Redshift. Looker also provides its own tools for transformation as part of LookML.

Data Science also loads data into the Data Lake when extracting training data for models from the FreeAgent app, as well as using Matillion and Looker.

We can then see that the extraction of data is handled by our Data Platform team. Data Science and, primarily, Analytics then build the transformations in the middle and load data into Looker. By making use of tools like Matillion, Redshift and Looker, we are able to carry out all this ETL work without dedicated Data Engineers. 

One aspect of our Data org that supports this approach is including Analytics within Engineering. This makes sense given we expect our Analytics team to have Data Engineering skills. It also reflects the importance of engineering to Analytics, as demonstrated by the emergence of Analytics Engineering in recent years.

Analytics

Engineering a Data Platform is great, but it’s not an end in itself. We need to do something with the data. This is where Analytics and Data Science enter the mix. As well as building data pipelines, our Data Analysts also analyse data using Looker, R, and Python.

Another important tool for analytics is Looker, where business users are able to self-serve insights. Product teams can maintain their own dashboards as well as using dashboards maintained by Analytics.

One thing that really impressed me early on at FreeAgent was seeing a dashboard a product team had created. I was tasked with building a model using some data that the team produces and they already had a dashboard showing the distributions, volume and quality of the data.

Architecture

As I’ve already noted, we don’t have dedicated Architects in our Data org. Instead, all our Data teams have a stake in our platform architecture. As the name suggests, our Data Platform team are the architects of the platform other teams build upon. Data Platform decides how our Data Lake should be set up or the best way for Matillion to access it. The modelling of Data within our Warehouse is owned by Analytics, as part of their engineering work. Data Science is also empowered to work out how we want to do Machine Learning. 

While establishing ownership is important, all these teams ultimately work closely together to decide how to set things up, with input from the broader Engineering community. For example, Data Platform might be responsible for building our Data Lake, but Analytics and Data Science also contribute to how we should build it, as two of the biggest users. As Analytics are responsible for building a Data Warehouse downstream of the Data Lake, they have a stake in how it’s built.

Machine learning

A post on Data wouldn’t be complete without mentioning machine learning. We recently hired a new Data Scientist (👋) and have talked in a previous post about what we look for in a Data Scientist.

In Data Science, we work on building customer-facing machine learning models. Our primary focus recently has been our model to categorise customers’ bank transactions. Data Science owns the whole process, from the initial analysis to understand a problem through to building the production model in AWS. We even make changes to the FreeAgent app to interact with our models! 

This approach to doing machine learning, where the same people build a model and put it into production, has become popular in recent years under the heading Machine Learning Engineering, as I’ve discussed elsewhere.

The benefits

Now I’ve outlined who does what, I want to talk about why I think this is a great way to organise things. 

One key benefit is that the same people design and build a given component of our data platform. This creates a sense of ownership as well as adding variety and challenge to people’s roles. Simply building or working with something you’ve had no part in designing becomes dull after a while. Hands-on experience building things also makes you a better designer. 

The importance of breadth is even clearer for Data Analysts given their role as Data Evangelists within the business. Analysts work directly with the business to make us more data-driven. This gives them a unique insight into how we ought to build our data platform to maximise impact. Analytics teams who have to rely on another team, perhaps in a different org, for data transformations will either be perennially frustrated or will build a shadow data warehouse to serve their needs. Enabling analysts to do engineering, as we do, avoids these issues. This approach also empowers self-serve – if your Data Warehouse is set up with analytics in mind, people will find it a lot easier to do analytics themselves. 

The final benefit of empowering data teams to do things for themselves is that you minimise hand-offs. The more end-to-end you allow a team to be, the less time you spend managing handovers and integrations between teams.

The limits

All these benefits sound great and the way we’ve structured our Data teams makes a lot of sense for FreeAgent. However, it’s worth considering the limitations of this approach and how they might be addressed.

Large enterprises with a wide range of products, data sources and technologies will likely find it difficult to do without dedicated architects. Our Analytics team looks after Data Modelling as part of doing analytics. This approach may become untenable for organisations with more complex data, requiring dedicated architects instead. Similarly, our Data Platform team design, build and maintain our data platform. Larger organisations with more complex data platforms may, again, need people dedicated to designing those platforms. However, we shouldn’t lose sight of the benefits that come from involving engineers and analysts in architectural decisions.

Another aspect of our approach that may not scale to larger organisations is the lack of dedicated Data Engineers. We are able to do our data transformations with tools accessible to Data Analysts and Data Scientists. If we had a larger volume of data in a wider range of formats we might need to use more specialised data engineering tools. It could be unreasonable to expect Data Analysts and Data Scientists to use these more specialised tools, which are better suited to dedicated Data Engineers. One interesting question to consider is how the latest generation of tools affects the point at which you need Data Engineers. Do cloud-native data warehouses like Redshift and BigQuery, combined with accessible transformation tools (Matillion, dbt), mean you can go a lot further without dedicated Data Engineers?

I feel like organisations of any size could include the Analytics Engineering and Machine Learning Engineering elements of our approach. Empowering analysts to do more engineering for themselves with tools that encourage software engineering best practices is always a plus. A quick search of ‘Analytics Engineering’ will show you the range of organisations adopting this approach. Equally, Machine Learning Engineering is here to stay as it just makes sense given the tooling and expectations for machine learning today.

Conclusions

This post has introduced Data at FreeAgent and the different teams we have. To recap, we have three teams:

  • Analytics maintain our Data Warehouse and Looker instance, do analyses and help the business become more data-driven
  • As the name suggests, Data Platform look after our data platform, the tools we use and how they all work together
  • Data Science own the end-to-end production of Machine Learning models

As much as possible, teams own, design and build the tools that they use. The result is ‘T-shaped’ teams with breadth outside of their core expertise. For example, in Data Science we have core expertise in machine learning while also being involved in the data platform to the left of our models and the serving infrastructure to the right. Taking such an approach gives teams the opportunity to drive their own impact within the organisation as well as creating varied and interesting roles for individuals.