This article outlines the pragmatic approach that we’ve followed here at FreeAgent in our first 18 months of using AWS to increase our cost efficiency.
Using this approach, we’ve already cut our AWS spend by 50%, and we estimate we can save another 30% a year by implementing further efficiencies. Here are 12 things we’ve learned along the way.
Our strategy
1. Don’t optimise for cost too early
We fully migrated from colocation facilities to the cloud on 1st December 2020. You can read more about that here.
When we designed our initial architecture and migration, it was just that – initial. We didn’t plan for the most cost-effective solution from the outset, instead prioritising security, reliability and performance.
Why? Complexity and data
Firstly, cost optimisation can add additional complexity and risk to something that’s already complex and fraught with risk. Minimise risk and complexity; they are enemies of success.
Secondly, data – or lack thereof. If you are starting with a new service, migrating a service or making large architectural changes, it can be difficult to predict which cost efficiency efforts will have the greatest impact when you have zero data on how the workload actually performs in practice.
Making decisions with data and real world usage metrics is far easier than trying to predict the usage and how that will materialise on an actual AWS bill.
I’m afraid you can’t cite the often misquoted “premature optimisation is the root of all evil” as a reason to avoid forecasting costs for your workload, or to skip the sound design decisions needed to meet budget constraints.
Instead it’s a nudge not to get caught optimising the solution by adding architectural complexity, or expending engineering effort that may not be the most impactful way to save costs. Without data you might just spend time focusing on the wrong part of the puzzle. For example, focusing on ensuring you can upgrade to an RDS Aurora version that supports slightly cheaper Graviton-based instances, when in reality your largest cost is paying for Aurora IOPS – or for RDS snapshots (which AWS have yet to implement cheaper cold storage for…).
2. Have meaningful measures
Ensure you track spend and can differentiate growth from changes in cost efficiency. We don’t use anything fancy for this. We make heavy use of multiple AWS accounts to segregate teams and workloads, which also helps with budget management at a high level. We also make use of, yes, spreadsheets.
Define KPIs to measure your effectiveness – the specific KPI might be different for each workload. Some examples we use (there’s a sketch of deriving one of them just after this list):
- Cost per user
- Cost per CI run
- Cost per engineer
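As an illustration, here’s a minimal sketch of how a “cost per user” figure could be pulled together from the Cost Explorer API. The account ID, date range and user count are hypothetical placeholders – in practice we track these figures in our spreadsheets.

```python
# Minimal sketch: derive a "cost per user" KPI from Cost Explorer.
# Account ID, dates and the user count are hypothetical placeholders.
import boto3

ce = boto3.client("ce")  # Cost Explorer

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    # Scope the query to the account that runs the workload in question
    Filter={"Dimensions": {"Key": "LINKED_ACCOUNT", "Values": ["111111111111"]}},
)

monthly_cost = float(
    response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"]
)

active_users = 100_000  # hypothetical figure from your own reporting
print(f"Cost per user: ${monthly_cost / active_users:.4f}")
```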
3. Process not project
Cost optimisation is a process, not a project to be completed. AWS constantly releases new services and features. Aim to implement cost optimisation as part of your ‘business as usual’ operations.
Don’t do this centrally – ensure anyone responsible for a workload has a process for reviewing cost efficiency and identifying potential actions. Workload owners know the most about their specific workload and are the most likely to be able to spot efficiencies and implement them safely.
Keep abreast of changes, monitor what’s new at AWS and keep an eye out for what you can take advantage of.
4. Create a prioritised list and review your bill
This is simple: start with the most expensive workloads and services, where any effort expended is likely to result in the largest saving. Work down the list systematically, looking for opportunities.
We maintain a spreadsheet of “cost saving opportunities” and consider the engineering investment, other business priorities and impact before deciding which to tackle and when.
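As a rough sketch of how that list can be seeded, the Cost Explorer API can group last month’s spend by service; the dates below are placeholders.

```python
# Rough sketch: last month's spend grouped by service, most expensive first,
# as a starting point for a "cost saving opportunities" list.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2022-05-01", "End": "2022-06-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

groups = response["ResultsByTime"][0]["Groups"]
by_cost = sorted(
    groups,
    key=lambda g: float(g["Metrics"]["UnblendedCost"]["Amount"]),
    reverse=True,
)

for group in by_cost[:10]:  # the ten most expensive services
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```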
Practical steps
5. Right sizing
It’s important to monitor your resources to ensure they are sized correctly for your workload. Anywhere you select an instance type or size needs review: RDS, Redshift, EC2 instances, ECS task sizes. AWS usually provides the CloudWatch metrics you need to decide whether an instance is right sized.
As part of right sizing, check whether you are on the latest instance generation – later generations of instances usually provide greater cost efficiency.
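A minimal sketch of pulling those metrics for a single EC2 instance, assuming a hypothetical instance ID; the same approach works for RDS and other services by switching the namespace and dimensions.

```python
# Minimal sketch: two weeks of CPU utilisation for one instance, to judge
# whether it is a candidate for downsizing. Instance ID is hypothetical.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,  # hourly datapoints
    Statistics=["Average", "Maximum"],
)

datapoints = response["Datapoints"]
avg_cpu = sum(d["Average"] for d in datapoints) / len(datapoints)
max_cpu = max(d["Maximum"] for d in datapoints)

# If even the peaks sit well below capacity, downsizing (or moving to a
# newer, cheaper instance generation) is worth investigating.
print(f"avg: {avg_cpu:.1f}%  max: {max_cpu:.1f}%")
```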
In our case, we performed some application optimisation where it made sense, to allow us to downsize instances too. For example, RDS Aurora was underutilised the majority of the time, with some large, inefficient tasks causing resource spikes. We were able to identify the potential saving and then justify the effort to make application changes with a solid cost/benefit analysis.
While EC2 instances/compute are the obvious target, don’t forget about EBS volumes. Are they sized correctly? Are you using the latest generation, i.e. gp3 rather than gp2?
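A hedged sketch of finding gp2 volumes that are candidates for gp3 – the migration itself is a one-line modify_volume call, but check IOPS and throughput requirements before running it.

```python
# Sketch: list remaining gp2 volumes; migrating to gp3 is usually cheaper
# for the same baseline performance. Review each volume before modifying.
import boto3

ec2 = boto3.client("ec2")

volumes = ec2.describe_volumes(
    Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
)["Volumes"]

for volume in volumes:
    print(f"{volume['VolumeId']}: {volume['Size']} GiB gp2")
    # In-place migration, once you're happy with the IOPS/throughput defaults:
    # ec2.modify_volume(VolumeId=volume["VolumeId"], VolumeType="gp3")
```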
6. Turn it off
This is AWS after all. Are there workloads or parts of workloads that can be turned off? Have they become defunct over time, or have you leaked EBS volumes?
Do those staging and integration environments need to run 24×7? We turn on/off integration environments as required, for example. This isn’t a huge saving for us when looking down the bill, but in our case it was low effort to implement. As with everything, this is workload dependent.
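As a sketch of the idea (not our exact implementation), instances tagged as belonging to an integration environment could be stopped out of hours from a scheduled job; the tag key and value here are hypothetical.

```python
# Sketch: stop running instances tagged Environment=integration, e.g. from a
# scheduled Lambda or cron job at the end of the working day.
import boto3

ec2 = boto3.client("ec2")

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["integration"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [
    instance["InstanceId"]
    for reservation in reservations
    for instance in reservation["Instances"]
]

if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopped {len(instance_ids)} integration instances")
```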
7. EC2 Reserved Instances/Savings Plans
If you’ve completed right sizing and have a workload where you can commit to a minimum amount of compute for at least one year, look at Reserved Instances – or now EC2 Instance Savings Plans – which give you a discount in exchange for committing to a minimum spend over a one- or three-year term. It’s not just EC2 – many services offer RIs, RDS and ElastiCache for example.
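If you want a starting point, Cost Explorer can suggest reservation purchases based on recent usage. A rough sketch, assuming the RDS service name and illustrative term/payment options – treat the output as input to your own analysis rather than a purchase order.

```python
# Rough sketch: ask Cost Explorer for RDS reservation recommendations based
# on the last 60 days of usage. Term and payment option are illustrative.
import boto3

ce = boto3.client("ce")

response = ce.get_reservation_purchase_recommendation(
    Service="Amazon Relational Database Service",
    LookbackPeriodInDays="SIXTY_DAYS",
    TermInYears="ONE_YEAR",
    PaymentOption="NO_UPFRONT",
)

for recommendation in response.get("Recommendations", []):
    for detail in recommendation.get("RecommendationDetails", []):
        # Print the suggested instances and the estimated monthly saving
        print(
            detail.get("InstanceDetails"),
            detail.get("EstimatedMonthlySavingsAmount"),
        )
```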
8. S3
We’ve called this out specifically because S3 is such a common service. It’s easy to start using it – it’s cheap at low volumes – and see costs creep up over time without realising there are cost-saving options. We make use of S3 lifecycle rules and storage classes, ensuring data that’s no longer needed is removed, and using more cost-effective storage classes where we can, such as Standard-IA, Glacier etc.
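As a sketch of the kind of lifecycle rule we mean, this transitions objects to cheaper storage classes and then expires them; the bucket name, prefix and timings are hypothetical.

```python
# Sketch: a lifecycle rule that moves objects to cheaper storage classes and
# eventually deletes them. Bucket, prefix and timings are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # Remove data that's no longer needed at all
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```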
Keep an eye on AWS data transfer and request costs too – transferring large amounts of data out of AWS, or making many small read/write requests, can add up.
9. Savings Plans
It’s a good idea to check the AWS docs carefully, possibly twice, because while Reserved Instances, EC2 Instance Savings Plans and Compute Savings Plans are similar, they are subtly different.
After we had made use of Reserved Instances where we could, we implemented Compute Savings Plans. These allowed us to cover AWS ECS Fargate and AWS Lambda with discounts for committed spend. As a bonus, Compute Savings Plans apply to all types of compute, so they are easier to make use of over a longer commitment.
Some services – CloudFront and SageMaker, for example – have their own savings plans, so look out for these.
10. EC2/Fargate Spot
We make heavy use of EC2 Spot for certain workloads – providing compute for automated testing, for example. EC2 Spot can provide huge savings but can require more effort to implement.
You have to be OK with having your workload interrupted. You can make interruptions less frequent by careful use of the options AWS provide, but by definition spot instances can and will be interrupted and will potentially have zero capacity available in some cases.
The savings can be substantial enough that it’s often worth some ‘out of the box’ thinking to see if you can make your application resistant to spot interruption and capacity fluctuations.
For our CI workload, we’re using EC2 Fleet with multiple instance types, multiple AZs and the capacity-optimised allocation strategy, which is proving much more reliable than our initial naive implementation of spot instances.
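A simplified sketch of an EC2 Fleet request along those lines – multiple instance types across multiple subnets/AZs with the capacity-optimised allocation strategy. The launch template name, subnet IDs and capacity figures are placeholders, not our actual configuration.

```python
# Simplified sketch: a spot EC2 Fleet diversified across instance types and
# AZs, using the capacity-optimized allocation strategy. All IDs/names are
# placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.create_fleet(
    Type="request",
    TargetCapacitySpecification={
        "TotalTargetCapacity": 10,
        "DefaultTargetCapacityType": "spot",
    },
    SpotOptions={"AllocationStrategy": "capacity-optimized"},
    LaunchTemplateConfigs=[
        {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "ci-runner",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": [
                # Diversify across instance types and AZs so interruptions
                # and capacity shortfalls are less likely to stall CI.
                {"InstanceType": "c5.2xlarge", "SubnetId": "subnet-aaaa1111"},
                {"InstanceType": "c5a.2xlarge", "SubnetId": "subnet-aaaa1111"},
                {"InstanceType": "c5.2xlarge", "SubnetId": "subnet-bbbb2222"},
                {"InstanceType": "c5a.2xlarge", "SubnetId": "subnet-bbbb2222"},
            ],
        }
    ],
)
print(response["FleetId"])
```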
11. Traffic and NAT Gateways
Traffic is harder to optimise for, and something to consider once it becomes a larger line item on your bill. Careful selection of how clusters are deployed, cross-AZ traffic and cross-region traffic all come into play. We’re fortunate that our workloads are not traffic heavy, so this hasn’t been something we’ve had to optimise heavily.
NAT Gateway costs are something to look out for, though – we managed to save a not insubstantial amount by implementing VPC Endpoints for AWS Elastic Container Registry, so container downloads didn’t traverse the NAT Gateway and incur extra charges.
Look out for traffic that uses NAT gateways, and if it’s traffic destined for AWS services, take a look at the cost/benefit of implementing AWS VPC Endpoints for those services.
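As a sketch, ECR needs two interface endpoints (the API and the Docker registry), plus a gateway endpoint for S3, where image layers are actually stored. The VPC, subnet, security group and route table IDs below are placeholders.

```python
# Sketch: VPC endpoints so ECR pulls no longer traverse the NAT Gateway.
# All resource IDs and the region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# ECR needs both the API endpoint and the Docker registry endpoint
for service in (
    "com.amazonaws.eu-west-1.ecr.api",
    "com.amazonaws.eu-west-1.ecr.dkr",
):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName=service,
        SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )

# Image layers are served from S3, so a (free) gateway endpoint keeps that
# traffic off the NAT Gateway as well.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```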
12. Autoscaling
This is often quoted but is something we have yet to implement for our production workload. The reason is that we simply had lower-hanging fruit to pick in our cost optimisation process, but it’s definitely something we’ll be looking at soon.
In summary
There are no magic bullets, and no third-party wonder tools. However, what has worked for us is being able to measure progress (KPIs) and have access to people who have a thorough understanding of the workload in question.
Take a systematic approach to work through the workloads and services from most to least expensive. For each one, apply the basic options that AWS makes available to increase cost efficiency.
Don’t implement everything for every workload: assess impact versus effort, and don’t forget the opportunity cost of embarking on a large cost optimisation effort.