Seven years ago we started planning our first major infrastructure migration. Nine months later we made the move, taking FreeAgent from our first home in Rackspace London to a new, co-located home in two data centres (DCs) run by The Bunker. FreeAgent has been happily humming along in Ash and Greenham Common ever since.
Co-locating has been a terrific win for us over the years, providing us with a cost-effective, high performance compute platform that has allowed us to scale to over 95,000 customers. I wrote all about it back in 2017, and our uptime since then has continued to be excellent:
It’s fair to say this co-located infrastructure plan has been a huge success for the business, measured not only by great uptime but also by excellent performance (an end-user Apdex score above 0.93 and an average end-user page load time of 1.2s) and by infrastructure costs that remain well below 1% of ARR.
Everything changes
As we know – especially in tech – nothing stands still. Everything changes. Since 2017 we’ve seen dramatic change at FreeAgent:
- Our company headcount has doubled, from ~120 in 2017 to > 240 today
- Engineering staff numbers have grown from 40 to > 100
- Customer numbers have increased from 65,000 to > 95,000 and are still growing quickly
This growth put us at a crossroads with respect to the infrastructure of the FreeAgent app. Co-location was undoubtedly the right decision for the business at the time, but we wanted to take the time to re-evaluate this position for a number of reasons:
- Was there an opportunity to increase resilience? We had developed world-class, data centre-level redundancy and introduced second cabinets in each DC (establishing a networking pattern for expanding into additional cabinets), but we were still operating through a single provider. Adding a third DC with an alternative provider would reduce this risk exposure and also let us experiment with a more resilient ‘always-on’ architecture
- Reaching hardware limitations. We were approaching the limits of our existing physical hardware, which could only be addressed through manual upgrades and replacements (e.g. SSD pools, RAM, a third cabinet, a third DC)
- Coping with demand on our Ops engineers. A big increase in both customers and engineers had driven a large increase in demand on our Ops team: general infrastructure expansion, a greater focus on security obligations, and new requirements from fast-growing engineering and data science teams. To keep up we would need to make a significant investment in physical infrastructure, in hiring additional Ops engineers and site reliability staff to keep the lights on (KTLO), and in in-house tooling
- Finding people with the right skills. It was proving to be challenging to hire Ops engineers who had existing skills – or had an interest in acquiring skills – in the technologies we were using, in particular Joyent’s Triton compute platform
- Progress in the Infrastructure/Platform as a Service space. Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) offerings had evolved enormously over the past 5 years and we wanted to take the time to consider what benefits they might bring us
In addition to this, there were some “known knowns” about the future of FreeAgent that we felt should influence our technical direction:
- Machine learning (ML) applications and services. We were already experimenting with an ML service and knew we’d be building more in the future. Hosting ML services internally alongside our other apps would mean building out their own hosting, deployment and monitoring, so we wanted the option to leverage cloud-based services such as AWS SageMaker in order to deliver business value more quickly
- We would see an increase in our use of serverless compute. We were already using AWS Lambda in production and could see other areas where serverless would be useful
- We wanted to evolve our approach to deployments. Our deployment mechanism was fine, but we were already considering how to optimise our approach and unify it across all our services
- Data Warehousing and Business Intelligence requirements. We had experienced a large increase in our use of event-tracking and data warehousing, and the demand for business intelligence and data reporting was increasing across the business
- Demand on our CI/CD pipeline was increasing fast. With more engineers on the team, the number of tests we were writing was growing quickly, resulting in increased demand on our CI/CD platform. How could we scale this up, improve our average number of deploys per day per developer, and keep test run times as low as possible?
- Scaling the database was a big challenge. The amount of data we were storing on behalf of customers was increasing, along with the size of our largest database tables. How would we scale our database? Would we need sharding, for instance?
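To make the sharding question above concrete, here is a minimal sketch of one common approach: deterministically routing each customer’s data to one of a fixed number of database shards by hashing a stable key. This is purely illustrative; the shard count, naming, and connection strings are hypothetical and don’t describe FreeAgent’s actual schema or infrastructure.

```python
import hashlib

# Hypothetical shard layout: N shards, each reachable via its own DSN.
# All names here are made up for illustration.
NUM_SHARDS = 4
SHARD_DSNS = [f"mysql://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)]

def shard_for(customer_id: int) -> int:
    """Map a customer ID to a shard index, deterministically."""
    digest = hashlib.sha256(str(customer_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def dsn_for(customer_id: int) -> str:
    """Return the connection string for the shard holding this customer's rows."""
    return SHARD_DSNS[shard_for(customer_id)]

# The same customer always routes to the same shard:
assert shard_for(12345) == shard_for(12345)
```

The catch, and part of why sharding is a big decision, is that a simple modulo scheme reshuffles almost every customer if `NUM_SHARDS` changes; schemes like consistent hashing or directory-based lookup exist largely to soften that resharding cost.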
Ultimately there were three options in front of us: continue with our co-lo approach, take a hybrid approach (co-lo plus some cloud), or go full cloud. A decision of this magnitude isn’t something you can make easily, and whatever we chose to do would have a huge impact on the business for the next 5-10 years.
We felt we understood how to scale co-lo and understood the challenges. A hybrid approach didn’t fill us with joy (you get the best of both worlds, but also the worst of both!), so we wanted to focus our efforts on understanding the implications of migrating FreeAgent and its supporting services to public cloud infrastructure. To do this, we decided to undertake a time-boxed, three-month R&D exercise geared to provide enough data and insight to allow us to make the right decision for the long-term future of our infrastructure.
Which mountain should we climb?
As with every engineering project at FreeAgent, we kicked it off by writing an RFD:
The recommendations that we proposed in the RFD were:
- Choose AWS as the cloud provider
- This should not be a “lift and shift” project. We would focus on leveraging managed services wherever possible (e.g. Aurora instead of MySQL, AWS Elasticsearch instead of hosting our own)
- We would deploy the app and related services in containers using Docker
- Containers would be orchestrated using Kubernetes
- Infrastructure as code would be managed using Terraform or AWS CloudFormation
Despite only being able to commit two people full-time to the project, progress was rapid. After just six weeks, a report was written that went into detail on what had been learned, how our thinking had changed in some cases, and what progress we had made to date. Several weeks later we came to the end of the research stage of the project and a final, more detailed report was written that put forward a much clearer picture of what FreeAgent and our infrastructure would look like after a full migration to the cloud.
The outcome was encouraging. Being in a position to fully leverage compute on demand and take advantage of the increasing number of fully managed services that AWS offers was an extremely exciting prospect for all our engineering teams, as well as the wider business. Granted, any infrastructure migration would be expensive, the project complex and it would come with many challenges, but the advantages and opportunities that a full cloud migration would open up in the future were undeniable.
The decision was made to migrate to AWS! 🎉
Embarking on the journey
The next few months were spent writing up detailed reports and creating OKRs across multiple teams. We hired a project manager who was experienced in cloud migrations to help draw up the plan of precisely how we intended to undertake the migration and what resources – both people/time investment and cash – we would need for it to be successful. As well as recruiting staff to bring AWS experience to the team, we supported training for upskilling our existing team members. We also had to understand what additional work would be required during the migration period in our existing DCs, and what capacity we would need to bring online there (if any) in order to cope with interim demand. While we would ultimately be retiring our data centres, we absolutely had to retain our world-class service while the migration was happening.
Several months later, we are well on our way to our new home in AWS and we’re due to arrive around the middle of 2020!
It’s a fascinating time in the world of cloud computing and we’ve been presented with an opportunity to learn and take full advantage of cutting edge technologies that will allow us to move more quickly and take on challenges that have previously felt out of reach. We’re hoping to blog more regularly as we approach the big day, about the progress of the project and about the specific technical decisions we’re making along the way – for example, why didn’t we choose Kubernetes in the end? – so keep an eye out for that. And if you’re interested in getting involved and joining us on this exciting journey, we’re hiring!
Would be interesting to know whether you also considered Microsoft’s or Google’s cloud offerings, and what swung it for you to go with AWS?
Hi Matthew. We did consider GCP but we were already using AWS in a number of ways in production (S3, Redshift, Lambda) so that – along with the fact that we feel it’s the most mature cloud platform out there – sealed the deal for us.
Great write-up. Very insightful. Just curious, did you ever consider using an AWS consulting service that provides AWS migration services? If so, why was the decision made to manage the migration in-house (it sounds like the decision was made to hire AWS experience such as the project manager).
Hi Mark. We have well over a decade of experience building and maintaining infrastructure, both in data centres and in the cloud, so we felt the migration was well within our capabilities. We’ll also be responsible for evolving and maintaining this infrastructure for years to come, so it’s critical that our engineers understand every aspect of the migration in the deepest detail. As such, outsourcing the migration wasn’t the right strategy for the business, and it wasn’t something we considered.
Just to add to Olly’s response, early on in the R&D phase we became customers of Gruntwork.io and have relied heavily on their Infrastructure as Code library and training to accelerate the project.
While we’re handling the migration itself, we’ve benefitted from a tried and tested target environment and well-defined house style for structuring that environment and its code.