Moving 100,000 customers from co-lo data centres to the cloud. With zero downtime.

Running a SaaS app in AWS in 2020 is, in itself, not a particularly remarkable thing. Migrating a complex Rails app that is used daily by over 100,000 customers to an entirely different infrastructure, introducing a new underlying architecture, and doing so without customer downtime, really is something special.

In this article we take a look behind the curtain to see how we achieved this recently at FreeAgent.

It’s almost a year to the day since I wrote about our decision to move the FreeAgent infrastructure from co-located data centres to AWS and I am thrilled to announce that on 1st December 2020 the primary migration was completed. The FreeAgent app is now running 100% in AWS! 🎉

Firstly, what has changed?

In our former co-located data centre (DC) world, we used bare metal servers in our own private cloud using Joyent’s Triton Compute. Hypervisors ran SmartOS and our infrastructure was managed by Puppet. We pretty much managed all the services ourselves in-house, including (but not limited to):

A 5-node MySQL cluster in each DC, each server having 24 cores, 128GB RAM and a dedicated SSD pool
Two RabbitMQ clusters per DC, for each environment
Two Elasticsearch clusters per DC, for each environment
Two ZooKeeper clusters per DC, for each environment
NGINX for load balancing
7 web servers in each DC, each running 13 unicorn workers for the Rails app
8 job servers in each DC, each running 8 delayed job workers
Grafana + Graphite for metrics
Deployments managed by our own in-house system, Parachute
CI managed by Jenkins with a custom test_queue setup. 6,000 builds/month, 450 production deployments/month, 50,000+ tests per build and deployment run. This was powered by 120 agent nodes, 576 cores and a whopping 4.5TB RAM.
Nessus for vulnerability audits
Nagios for infrastructure and app monitoring/alerting
MCollective for orchestration

Not everything was run in-house. Over the years, even before the AWS migration was considered, we were making use of AWS S3, Redshift and Lambda, but these were very much supplemental technologies.

In our new AWS home, the high-level stack looks very different. We are leaning heavily on managed services which, in theory, will free up a significant amount of time for our platform engineers so they can focus on scaling, tooling, DevOps training and developer evangelism across the engineering organisation.

The primary services we are using in our AWS world are:

Infrastructure as code using Terraform, Gruntwork and Atlantis
2-node Aurora cluster (db.r5.8xlarge: 32 vCPUs, 256GB RAM)
Elastic Load Balancers, both ALB and NLB
KMS for secure key management
Amazon Elasticsearch Service
RabbitMQ managed by CloudAMQP
Docker containers managed by ECS Fargate
CloudWatch Metrics
SageMaker
48 EC2 instances for Jenkins (682 compute cores, 1.5TB RAM)
Deployments managed by Harness
CloudFront CDN
AWS Lambda – serverless compute
AWS SecurityHub and friends (GuardDuty, AWS Config, Macie) giving us improved, real time visibility into the overall security posture of our estate

Not all of our AWS capacity is required 24/7. Integration environments (used by engineers to push pre-production branches for testing) are shut down out-of-hours. Our build and test infrastructure scales up for our working-day load and for overnight automated test runs. This means we have compute capacity when we need it, can scale up temporarily for performance testing, and are not trapped into paying for capacity we only occasionally use.

Data Centre ‘Flips’

We’ve been doing ‘data centre (DC) flips’ for years in our co-located world. Previously we operated two geographically isolated DCs – one live, one standby – and every quarter we would switch between them to fully test our disaster recovery process and allow us to perform maintenance in the standby DC. I wrote more about this back in 2017.

For the migration to our new infrastructure, we treated AWS as just another data centre. We took the process of flipping between our standby and live DCs and made modifications so it worked seamlessly with our AWS infrastructure.

The process of a flip can be broken out into two stages: the preparation work (“pre-flip”) and the actual traffic migration (“the flip”).

Pre-flip

Configure primary-primary MySQL replication between our data centre and Aurora
Freeze deploys at 12pm the day before, allowing us to go through our pre-flight checks and prepare pull requests required to do the flip without interruption
Go through pre-flight checklists, such as making sure all the tools/scripts used to do the flip are working, making sure our AWS environment is scaled up ready to handle the load, maintenance periods are scheduled and the team and customers are notified
Create pull requests for the flip and for the roll back. Pretty much everything we do is driven by pull requests: these disable job workers and scheduled tasks in the live site, enable job workers and scheduled tasks in the new site, and make the DNS changes to re-point everything to AWS
Run some standard deploys to make sure all the deployment pipelines are working, making sure we have Jenkins nodes ready to go as they get shut down when not in use

The Flip

Push a deploy out to the live site to disable job workers
Put the database into read-only mode
Check replication is up to date between the live and target sites
Make a note of the database binary log position just in case we need it (thankfully we have never needed it!)
Put the target database into read-write mode
Deploy DNS changes (we have a 30s TTL on our domains)
Push a deploy out to enable job workers in the target site
TEST!
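
The ordering of these steps is what keeps the cut-over safe: writes stop on the live site before the target starts accepting them, and traffic only moves once replication has caught up. As a rough illustration only, the sequence could be modelled as below. Everything in it is hypothetical: the Site class, its fields, the flip function and the hostname are invented for this sketch and stand in for the real deploys, database commands and DNS changes. A real flip would also wait for replication to catch up rather than abort immediately.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Toy stand-in for one site (a data centre or AWS). All fields are illustrative."""
    name: str
    read_only: bool
    job_workers_enabled: bool
    applied_position: int  # how much of the replication stream this site has applied


def flip(live: Site, target: Site, dns: dict, hostname: str) -> int:
    """Move traffic from `live` to `target`; returns the recorded log position."""
    # 1. Stop background work on the live site.
    live.job_workers_enabled = False
    # 2. Stop writes on the live site.
    live.read_only = True
    # 3. Refuse to continue unless the target has applied everything.
    if target.applied_position != live.applied_position:
        raise RuntimeError("replication is behind; aborting before DNS is changed")
    # 4. Record the position in case a manual recovery is ever needed.
    recorded_position = live.applied_position
    # 5. Allow writes on the target.
    target.read_only = False
    # 6. Re-point DNS at the target.
    dns[hostname] = target.name
    # 7. Resume background work on the target.
    target.job_workers_enabled = True
    return recorded_position


# Hypothetical example values.
dc = Site("dc", read_only=False, job_workers_enabled=True, applied_position=1042)
aws = Site("aws", read_only=True, job_workers_enabled=False, applied_position=1042)
dns = {"app.example.test": "dc"}
position = flip(dc, aws, dns, "app.example.test")
```

Because the roll back is the same procedure with the two sites swapped, the same function covers both directions in this toy model.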

This final testing phase is where we have a number of people (typically product engineers) kick the tyres of every feature in the app. These people are specialists who know their area inside out and can very quickly figure out whether something looks amiss. During this phase we are also paying close attention to Rollbar (for application exceptions), monitoring DataDog (our monitoring platform) and Humio (our log aggregator).

The whole flip process only takes around 10-15 minutes and most of that is waiting for the deploys to go out to disable and re-enable job workers. We are in read-only mode in both sites for less than 1 minute and the majority of DNS resolvers respect our TTL, so for most customers there would have been no more than a minute or so where the app may not have been operating fully.

One major advantage of having such a reliable, repeatable process is that we lower the barrier to undertaking a DC flip. The value of this is that we can run multiple test migrations with real production traffic before making one final leap.

How we approached the production migration

Four weeks prior to the final, planned move, we ran our first actual production flip. The idea was to run production traffic on our new AWS infrastructure to smoke test the environment with actual customer traffic. After 90 minutes we would flip back to our data centres and spend the subsequent days analysing the telemetry data to try and identify any issues or unexpected performance bottlenecks. We had already spent a large amount of time undertaking load tests on the new infrastructure, so we were actually pretty confident that Aurora would be able to handle the load as well as our MySQL clusters do.

Flip One

Our first attempt was called off at the eleventh hour, just as we were finishing the very last checks on our pre-flight checklist. We hit two issues that didn’t manifest in our testing on our staging environment. The first was that our ECS tasks failed to deploy because they were unable to decrypt secrets correctly. This turned out to be a size limitation in our implementation: although we had tested on staging, some production secrets were longer. Production secrets were about 24 bytes over this limit, and our staging tests about 4 bytes under!
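
A pre-flight check along the following lines is one way to catch this class of problem before deploy night. It is only a sketch: the article does not state the actual limit or how secrets are stored, so LIMIT_BYTES, the secret name and the oversized_secrets function are all made up for illustration. The example values are chosen to mirror the 24-bytes-over and 4-bytes-under figures.

```python
LIMIT_BYTES = 4096  # hypothetical limit; the real value is not given in the article


def oversized_secrets(secrets: dict[str, str], limit: int = LIMIT_BYTES) -> dict[str, int]:
    """Return {name: bytes over the limit} for every secret that would not fit."""
    over = {}
    for name, value in secrets.items():
        size = len(value.encode("utf-8"))  # measure bytes, not characters
        if size > limit:
            over[name] = size - limit
    return over


# Illustrative values: staging sits 4 bytes under the limit, production 24 over.
staging = {"APP_CONFIG": "x" * (LIMIT_BYTES - 4)}
production = {"APP_CONFIG": "x" * (LIMIT_BYTES + 24)}

print(oversized_secrets(staging))     # {}
print(oversized_secrets(production))  # {'APP_CONFIG': 24}
```

The point of the sketch is that the check runs against production-sized values, not staging ones.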

Take Two

The second issue manifested itself as a missing credentials error for one of our deployment tasks. During deployment to AWS, an ECS task is launched that uses elastic-whenever to create scheduled tasks that launch Fargate tasks. After triaging the issue we realised the error was misleading. We were actually hitting an AWS API rate limit when elastic-whenever was attempting to schedule tasks. Again, we had tested on staging, but with ten fewer tasks than we were now attempting to schedule in production.
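
A common way to cope with API throttling is to retry with exponential backoff and jitter. The sketch below shows that general technique only; it is not how elastic-whenever or the AWS SDK behave, and the article does not say how the issue was actually fixed. ThrottledError, call_with_backoff and schedule_task are hypothetical names.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical error raised when the remote API rejects a call for rate limiting."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on ThrottledError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real cause
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter spreads retries out


# Usage: a fake API call that is throttled twice before succeeding.
calls = {"count": 0}


def schedule_task():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise ThrottledError("rate exceeded")
    return "scheduled"


result = call_with_backoff(schedule_task, sleep=lambda seconds: None)  # skip real sleeping
```

Re-raising the original error once the attempts run out matters here, since the problem described above was a rate limit hidden behind a misleading credentials message.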

Once these issues were fixed, the second attempt didn’t go quite according to plan either! While there were no issues with the flip process itself, there were a couple of minor issues that affected the monitoring – we couldn’t see Real User Monitoring in DataDog, for example. We also found an SSL-related misconfiguration which meant some background jobs failed. These were re-run after we flipped back so customers were not affected.

Third time lucky

The third attempt went even less according to plan! As part of the migration we were moving from long-lived virtual machines to containers that are recreated on each deploy. This requires a different deployment strategy, so we elected to go with Harness for AWS while retaining our existing pipeline in parallel.

On the night of the flip we pushed out a test deployment to AWS and it failed due to Harness being unable to connect to its delegates, which are themselves ECS containers.

After some investigation it appeared that the delegates had been restarted due to some ECS maintenance. While new containers had been provisioned automatically, Harness couldn’t see them. We called off the flip, again at the eleventh hour, and investigated the issue further the following morning.

It turned out that these containers hadn’t been restarted in some time and were still running with our trial configuration details rather than the required production configuration. This difference only manifested itself once the containers were restarted.

Go Forth And Conquer

We scheduled the next flip a few days later and this fourth flip exceeded expectations!

As with the previous flips, we were only planning on moving to AWS for 90 minutes to make sure we had fixed all the issues from the first flip and to be on the lookout for other issues we had missed. Everything looked good, the monitoring was working as expected and the SSL issues were resolved.

We did, however, spot one performance-related issue with bank statement imports. We knew that in AWS the latency between the application servers and the database was going to increase by a few milliseconds. For the most part this isn’t much of an issue; however, some bank statement imports were taking 2-3x longer than before. Looking at the traces in DataDog it became obvious what the problem was: this particular job was importing 115 bank transactions, which generated 3801 spans, 1192 of which were database queries. Tack on a few milliseconds to each query due to the increased latency and suddenly the job is spending multiple extra seconds waiting on the database!

Seasoned Rails developers will probably recognise this as the “N+1 query problem”. In this case it wasn’t considered a major issue since these jobs run in the background, but we still felt the need to fix this before the final flip.
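
For readers less familiar with the pattern, here is a minimal, self-contained sketch of an N+1 query and a batched alternative, using Python and an in-memory SQLite database in place of Rails and MySQL. The schema, table names and the 3 ms of added latency per query are assumptions made purely for illustration; only the 115 transactions and 1192 queries come from the trace described above. In Rails the usual remedy is eager loading, for example with includes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bank_transactions (id INTEGER PRIMARY KEY, category_id INTEGER);
""")
conn.executemany("INSERT INTO categories VALUES (?, ?)",
                 [(i, f"category-{i}") for i in range(1, 11)])
conn.executemany("INSERT INTO bank_transactions VALUES (?, ?)",
                 [(i, i % 10 + 1) for i in range(1, 116)])  # 115 rows

query_count = 0


def query(sql, params=()):
    """Run a query and count it, standing in for one round trip to the database."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def names_n_plus_one():
    rows = query("SELECT id, category_id FROM bank_transactions ORDER BY id")
    # One extra query per row: the "N" in N+1.
    return [query("SELECT name FROM categories WHERE id = ?", (cid,))[0][0]
            for _, cid in rows]


def names_batched():
    rows = query("SELECT id, category_id FROM bank_transactions ORDER BY id")
    ids = sorted({cid for _, cid in rows})
    marks = ",".join("?" * len(ids))
    names = dict(query(f"SELECT id, name FROM categories WHERE id IN ({marks})", ids))
    return [names[cid] for _, cid in rows]


query_count = 0
slow = names_n_plus_one()
slow_queries = query_count  # 1 + 115 = 116

query_count = 0
fast = names_batched()
fast_queries = query_count  # 2

EXTRA_LATENCY_S = 0.003  # assumption: 3 ms of added latency per query
print(slow_queries, fast_queries, slow == fast)  # 116 2 True
print(f"{1192 * EXTRA_LATENCY_S:.2f}s added across 1192 queries")  # 3.58s added ...
```

Both functions return the same data; the difference is the number of round trips, which is exactly what a few extra milliseconds of latency multiplies.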

The final migration

Having completed a successful flip on Thursday 26th Nov, we scheduled the final migration for the following Tuesday, 1st Dec. We had run the process four times in the previous three weeks, so this particular evening, while a hugely momentous one for the business, followed the exact same process and ran even more smoothly than the previous four. 75 minutes after starting the migration, everyone said goodnight and went to bed! You might think this was something of an anti-climax, which is great because that’s exactly the result we wanted – no drama!

Celebration 🎉

Getting FreeAgent into AWS was the primary goal of a project we kicked off almost two years ago and it’s something we’ll take the time to celebrate. It took an enormous amount of effort, both in time invested in the implementation and in significant upskilling on the job, by a team of nearly 20 people in total.

Not only have we migrated our infrastructure, in flight, with virtually zero impact on our 100,000+ customers, we have also managed to do this without materially changing the workflow or productivity of our 100+ engineering staff. A project of this magnitude could easily have distracted many of our engineering teams and had a significant impact on company objectives, but our cloud migration team were very aware of this from the start and did an incredible job ensuring minimal, if not negligible, impact on our engineers. It’s a remarkable accomplishment!

Next steps

There is still work to do on fully decommissioning our data centres (does anyone want to buy 50 Dell hypervisors? 😊) in the new year, and then the work will start on the next phase of the project. We’ll be evaluating the usage patterns of our customers and trying to better understand how the new infrastructure operates at scale. We will be looking at cost optimisations and at reducing unnecessary operational overhead. Finally, we will be embarking on longer term work to look at how AWS services can be used to our advantage. One of the big draws of being in AWS is the toolkit that we get at our disposal, and we want to leverage it to deliver customer impact and increase productivity within our engineering organisation.

Thank you to everyone involved. It’s been an incredible, humbling experience to watch so many talented people deliver an insanely complex project and make it look easy 🙏

Grinding Gears

Tales of code crunching from the FreeAgent Engineering team