Moving 100,000 customers from co-lo data centres to the cloud. With zero downtime.

Posted on December 21, 2020

Running a SaaS app in AWS in 2020 is, in itself, not a particularly remarkable thing. Migrating a complex Rails app that is used daily by over 100,000 customers to an entirely different infrastructure, introducing a new underlying architecture, and doing so without customer downtime, really is something special. 

In this article we take a look behind the curtain to see how we achieved this recently at FreeAgent.


It’s almost a year to the day since I wrote about our decision to move the FreeAgent infrastructure from co-located data centres to AWS and I am thrilled to announce that on 1st December 2020 the primary migration was completed. The FreeAgent app is now running 100% in AWS! 🎉

Firstly, what has changed?

In our former co-located data centre (DC) world, we ran bare metal servers in our own private cloud, built on Joyent’s Triton Compute. Hypervisors ran SmartOS and our infrastructure was managed by Puppet. We managed pretty much all the services ourselves in-house, including (but not limited to):

  • A 5-node MySQL cluster in each DC, each server having 24 cores, 128GB RAM and a dedicated SSD pool
  • Two RabbitMQ clusters per DC, for each environment
  • Two Elasticsearch clusters per DC, for each environment
  • Two Zookeeper clusters per DC, for each environment
  • NGINX for load balancing
  • 7 web servers in each DC, each running 13 unicorn workers for the Rails app
  • 8 job servers in each DC, each running 8 delayed job workers
  • Grafana + Graphite for metrics
  • Deployments managed by our own in-house system, Parachute 
  • CI managed by Jenkins with a custom test_queue setup. 6,000 builds/month, 450 production deployments/month, 50,000+ tests per build and deployment run. This was powered by 120 agent nodes, 576 cores and a whopping 4.5TB RAM. 
  • Nessus for vulnerability audits
  • Nagios for infrastructure and app monitoring/alerting
  • Mcollective for orchestration

Not everything was run in-house. Over the years, even before the AWS migration was considered, we were making use of AWS S3, Redshift and Lambda, but these were very much supplemental technologies.

In our new AWS home, the high-level stack looks very different. We are leaning heavily on managed services which, in theory, will free up a significant amount of time for our platform engineers so they can focus on scaling, tooling, DevOps training and developer evangelism across the engineering organisation.

The primary services we are using in our AWS world are:

  • Infrastructure as code using Terraform, Gruntwork and Atlantis
  • 2-node Aurora cluster (db.r5.8xlarge: 32 vCores, 256GB RAM)
  • Elastic Load Balancers, both ALB and NLB
  • KMS for secure key management
  • Amazon Elasticsearch Service
  • RabbitMQ managed by CloudAMQP
  • Docker containers managed by ECS Fargate 
  • CloudWatch Metrics
  • SageMaker
  • 48 EC2 instances for Jenkins (682 compute cores, 1.5TB RAM)
  • Deployments managed by Harness
  • CloudFront CDN
  • AWS Lambda – serverless compute
  • AWS Security Hub and friends (GuardDuty, AWS Config, Macie), giving us improved, real-time visibility into the overall security posture of our estate

Not all of our AWS capacity is required 24/7. Integration environments (used by engineers to push pre-production branches for testing) are shut down out-of-hours. Our build and test infrastructure scales up for our working-day load and for overnight automated test runs. We can have compute capacity when we need it, scale up temporarily for performance testing, and avoid paying for capacity we only occasionally use.
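
We haven’t gone into the detail of how that scheduling is wired up, but as a rough sketch of the idea, scheduled scaling actions against an Auto Scaling group look something like the snippet below. The group name, region, sizes and times are illustrative, not our actual configuration.

```ruby
# Sketch only: scheduled scaling actions for a hypothetical Auto Scaling
# group of build agents. Names, region, sizes and times are illustrative.
require "aws-sdk-autoscaling"

autoscaling = Aws::AutoScaling::Client.new(region: "eu-west-1")

# Scale up ahead of the working day (recurrence is a UTC cron expression).
autoscaling.put_scheduled_update_group_action(
  auto_scaling_group_name: "ci-agents",            # hypothetical group name
  scheduled_action_name:   "scale-up-for-the-day",
  recurrence:              "0 7 * * 1-5",
  min_size:                10,
  max_size:                48,
  desired_capacity:        40
)

# Scale back down once the overnight test runs have finished.
autoscaling.put_scheduled_update_group_action(
  auto_scaling_group_name: "ci-agents",
  scheduled_action_name:   "scale-down-overnight",
  recurrence:              "0 22 * * *",
  min_size:                0,
  max_size:                48,
  desired_capacity:        0
)
```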

Data Centre ‘Flips’ 

We’ve been doing ‘data centre (DC) flips’ for years in our co-located world. Previously we operated two geographically isolated DCs – one live, one standby – and every quarter we would switch between them to fully test our disaster recovery process and allow us to perform maintenance in the standby DC. I wrote more about this back in 2017.

For the migration to our new infrastructure, we treated AWS as just another data centre. We took the process of flipping between our standby and live DCs and made modifications so it worked seamlessly with our AWS infrastructure. 

The process of a flip can be broken out into two stages: the preparation work (“pre-flip”) and the actual traffic migration (“the flip”).

Pre-flip

  • Configure primary-primary MySQL replication between our data centre and Aurora (see the sketch after this list)
  • Freeze deploys at 12pm the day before, allowing us to go through our pre-flight checks and prepare pull requests required to do the flip without interruption
  • Go through pre-flight checklists: making sure all the tools and scripts used to do the flip are working, that our AWS environment is scaled up and ready to handle the load, that maintenance periods are scheduled, and that the team and customers have been notified
  • Create pull requests for the flip and for the rollback. Pretty much everything we do is driven by pull requests: disabling job workers and scheduled tasks in the live site, enabling them in the new site, and making the DNS changes to re-point everything at AWS
  • Run some standard deploys to confirm all the deployment pipelines are working and that we have Jenkins nodes ready to go (they get shut down when not in use)
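
For illustration, the Aurora side of that primary-primary replication setup (the first item in the list above) looks roughly like the sketch below. Hostnames, credentials and binlog coordinates are placeholders, and the reverse direction (Aurora back to the data centre) is configured separately on the data centre side with a normal CHANGE MASTER TO. Aurora exposes replication control through the mysql.rds_* stored procedures:

```ruby
# Sketch only: point Aurora at the external (data centre) primary using the
# mysql.rds_* stored procedures. Hostnames, credentials and binlog
# coordinates are placeholders.
require "mysql2"

aurora = Mysql2::Client.new(
  host:     "aurora-writer.example.internal",
  username: "admin",
  password: ENV.fetch("AURORA_ADMIN_PASSWORD")
)

# Tell Aurora where to replicate from: host, port, user, password,
# binlog file, binlog position, and whether to use SSL.
aurora.query(<<~SQL)
  CALL mysql.rds_set_external_master(
    'dc-mysql-primary.example.internal', 3306,
    'repl_user', 'repl_password_placeholder',
    'mysql-bin.000042', 154, 0
  )
SQL

# Start the replica threads.
aurora.query("CALL mysql.rds_start_replication")
```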

The Flip

  • Push a deploy out to the live site to disable job workers
  • Put the database into read-only mode (the database steps are sketched after this list)
  • Check replication is up to date between the live and target sites
  • Make a note of the database binary log position just in case we need it (thankfully we have never needed it!)
  • Put the target database into read-write mode
  • Deploy DNS changes (we have a 30s TTL on our domains)
  • Push a deploy out to enable job workers in the target site
  • TEST!
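
Here is the database portion of the flip, sketched as a small Ruby script using the mysql2 gem. Hostnames and credentials are placeholders, and in reality this is driven by our flip tooling rather than an ad-hoc script; note also that on Aurora the read_only setting is managed via the cluster parameter group rather than SET GLOBAL, so treat step 4 as the plain-MySQL version.

```ruby
# Sketch only: the database steps of a flip, with placeholder hostnames and
# credentials. The real process is driven by our flip tooling.
require "mysql2"

creds  = { username: "admin", password: ENV.fetch("DB_ADMIN_PASSWORD") }
live   = Mysql2::Client.new(creds.merge(host: "live-db-primary.example.internal"))
target = Mysql2::Client.new(creds.merge(host: "target-db-primary.example.internal"))

# 1. Put the live database into read-only mode.
live.query("SET GLOBAL read_only = ON")

# 2. Check replication on the target site has caught up.
lag = target.query("SHOW SLAVE STATUS").first["Seconds_Behind_Master"]
raise "replication is #{lag.inspect}s behind, aborting the flip" unless lag == 0

# 3. Note the live binary log position, just in case we ever need it.
status = live.query("SHOW MASTER STATUS").first
puts "binlog position: #{status['File']}:#{status['Position']}"

# 4. Put the target database into read-write mode (plain-MySQL version;
#    on Aurora this toggle is handled differently).
target.query("SET GLOBAL read_only = OFF")
```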

This final testing phase is where we have a number of people (typically product engineers) kick the tyres of every feature in the app. These people are specialists who know their area inside out and can very quickly figure out whether something looks amiss. During this phase we are also paying close attention to Rollbar (for application exceptions), monitoring DataDog (our monitoring platform) and Humio (our log aggregator).

The whole flip process only takes around 10-15 minutes and most of that is waiting for the deploys to go out to disable and re-enable job workers. We are in read-only mode in both sites for less than 1 minute and the majority of DNS resolvers respect our TTL, so for most customers there would have been no more than a minute or so where the app may not have been operating fully.

One major advantage of having such a reliable, repeatable process is that we lower the barrier to undertaking a DC flip. The value of this is that we can run multiple test migrations with real production traffic before making one final leap.

How we approached the production migration

Four weeks prior to the final, planned move, we ran our first actual production flip. The idea was to run production traffic on our new AWS infrastructure to smoke test the environment with actual customer traffic. After 90 minutes we would flip back to our data centres and spend the subsequent days analysing the telemetry data to try and identify any issues or unexpected performance bottlenecks. We had already spent a large amount of time undertaking load tests on the new infrastructure, so we were actually pretty confident that Aurora would be able to handle the load as well as our MySQL clusters do.

Flip One

Our first attempt was called off at the eleventh hour, just as we were finishing the very last checks on our pre-flight checklist. We hit two issues that hadn’t manifested in our testing on our staging environment. The first: our ECS tasks failed to deploy because they were unable to correctly decrypt secrets. This turned out to be a size limitation in our implementation: although we had tested on staging, some production secrets were longer, about 24 bytes over the limit, while our staging secrets were about 4 bytes under it!

Take Two

The second issue from that first attempt manifested itself as a missing-credentials error for one of our deployment tasks. During deployment to AWS, an ECS task is launched that uses elastic-whenever to create the scheduled tasks that launch Fargate tasks. After triaging the issue we realised the error was misleading: we were actually hitting an AWS API rate limit when elastic-whenever attempted to schedule the tasks. Again we had tested on staging, but with ten fewer tasks than we were now attempting to schedule in production.
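
For context, elastic-whenever reads a whenever-style schedule file and registers each entry as a scheduled rule that launches a Fargate task. A minimal, made-up schedule file looks like the sketch below (the job and rake task names are hypothetical, not our real schedule):

```ruby
# config/schedule.rb (illustrative): elastic-whenever reads a whenever-style
# schedule and registers each entry as a scheduled rule that launches a
# Fargate task. The job and rake task names here are made up.
every 1.day, at: "2:00 am" do
  runner "NightlyReconciliationJob.perform_now"   # hypothetical job
end

every :hour do
  rake "bank_feeds:poll"                          # hypothetical rake task
end
```

Each entry turns into its own set of AWS API calls at deploy time, which is roughly why a handful of extra tasks in production was enough to tip us over the rate limit where staging was not.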

Once these issues were fixed, the second attempt didn’t go quite according to plan either! While there were no issues with the flip process itself, there were a couple of minor issues that affected our monitoring – we couldn’t see Real User Monitoring in DataDog, for example. We also found an SSL-related misconfiguration which meant some background jobs failed. These were re-run after we flipped back, so customers were not affected.

Third time lucky

The third attempt went even less according to plan! As part of the migration we were moving from long-lived virtual machines to containers that are recreated on each deploy. This requires a different deployment strategy, so we elected to go with Harness for AWS while retaining our existing pipeline in parallel.

On the night of the flip we pushed out a test deployment to AWS and it failed due to Harness being unable to connect to its delegates, which are themselves ECS containers.

After some investigation it appeared that the delegates had been restarted due to some ECS maintenance. While new containers had been provisioned automatically, Harness couldn’t see them. We called off the flip, again at the eleventh hour, and investigated the issue further the following morning. 

It turned out that these containers hadn’t been restarted in some time and were still running with our trial configuration details rather than the required production configuration. This difference only manifested itself once the containers were restarted.

Go Forth And Conquer

We scheduled the next flip a few days later and this fourth flip exceeded expectations! 

As with the previous flips, we only planned to move to AWS for 90 minutes, to make sure we had fixed all the issues from the first flip and to be on the lookout for anything else we had missed. Everything looked good: the monitoring was working as expected and the SSL issues were resolved.

We did, however, spot one performance-related issue with bank statement imports. We knew that in AWS the latency between the application servers and the database would increase by a few milliseconds. For the most part this isn’t much of an issue, however some bank statement imports were taking 2-3x longer than before. Looking at the traces in DataDog, it became obvious what the problem was: this particular job was importing 115 bank transactions, which generated 3801 spans, 1192 of which were database queries. Tack a few milliseconds onto each query due to the increased latency and, multiplied across more than a thousand queries, you’re suddenly adding whole seconds to the job!

Seasoned Rails developers will probably recognise this as the “N+1 query problem”. In this case it wasn’t considered a major issue since these jobs run in the background, but we still felt the need to fix this before the final flip.
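
As a generic illustration of the fix (the model and association names below are hypothetical, not our actual schema), preloading the association collapses the per-row queries into a couple of batched ones:

```ruby
# Hypothetical ActiveRecord models, for illustration only.
# Before: one query for the statement lines, then one extra query per line
# for the associated record touched inside the loop (the N+1).
statement.bank_transactions.each do |transaction|
  transaction.category.name   # triggers an extra query per transaction
end

# After: preload the association so the whole loop is served by two queries,
# and the extra per-query network latency is only paid a couple of times.
statement.bank_transactions.includes(:category).each do |transaction|
  transaction.category.name   # no additional queries
end
```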

The final migration

Having completed a successful flip on Thursday 26th Nov, we scheduled the final migration for the following Tuesday, 1st Dec. Having run the process four times in the previous three weeks, this particular evening, while being a hugely momentous one for the business, followed the exact same process and ran even more smoothly than the previous four. 75 minutes after starting the migration, everyone said goodnight and went to bed! You might think this was something of an anti-climax, which is great because that’s exactly the result we wanted – no drama!

Celebration 🎉

Getting FreeAgent into AWS was the primary goal of a project we kicked off almost two years ago, and it’s something we’ll take the time to celebrate. It took an enormous amount of effort, both in time invested in the implementation and in significant on-the-job upskilling, from a team of nearly 20 people in total.

Not only have we migrated our infrastructure, in flight, with virtually zero impact on our 100,000+ customers, but we have also managed to do it without materially changing the workflow or productivity of our 100+ engineering staff. A project of this magnitude could easily have distracted many of our engineering teams and had a significant impact on company objectives, but our cloud migration team were very aware of this from the start and did an incredible job ensuring minimal, if not negligible, impact on our engineers. It’s a remarkable accomplishment!

Next steps

There is still work to do in the new year to fully decommission our data centres (does anyone want to buy 50 Dell hypervisors? 😊), and then work will start on the next phase of the project. We’ll be evaluating the usage patterns of our customers and trying to better understand how the new infrastructure operates at scale. We will be looking at cost optimisations and at reducing unnecessary operational overhead. Finally, we will be embarking on longer-term work to look at how AWS services can be used to our advantage. One of the big draws of being in AWS is the toolkit we now have at our disposal, and we’ll be leveraging it to deliver customer impact and increased productivity within our engineering organisation.

Thank you to everyone involved. It’s been an incredible, humbling experience to watch so many talented people deliver an insanely complex project and make it look easy 🙏

Answering bigger questions with BigQuery

Posted on December 1, 2020

Over the past few weeks, we’ve configured BigQuery to enable us to combine our Google Analytics (GA) front-end data with our internal back-end data. In this post I’m going to talk about why we needed to do this, how we went about it and what we are hoping to achieve as a result.

What’s the problem?

Historically, two separate systems have been used at FreeAgent to track, store and analyse data about users’ behaviour: GA and an internal system. We’ve used GA for front-end data, such as the browser through which a user accesses our webapp or the medium through which a user landed on our website (paid search, organic search, email, etc.). On the other hand, we’ve used the internal system for back-end data related to a user’s change in state: a user has enabled a bank feed in our app, a user has cancelled their subscription and so on. There’s been very little overlap between the two systems.

In many cases, having two separate heaps of data is fine. If we want to know whether the pricing page on our website is effectively communicating our pricing plans to prospective customers, the GA data alone tracks users viewing the page, clicking on buttons and signing up (or not). We can view this in GA. On the other hand, if we want to know whether use of the invoice features in our app is increasing over time, our internal data alone tracks how many invoices are created and how many customers we have. We can view this in our business intelligence platform, Looker.

Our previous setup

However, as our reporting and analysis answer more and more of the easy questions, the complex ones become more common. What impact does a prospective customer clicking on a banner on our website (GA data) have on their likelihood of using all of the features our app offers (internal data) six months later? A banner on our site might appear to drive lots of signups, but if those signups are less likely to use all of our features and more likely to churn then it may not be as good a banner as the GA data alone had suggested. The two sets of data complement each other and tell a fuller and truer story together than they do in isolation.

What’s the solution, and how did we implement it?

With a strong desire from the rest of the business to have answers to these more complex questions, we decided that it was important to find a robust and seamless way to combine data from the two sources and make it readily available in Looker.

The goal

We’ve technically been combining these sources of data for some time using the GA API. For any given new customer signup, we used the API to extract the corresponding source, medium and campaign (such as google, paid search, brand campaign). However, the level of detail and configurability of the GA API leaves a lot to be desired. 

Enter BigQuery. First and foremost, BigQuery is a cloud data warehouse and part of the Google Cloud Platform. However – and most importantly for us – GA can be linked to BigQuery to export granular GA data on a regular basis. This data can be unnested to a ‘hit’ level – essentially showing every move a tracked user makes on our website. This level of granularity far exceeds what the GA API offers, and a few lines of SQL can unlock some powerful insights.
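
As a rough illustration of what hit-level data looks like (the project and dataset names below are placeholders, not our real ones), a single day’s export table can be flattened with UNNEST; here it’s wrapped in the google-cloud-bigquery gem so the SQL can be run from Ruby:

```ruby
# Illustrative only: flatten one day of the GA export to hit level.
# The project and dataset names are placeholders.
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "example-analytics-project"

sql = <<~SQL
  SELECT
    fullVisitorId,
    visitStartTime,
    hit.hitNumber,
    hit.type          AS hit_type,
    hit.page.pagePath AS page_path
  FROM `example-analytics-project.example_ga_dataset.ga_sessions_20201130`,
       UNNEST(hits) AS hit
  ORDER BY fullVisitorId, visitStartTime, hit.hitNumber
SQL

bigquery.query(sql).each { |row| p row }
```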

We set up the ‘GA to BigQuery’ regular export link by following this guide and we found that Google has made this a simple and seamless process. This means that each morning at around 8am the previous day’s GA data lands in our BigQuery project.

Ta da!

It would then have been very simple to hook Looker up directly to BigQuery to access the granular GA data. However, this would essentially be recreating GA’s reporting capabilities in Looker – something that Looker is neither designed for nor optimised for. What we needed to do was take cleaned snippets of the GA data and combine them with our internal data. This ‘cleaning’ and ‘combining’ process is more complex than Looker can handle alone, and we decided that it should take place in a cloud data warehouse.

Since we already use Redshift as our primary cloud data warehouse tool and the rest of the organisation is heavily invested in the AWS route, we had no desire to rebuild our entire reporting infrastructure in BigQuery. However, GA doesn’t offer a direct-to-Redshift export. This means that to combine our GA data with our Redshift-based internal data we must involve BigQuery.

We decided to use Matillion, which already handles our regular data extract, transform and load (ETL) processes, to extract snippets of the GA data from BigQuery, transform those snippets into a shape that’s useful for us and load them into Redshift. 

We built three core tables from the GA data:

  1. A sessions table, where each row represents an instance of a user visiting our website or webapp
  2. An events table, where each row represents an event that occurred on our website or webapp (such as a user clicking on a banner)
  3. A pageviews table, where each row represents a user viewing a page on our website or webapp.

In addition to these, we built some mapping tables that allow us to tie the GA data to our internal data. All GA data comes with a session_id, all internal data has a company_id and some GA data has both a session_id and a company_id (tracked as a custom dimension in GA) which allows us to tie the data together.
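
Once the tables are in Redshift, tying a GA session back to a company is then a straightforward join through a mapping table. The sketch below uses simplified, made-up table and column names, and the pg gem (Redshift speaks the Postgres wire protocol):

```ruby
# Illustrative only: join GA sessions to internal company data through the
# session_id -> company_id mapping table. Table and column names are
# simplified placeholders, not our actual warehouse schema.
require "pg"

redshift = PG.connect(
  host:     "example-cluster.redshift.example.com",
  port:     5439,
  dbname:   "analytics",
  user:     "looker",
  password: ENV.fetch("REDSHIFT_PASSWORD")
)

sql = <<~SQL
  SELECT c.company_id,
         c.subscription_state,
         s.traffic_source,
         s.traffic_medium
  FROM   ga_sessions            s
  JOIN   ga_session_company_map m ON m.session_id = s.session_id
  JOIN   companies              c ON c.company_id = m.company_id
  WHERE  s.session_start_date >= '2020-11-01'
SQL

redshift.exec(sql).each { |row| p row }
```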

Our new setup

We can then surface these new GA tables in Looker and use the mapping tables, where possible, to tie them to our internal data. In some cases we want to take our internal data and supplement it with snippets of GA data, while in other cases it’s the other way around.

Some ethical considerations

In a post about analysing user behaviour more intricately than we ever have before, it feels important to consider the implications of this from an ethical perspective. 

Some users on our website opt to block GA tracking. For these users, we are unable to tie GA data to our internal data, simply because there isn’t any GA data. However, the majority of users allow this tracking and in doing so are trusting us to use their data in a responsible manner. We need to ensure that we respect both our existing users and our prospective customers by using their data responsibly.

One side of the analysis is signup-focused: we’re analysing the effectiveness of our acquisition activity. We want to ensure that our campaigns are as informative as they can be and that those who sign up for FreeAgent as a result of one of those campaigns do so because FreeAgent is the right product for them. If a user stops using FreeAgent after three months then it suggests that something about the original campaign may not be right and we need to thread that information together so that we can act on it. We undertake analysis on our acquisition in the knowledge that our existing users love our product and so, for those it’s appropriate for, we want to maximise our reach as much as possible.

The other side of the analysis is focused around ongoing usage of our product. It’s important that we understand which of our features are helping our users nail their daily admin and which we need to improve. Again, we do so in the knowledge that we have an award-winning product but we’re also committed to learning how to make it even better.

What are we hoping to achieve?

As I mentioned above, our aim here is not to replace GA as a reporting tool. GA is a product that has undergone years of development to do what it does and it does it very well.

Rather, we want our data consumers across the business in product, marketing and beyond to be able to use Looker to answer more complex questions than is possible at the moment – questions that require both GA and internal data to answer. We also hope that in offering up this combined data, our data consumers will be prompted to think up many other complex questions that they might not have thought of before.

We’ve built up and recorded a backlog of “BigQuery questions” – questions that we knew would be much easier to answer once we’d implemented this new solution. Our team’s goal for this cycle was to answer five questions from this backlog using our new solution.

Beyond this cycle, however, we want to achieve much more. The power of a business intelligence tool like Looker is that our data consumers can ask and answer their own questions without our team even knowing it’s happened. Success, in the long run, is when we see action being taken as a result of this combined data without our team even being aware that the question was asked in the first place.