The legendary data warehouse application

Posted by on March 24, 2022

Warehouse

In spring 2014, I created an internal Ruby on Rails app called data-warehouse. Known for the last few years as “the legacy data warehouse” we finally deactivated the app earlier this month.

In this blog post I’m going to describe what it was, how it came to be and why, against what would appear to be all reasonable logic, it became the central part of our data infrastructure for about eight years.

2014 – In the data cave

I had been at FreeAgent for a few months when I started building an app to make it easier to share answers to some of the typical questions I was asked by the marketing team day to day.

The answers to these questions typically involved using APIs to extract data from systems like Salesforce and Google Analytics, as well as the results of models and simulations stored in a MySQL database on my local machine. With FreeAgent being a Rails app, it seemed like a sensible idea to learn and use the framework that everyone else was using, even if it wasn’t a typical choice for the problem in hand.

After a bit of local experimentation to get to grips with whenever and Rake to schedule everything, I got the app running on my iMac 😎

Looking back, the first year or so was really about establishing what kinds of data we needed to collect and what kinds of questions needed to be answered. The app provided a useful focal point for that, as well as being an opportunity for me to learn a lot about software engineering.

2015 – 2017 – The golden age

Over the previous year the demand for the app had been clearly proven. When it was time for us to relocate from Torphichen Street to our current office in Fountainbridge in summer 2015, we moved the app to a proper hosting solution on our hardware in the bunker. So yeah, I did run the app on my iMac for over a year. The iMac did not recover.

Between 2015 and 2017 the codebase increased by a factor of around five and was by now extracting, transforming and storing data from Salesforce, Google Analytics, Marketing Automation systems, Delighted NPS as well as consuming event data from the FreeAgent application via RabbitMQ.

Number of lines of code over time
Produced with git-of-theseus

With a bit of custom JavaScript, the app now allowed users to explore datasets like customer subscription simulations and the structure of customer marketing journeys through interactive dashboards.

Example interactive dashboards

While the interactive dashboards allowed some degree of “self serve” access to pre-defined datasets, it started to become clear that something else would be needed as the company grew larger. The main challenges we faced were:

  • The development time to create a new view, dashboard or visualisation could be days.  While this had been an improvement on the “run it by hand” approach before the app existed, it was now too slow to keep up with demand.
  • The original secret weapon of being able to benefit from the experience of the wider engineering team when I was the only data scientist wasn’t so helpful now that we started to recruit for new specialist analytics roles. Rails development and data analyst skills are a fairly rare combination.
  • Some of the tables were getting into the tens of millions of rows. While this isn’t exactly huge, some queries were starting to become slow and it was apparent that a row-based data store probably wasn’t appropriate anymore.

By spring 2017, we’d started to investigate more standard cloud-based alternatives that fit the more typical description of a data warehouse. During one of our company hack days I worked with Rike Jones in the marketing team to evaluate a couple of free BI tools on top of Amazon Redshift with a few example datasets from the data-warehouse app loaded in.

The first tool we tried was a free SQL-based tool called re:dash. It was a brilliant tool as it gave us the ability to create and share SQL queries, plus it had enough visualisation and dashboarding built in to meet almost all of our use cases. We adopted it immediately.

Self-serve reporting really started to sky rocket around the business. Analytics team members and business users – with a bit of SQL training – could now create reports and dashboards in minutes that would previously have taken days. It was around this time that we started referring to the data-warehouse application as the Legacy Data Warehouse.

2018 – 2022 – Removing things is hard

By now having a few users around the business with some SQL skills wasn’t enough to keep up with demand so we started searching for a BI tool that would allow non-technical users to work with data more effectively. In autumn 2018, we adopted Looker on top of our Redshift data warehouse and are still very happily using it today.

The legacy data warehouse was the first significant Rails app that we relocated from the bunker to AWS as part of our cloud migration project in 2019. The app was a non-trivial workload that allowed our platform engineering teams to perfect the methods they used to successfully migrate FreeAgent from co-lo data centres to the cloud with zero downtime in 2020.

Surely the legacy data warehouse should be redundant by now? Although the UI was removed the app still provided one central point for extracting and storing data from all the relevant source systems, and copying it to Redshift. During these couple of years we even added new data sources such as Zendesk and internal metrics collected from application logs and Slack.  Being a Ruby app that most of the engineering team knew how to contribute to was still useful after all.

Our general philosophy was to get the data model right upfront and do basic transformations in Redshift to keep the BI modelling layer in Looker as simple as possible. As a result, the total number of lines of code in the app started to increase again as it was used more and more to orchestrate the execution of SQL to transform data in Redshift.

In an example of classic advice not followed, we had created our own ETL orchestration tool. We knew that something else would be needed as these data pipelines grew in complexity from simple copying from A to B. But again the app had done the job, allowing us to quickly generate value for the business and proving the need for more tooling.

We started evaluating ETL tools and opted for Matillion in 2019 but it was still a couple of years before we fully switched off the legacy data warehouse. It sounds like it should have been faster but I think there lies the difficulty – we were not moving to a new tool while the world stood still around us. We were migrating the old requirements while implementing new ones. And there was always more to do!

It took around five years from deciding that the data-warehouse app probably wasn’t needed to deactivating it completely. During that time – more than half its total life time – it continued to be a useful and integral part of our data infrastructure. Perhaps there’s a rule in there somewhere about how long it takes to decommission systems. Lifetime to date plus a year?

Reflections

The data-warehouse application fulfilled many different roles. What it did changed over time, as the company grew from around 50 to 250 staff. 

I learned some Ruby on Rails by building it. Half a dozen or so business users explored interactive dashboards with my slightly dodgy UI in the early years. Finally, it was a central point of ingestion of data from third party services and an orchestration tool for our ETL processes, creating the datasets used by around eighty business users every week in Looker.

Would I advise anyone else to create a Rails app to do any of these things today? Almost certainly not! Many simple cloud-based tools are available to do these jobs with minimal effort.  

Did it make sense as an option eight or nine years ago? Probably borderline at the time but definitely in retrospect. Here’s why I think it was successful:

  • Picking a common technology mattered. In terms of infrastructure and development tooling support, working with a common technology to the wider engineering team mattered more than that technology being most optimal for the problem in hand. We could demonstrate value quickly and migrate to more appropriate tools later once we understood the problems better.  
  • Caring about testing mattered. Right from the start we had good programmatic test coverage. This made it easier for new developers to understand what the application was supposed to do, confidently make changes and minimise mistakes. As the requirements changed over time we were able to remove code that was no longer required with less concern about breaking something.
  • Caring about code quality mattered. I sometimes get the impression that writing good quality code with clarity of intent is not valued as much in data projects as typical software engineering. Requiring code reviews from the start and using tools like CodeClimate encouraged us to be proud of a codebase that wouldn’t scare off potential contributors. By this year 42 people had contributed to the app codebase.

Tagged ,

About the author…

Dave Evans was the first data scientist at FreeAgent and currently leads the data science and analytics teams. Before joining FreeAgent, Dave worked at the University of California, San Diego and CERN.

See all posts by Dave

Leave a reply

Your email address will not be published.