Answering bigger questions with BigQuery

Posted by Jack on December 1, 2020

Over the past few weeks, we’ve configured BigQuery to enable us to combine our Google Analytics (GA) front-end data with our internal back-end data. In this post I’m going to talk about why we needed to do this, how we went about it and what we are hoping to achieve as a result.

What’s the problem?

Historically, two separate systems have been used at FreeAgent to track, store and analyse data about users’ behaviour: GA and an internal system. We’ve used GA for front-end data, such as the browser through which a user accesses our webapp or the medium through which a user landed on our website (paid search, organic search, email, etc.). On the other hand, we’ve used the internal system for back-end data related to a user’s change in state: a user has enabled a bank feed in our app, a user has cancelled their subscription and so on. There’s been very little overlap between the two systems.

In many cases, having two separate heaps of data is fine. If we want to know whether the pricing page on our website is effectively communicating our pricing plans to prospective customers, the GA data alone tracks users viewing the page, clicking on buttons and signing up (or not). We can view this in GA. On the other hand, if we want to know whether use of the invoice features in our app is increasing over time, our internal data alone tracks how many invoices are created and how many customers we have. We can view this in our business intelligence platform, Looker.

Our previous setup

However, as our reporting and analysis begin to exhaust the easy questions, more complex ones become more common. What impact does a prospective customer clicking on a banner on our website (GA data) have on their likelihood of using all of the features our app offers (internal data) six months later? A banner on our site might appear to drive lots of signups, but if those signups are less likely to use all of our features and more likely to churn, then it may not be as good a banner as the GA data alone had suggested. The two sets of data complement each other and tell a fuller and truer story together than they do in isolation.

What’s the solution, and how did we implement it?

With a strong desire from the rest of the business to have answers to these more complex questions, we decided that it was important to find a robust and seamless way to combine data from the two sources and make it readily available in Looker.

The goal

We’ve technically been combining these sources of data for some time using the GA API. For any given new customer signup, we used the API to extract the corresponding source, medium and campaign (such as google, paid search, brand campaign). However, the level of detail and configurability offered by the GA API leave a lot to be desired.

Enter BigQuery. First and foremost, BigQuery is a cloud data warehouse and part of the Google Cloud Platform. However – and most importantly for us – GA can be linked to BigQuery to export granular GA data on a regular basis. This data can be unnested to a ‘hit’ level – essentially showing every move a tracked user makes on our website. This level of granularity far exceeds what the GA API offers and, with a few lines of SQL, can unlock some powerful insights.
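As a rough illustration, here’s the kind of query the export makes possible. In the export, each row of the daily ga_sessions_ table represents one session, with the individual hits nested inside it as a repeated record that can be flattened with UNNEST. This is a sketch in BigQuery standard SQL – the project and dataset names are placeholders, not our real ones:

```sql
-- Flatten one day's exported sessions down to hit level.
-- `our-project.ga_export` is a placeholder project/dataset.
SELECT
  fullVisitorId,
  visitId,
  TIMESTAMP_SECONDS(visitStartTime) AS session_start,
  hit.hitNumber,
  hit.type,           -- e.g. 'PAGE' or 'EVENT'
  hit.page.pagePath
FROM
  `our-project.ga_export.ga_sessions_20201130`,
  UNNEST(hits) AS hit
ORDER BY fullVisitorId, visitId, hit.hitNumber
```

Each output row is a single hit, in order, within a single session – exactly the “every move a tracked user makes” granularity described above.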

We set up the ‘GA to BigQuery’ regular export link by following this guide and we found that Google has made this a simple and seamless process. This means that each morning at around 8am the previous day’s GA data lands in our BigQuery project.

Ta da!

It would then have been very simple to hook Looker up directly to BigQuery to access the granular GA data. However, this would essentially have recreated GA’s reporting capabilities in Looker – something that Looker is neither designed for nor well suited to. What we needed to do was take cleaned snippets of the GA data and combine them with our internal data. This ‘cleaning’ and ‘combining’ process is more complex than Looker can handle alone, so we decided that it should take place in a cloud data warehouse.

Since we already use Redshift as our primary cloud data warehouse tool and the rest of the organisation is heavily invested in the AWS route, we had no desire to rebuild our entire reporting infrastructure in BigQuery. However, GA doesn’t offer a direct-to-Redshift export. This means that to combine our GA data with our Redshift-based internal data we must involve BigQuery.

We decided to use Matillion, which already handles our regular data extract, transform and load (ETL) processes, to extract snippets of the GA data from BigQuery, transform those snippets into a shape that’s useful for us and load them into Redshift. 

We built three core tables from the GA data:

  1. A sessions table, where each row represents an instance of a user visiting our website or webapp
  2. An events table, where each row represents an event that occurred on our website or webapp (such as a user clicking on a banner)
  3. A pageviews table, where each row represents a user viewing a page on our website or webapp
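As a sketch of how one of these tables might be derived (the table and column names here are illustrative, not our actual schema), the events table can be built from the export by keeping only event-type hits:

```sql
-- Illustrative sketch: derive an events table from the GA export.
CREATE TABLE events AS
SELECT
  -- A common convention for a session identifier; assumed here.
  CONCAT(fullVisitorId, '-', CAST(visitId AS STRING)) AS session_id,
  TIMESTAMP_SECONDS(visitStartTime)                   AS event_time,
  hit.eventInfo.eventCategory                         AS event_category,
  hit.eventInfo.eventAction                           AS event_action,
  hit.eventInfo.eventLabel                            AS event_label
FROM
  `our-project.ga_export.ga_sessions_*`,
  UNNEST(hits) AS hit
WHERE hit.type = 'EVENT'
```

The sessions and pageviews tables follow the same pattern, filtering and aggregating the unnested hits at the appropriate level.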

In addition to these, we built some mapping tables that allow us to tie the GA data to our internal data. All GA data comes with a session_id, all internal data has a company_id and some GA data has both a session_id and a company_id (tracked as a custom dimension in GA) which allows us to tie the data together.
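Once everything has landed in the warehouse, the tie-up is a straightforward join through a mapping table. A hedged sketch – all table and column names here are placeholders for illustration:

```sql
-- Sketch: combine GA session data with internal company data
-- via a session_id -> company_id mapping table.
SELECT
  s.session_id,
  s.medium,                  -- GA side: how the session arrived
  c.subscription_status      -- internal side: what the company went on to do
FROM ga.sessions            AS s
JOIN ga.session_company_map AS m ON m.session_id = s.session_id
JOIN internal.companies     AS c ON c.company_id = m.company_id
```

Sessions without a company_id custom dimension (for example, visitors who never sign up) simply drop out of the inner join, which is why the mapping is only possible “where possible”.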

Our new setup

We can then surface these new GA tables in Looker and use the mapping tables, where possible, to tie them to our internal data. In some cases, we want to take our internal data and supplement it with snippets of GA data, while in other cases it’s the other way around.

Some ethical considerations

In a post about analysing user behaviour more intricately than we ever have before, it feels important to consider the implications of this from an ethical perspective. 

Some users on our website opt to block GA tracking. For these users, we are unable to tie GA data to our internal data, simply because there isn’t any GA data. However, the majority of users allow this tracking and in doing so are trusting us to use their data in a responsible manner. We need to ensure that we respect both our existing users and our prospective customers by using their data responsibly.

One side of the analysis is signup-focused: we’re analysing the effectiveness of our acquisition activity. We want to ensure that our campaigns are as informative as they can be, and that those who sign up for FreeAgent as a result of one of those campaigns do so because FreeAgent is the right product for them. If a user stops using FreeAgent after three months, that suggests something about the original campaign may not have been right, and we need to thread that information together so that we can act on it. We undertake this analysis in the knowledge that our existing users love our product, and so, for those prospective customers it’s right for, we want to maximise our reach as much as possible.

The other side of the analysis is focused around ongoing usage of our product. It’s important that we understand which of our features are helping our users nail their daily admin and which we need to improve. Again, we do so in the knowledge that we have an award-winning product but we’re also committed to learning how to make it even better.

What are we hoping to achieve?

As I mentioned above, our aim here is not to replace GA as a reporting tool. GA is a product that has undergone years of development to do what it does and it does it very well.

Rather, we want our data consumers across the business in product, marketing and beyond to be able to use Looker to answer more complex questions than is possible at the moment – questions that require both GA and internal data to answer. We also hope that in offering up this combined data, our data consumers will be prompted to think up many other complex questions that they might not have thought of before.

We’ve built up and recorded a backlog of “BigQuery questions” – questions that we knew would be much easier to answer once we’d implemented this new solution. Our team’s goal for this cycle was to answer five questions from this backlog using our new solution.

Beyond this cycle, however, we want to achieve much more. The power of a business intelligence tool like Looker is that our data consumers can ask and answer their own questions without our team even knowing that it’s happened. Success, in the long run, is seeing action taken as a result of this combined data without our team even being aware that the question was asked in the first place.


About the author…

Jack is a business intelligence analyst at FreeAgent, helping everyone in the company measure what they need to and get those measurements into Looker. He joined the analytics team in February 2020.

