We’ve all been there – it’s a couple of days or even the night before an important deadline and one of the following happens:
- You lose your memory stick with all your work on it and scramble to recreate it but can’t get it back to the way it was.
- You make some more ‘advancements’ to your work and suddenly everything is broken and you can’t seem to get it back to the state where it was functioning as before.
I was in my second year of university when the first scenario happened two days before I was due to hand in a piece of coursework that I was particularly proud of. I remember that horrible sinking feeling when I realised that I had no clue where my little blue memory stick that usually lived on my keys was, and therefore had no clue where my coursework source code was. I promptly wrote my lecturer an email pleading for another few days to recreate the work I had lost and then sat in the computing suites multitasking between rapidly refreshing my inbox awaiting her reply, biting my nails, and trying to remember what my previous version of the code looked like.
After what felt like an eternity, but was more realistically about an hour and a half, a response came in – not the one I wanted, though. It read along the lines of:
“I’m sympathetic to your situation, Lea, but I cannot grant you any leniency here. After all, you are a second-year computing student and you really should be backing up your work at this point. Haven’t you heard of version control or GitHub?”
I had not. I can tell you one thing though: I have not created a project since which hasn’t lived on GitHub or been backed up somewhere else, and I will most likely never make that mistake again.
What is version control?
Imagine you’re walking down a path and you come to a junction. You think that taking a right turn is going to make your walk much nicer. It looks sunnier and the grass is green, so you turn off to the right. After walking for a while down this new route, you hit a river and you realise you can’t progress and need to go back to the junction and make a different decision. However, the only way for you to get back to the junction is to retrace your entire journey step by step – sounds tricky, right?
Similarly, when writing code, sometimes changes and advancements to your code can work out great, but sometimes you’ve overlooked where your changes could create a problem, and you need to go back. This is where the beauty of version control comes in, because the version control software tracks and stores each version of your code (each step you make). So we could easily jump back to the version where we’re standing at the junction, and then choose not to take a right, or rather a wrong, turn.
A popular version control software is Git, often used via GitHub, which is an internet hosted version control site for tracking software development. We use GitHub at FreeAgent to track changes to most of our code-based projects.
What is Matillion?
Matillion is the ETL (extract, transform, load) tool we use at FreeAgent for loading our data into the database. We use Matillion to load and transform our data into the best structure for reporting, meaning that the state of Matillion at any given time needs to account for the structure of our database at that time. We are constantly making changes in Matillion to include more data in our database and make adjustments to data that already exists, so it makes sense for us to keep track of those changes.
Matillion has a great user interface for visualising components in the database:
This makes it quite easy to see the different steps happening in any job without having to deal with complicated SQL. The downside of this is that there is no readable ‘code version’ of the state of Matillion at any given time, making it difficult to track changes made or search through code to see where a particular component or table is used in different jobs. This is why the ability to do a global search of Matillion is useful, because we can search through the code to find specific things we are looking for. The other main reason is related to the story I told in the beginning: it’s always good to have a backup of code in case of an error. Before the parser was implemented, if something broke due to a change in Matillion, we would have to rack our brains to figure out all the changes we made the day before and what could have caused the problem.
Matillion actually does have a git integration which the Analytics team set up when we started using the tool, but unfortunately it outputs a single very long line for each file which is almost unreadable and makes it really hard to see what’s going on:
Our solution? The Matillion parser.
How does the Matillion parser work?
The parser is a Python3 script that uses the Matillion and GitHub APIs to pull data from Matillion, transform it into a readable format and push it to GitHub. Below is an overview of the steps in the process:
- First, the script calls the Matillion API to get a list of the names of all our orchestration and transformation jobs. Then, for each of those jobs, it gets the JSON of content from the request and formats the JSON in a readable way using the following python json.dumps() function with the ‘indent’ argument:
The data is then encoded in a format that can be read and compared with content from the GitHub API, which requires base64 encoding and a secure hash.
For the content to match correctly when checking to see if a file has changed, it needs to be encoded and decoded in UTF-8 as well, so the following line is used to encode the JSON from the Matillion API to compare to the GitHub API:
- The parser then makes a call to GitHub to check if the corresponding file for that job exists already, and if it does, whether it needs to be updated or not.
- If a file needs to be created or updated then the parser makes a PUT call to the GitHub API to update or create as required. Finally, the script does a check to match the most recent list of jobs in Matillion with the current list of corresponding files in GitHub to see if there are files in GitHub that are no longer in the Matillion output, then it makes a call to delete those files – this keeps the repository up to date and not cluttered with files corresponding to jobs we no longer use.
How we use the parser to improve our time with Matillion.
So how does any of this help us? There are two main reasons:
- We can track all of the changes we have made to Matillion and see when they were made.
This reason, which luckily we haven’t needed to use the parser for yet, relates to my initial story – needing backup. Say that one day we turn on our laptops and see that there have been some alerts about something in our database being broken which relates to changes that have been made in Matillion. Now we can quickly and easily check the parser to see all of the changes that were made the previous day in the GitHub commit history:
This means that we will hopefully, quite quickly, be able to find the problem with less investigation and reliance on memory, and make the appropriate adjustments to the database and fix the problem.
- We can treat our Matillion setup as a searchable codebase.
This reason, which helps prevent the need for the first reason, is the ability we now have to search the GitHub repository. Now we can easily search the name of a component or table and get a list of every single place that component is used. In other words, we are far less likely to accidentally break something because we can know, with certainty, which other jobs will be affected by a change we are about to make and can preemptively make the appropriate changes before something gets wrongly affected and broken. This has already saved us time, as we no longer have to manually look for where a component is repeatedly used and hope that we don’t miss anything!
Control freaks? Or just common sense?
So maybe my initial title was a bit harsh. We aren’t control freaks – we just wanted a better solution to monitoring what was changing in our database ETL tool so we don’t end up like poor second-year Lea, staying up till the early hours of the morning having a near-breakdown over a misplaced semicolon. Now – thanks to the Matillion parser – we won’t.