Twice a year at FreeAgent we hold Hack Days, where the entire company gets a couple of days to work on whatever takes their fancy. We’ve had all sorts of projects over the years, from a text recognition “Receipt Scanner” which automatically creates an expense based on the image of a receipt, to an “Office Olympics” with a frantic bean bag race to decide the winner.
This year, I decided to build something to help us declutter our internal reporting tool – Looker – by automatically identifying and deleting unused content. This would make it quicker and easier for our users to find useful content and also speed up our Looker development (since our changes won’t be validated against as much content). Lea, our Data Science Intern, was interested in the project too and we started by putting together a quick list of things that the Looker Content Cleaner (LCC) should be able to do. This was our list of requirements:
- Identify content that has not been accessed in 90+ days and move it to the Trash folder (“delete”)
- Permanently delete (“purge”) anything that has been sitting in the Trash for 90+ days
- Allow for some content to be marked as “do not delete”, regardless of usage figures (some content is only used if something breaks, or is looked at annually, etc.)
- Retain a record of “what and when” has been deleted
- Run automatically (so we don’t need to remember to run it)
Now that we knew what we wanted the LCC to do, we created a quick plan in Notion. We used this plan to track our progress and make sure we were both working in tandem. That was vital since Hack Days is only two days long – we needed to be quick and efficient! We decided to build the LCC in our ETL tool, Matillion, for the following reasons:
- we could self-serve, without having to involve other teams or spin up new infrastructure (important during Hack Days!)
- it comes with an API Query component so we could interact with Looker’s API
- it would allow us to write results to our database (Matillion’s bread and butter)
- you can schedule jobs to run at the click of a button
Now we had the plan, it was time to build!
A portion of the build was spent configuring the Matillion-to-Looker API connection. This wasn’t difficult, but it was a bit fiddly, so I’ll perhaps cover it in a subsequent blog post. Right now, I want to focus on how we fulfilled the requirements we set for the tool itself.
Requirement 1: Identify unused content and move it to the Trash folder
Identifying unused content was quite simple using Looker’s system__activity explore, so we created a saved report (a look) of content that had not been accessed for 90+ days:
Now we had a list of content that we wanted to delete, the next step was to move this content to the Trash folder. Content in the Trash is still recoverable – you can think of this as a soft delete. So what did the Matillion job that performed this look like? Well, once we had a token to connect to the API, we simply:
- ran the look of unused content to store its results in a database table
- compared that unused list against our “do not delete” list of exceptions, to build a “to be deleted” list of content
- iterated over the content in the “to be deleted” list, set its deleted property to true and logged the action in a table (so we have an audit trail)
Because the “Unused Content” look contained both dashboards and looks, we needed to make one of two API calls, depending on the content type:
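The whole delete step can be sketched in Python. This is illustrative only: the field names (`content_type`, `content_id`, `title`) and the `sdk` object’s `update_look` / `update_dashboard` methods stand in for the real look output and the Looker API calls our Matillion job makes.

```python
from datetime import datetime, timezone


def build_delete_list(unused_content, exceptions):
    """Filter the "unused content" look results against the
    "do not delete" exceptions, returning what to soft delete.

    unused_content: list of dicts with content_type/content_id keys
    (illustrative field names, not the exact look output).
    exceptions: set of (content_type, content_id) tuples.
    """
    return [
        item for item in unused_content
        if (item["content_type"], item["content_id"]) not in exceptions
    ]


def soft_delete(sdk, to_delete, audit_log):
    """Soft delete each item and record an audit row.

    sdk is assumed to expose update_look / update_dashboard methods
    that accept a body with a deleted flag, in the style of Looker's API.
    """
    for item in to_delete:
        # One of two API calls, depending on the content type.
        if item["content_type"] == "look":
            sdk.update_look(item["content_id"], {"deleted": True})
        else:
            sdk.update_dashboard(item["content_id"], {"deleted": True})
        # Log the action so we have an audit trail.
        audit_log.append({
            "content_type": item["content_type"],
            "content_id": item["content_id"],
            "title": item.get("title"),
            "action": "delete",
            "action_at": datetime.now(timezone.utc).isoformat(),
        })
```

In the real job these steps are Matillion components rather than one script, but the flow is the same: filter, branch on content type, update, log.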
Requirement 2: Empty long-deleted content from the Trash
Now we could move unused content to the Trash, we wanted to empty the Trash of long-deleted content (90+ days counted as “long-deleted” for our purposes). This purge operation is permanent, so we needed to be careful here! To do this, we performed the following steps:
- made an API call to search for deleted dashboards and looks
- identified the content that had been deleted more than 90 days ago. This gave us a list of content to be purged
- compared that list against our “do not delete” list of exceptions (not strictly needed, but a sensible bit of defensive programming)
- iterated over the list of content, purged it and logged the action in our audit table
In practice, looks and dashboards have separate API endpoints, so we performed these steps separately for each type of content:
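The selection logic for the purge can be sketched like this, assuming each API result carries a `deleted_at` timestamp (again, illustrative field names rather than the exact API response shape):

```python
from datetime import datetime, timedelta, timezone

PURGE_AFTER_DAYS = 90  # "long-deleted" for our purposes


def select_purgeable(deleted_content, exceptions, now=None):
    """Pick content deleted more than 90 days ago that is not on the
    "do not delete" list. The exceptions check isn't strictly needed
    (deletes are filtered already), but purging is permanent, so we
    check again as a bit of defensive programming.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=PURGE_AFTER_DAYS)
    return [
        item for item in deleted_content
        if item["deleted_at"] < cutoff
        and (item["content_type"], item["content_id"]) not in exceptions
    ]
```

The purge loop itself mirrors the delete loop above, calling the relevant endpoint per content type and writing a “purge” row to the audit table.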
Requirement 3: Allow us to mark content as “do not delete”
In Looker, we have content that is accessed infrequently (or maybe never), but that we still want to retain. For instance, we have a dashboard which displays details on any data load failures and we only look at this in emergencies. We don’t want to delete this dashboard if we don’t view it for 90 days, as we may need it again in the future.
To ensure content such as this wasn’t caught up in the deletes, we stored details in an exceptions table, which looks like this:
This table was then used in Matillion to ensure that content on the “do not delete” list was never deleted.
Requirement 4: Retain a record of “what and when”
Creating this audit trail was pretty simple. Every time we performed a delete or purge action, we wrote the following details to a table in Redshift:
- The type of content (look or dashboard)
- The ID of the content
- The title of the content
- What action was performed against the content (delete or purge)
- When the action took place
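A minimal sketch of assembling one of these audit records (in the real job, Matillion writes these rows to the Redshift table; the field names here are illustrative):

```python
from datetime import datetime, timezone


def audit_row(content_type, content_id, title, action):
    """Build one audit-trail record covering "what and when"."""
    # Guard against anything outside the two content types and two actions.
    assert content_type in ("look", "dashboard")
    assert action in ("delete", "purge")
    return {
        "content_type": content_type,   # look or dashboard
        "content_id": content_id,       # the ID of the content
        "title": title,                 # the title of the content
        "action": action,               # delete or purge
        "action_at": datetime.now(timezone.utc).isoformat(),  # when
    }
```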
Requirement 5: Run automatically
Scheduling a job in Matillion takes under a minute. We scheduled the LCC to run just after midnight. The next morning, I checked the audit log and found that the LCC had purged a look that had been deleted 90 days ago. Hurrah!
Bonus Requirement: Can we report on what the LCC has done in Looker?
As everything went so well, we quickly put together a Looker explore showing what the LCC had done. It simply reads the output log, but it allows our users to see what content has been deleted, and for us to report on its actions in aggregate:
This is what the final LCC looks like in the Matillion UI:
The benefit of both choosing the right tools and having a clear plan was apparent – we were able to get so much done in a short space of time. It also demonstrated that we can make life a little bit easier for our team and users of our data by harnessing the power of the Looker API – this is something I hope to do more with in future.
In terms of the numbers, the LCC has been running in Production for a few days now, with no issues. So far we have deleted over 100 pieces of content and purged over 250. This represents a ~25% reduction in the number of dashboards and ~33% reduction in the number of looks. All things considered, the building of the Looker Content Cleaner has been a great success!