August 2020 – Grinding Gears

What is a Business Intelligence Analyst?

Posted by Jack Gladas on 27 August 2020

The term Business Intelligence Analyst (or BI Analyst for short) can be a confusing one. Broadly speaking, the role of a BI Analyst can overlap with lots of other job titles that you might see out in the wild, such as:

Data Analyst
Data Scientist
Insight Analyst
Product Analyst
Marketing Analyst
Commercial Analyst
Reporting Analyst
MI Analyst
Web Analyst
CRM/Customer Analyst

In general – and especially in the data world – job titles can often be inconsistent and subjective. A Data Scientist at one company might do exactly the same work as a CRM Analyst at another company. Role descriptions offer a lot more insight into what a job title actually means, although there can even be some level of mismatch between a role description and what it entails! With all of this confusion in mind, it’s no wonder that people sometimes have no idea what a BI Analyst actually is!

So what does a BI Analyst actually do?

I can’t speak for every BI Analyst, but here’s my take on what the role entails here at FreeAgent:

We work with teams across the entire business (marketing, product, support, commercial etc.) to help them get as much as they can out of our data. We support the wider team in:

Interpreting data to inform decision-making
Deciding what success looks like in quantitative terms
Transforming data to be consumed by other tools
Building dashboards so that teams can digest information as easily as possible
Collecting new data that allows teams to answer new questions

I say we ‘support’ teams because we often don’t actually deal with the data analysis itself. The level of support we provide varies; it can be as little as giving somebody access to our data exploration tool (Looker) so that they can build their own dashboards or as in-depth as helping a team formulate and measure their OKRs.

But how does a BI Analyst differ from the other roles?

As I’ve already mentioned, BI Analyst roles can be described as everything from a Data Scientist to a CRM Analyst. Here are a few things that are key to FreeAgent’s definition of a BI Analyst:

We’re not department-specific

Unlike a Product Analyst, Marketing Analyst or Commercial Analyst, we work with all teams across the business. As a result, we have an understanding of every area across the business, but perhaps have a less specialist understanding of each area compared to those roles.

We don’t do all of the reporting ourselves

Where you might expect a Reporting Analyst to be the person building reports for stakeholders, we see ourselves as the ‘gatekeepers’ of the data – ensuring it’s all fit and ready for reporting. We then enable teams across the business to build their own bespoke reports. This often involves teaching teams about tools, data and data best practices.

We don’t build advanced statistical models

Whereas a Data Scientist’s day might be spent deep in statistical modelling, we’re less involved in the technical aspects of those models. However, we do tend to have a statistical literacy that would allow us to identify where such modelling might be applied, or translate the learnings of such modelling.

We own the ETL (extract, transform, load) process

While an Insight Analyst’s role might begin with some clean and tidy data, we’re fully responsible for taking some very raw data and using SQL-esque tools to transform it into a range of clean, tangible and useful data tables. In fact, our team is currently responsible for migrating our data warehouse and ETL infrastructure from its legacy setup to a Matillion/Redshift setup (which is no mean feat!).

We have our finger on the pulse at all times

While you might picture the stereotypical Data Analyst to be crunching numbers without too much involvement in the bigger picture, BI Analysts are embedded in the heart of the business and proactively identifying opportunities to get involved.

Again, it should be stressed that each of those roles can mean a wide range of things – I’m deliberately comparing against the more stereotypical versions of these roles!

Jack of all trades, master of some?

Reading all of the above, you might get the impression that a BI Analyst is a ‘jack of all trades, master of none’. While that might be a little harsh, there are some grains of truth there! Every BI Analyst in our team at FreeAgent has a completely different background, from graphic design to physics to teaching and more. Each background comes with its own unique skillset and expertise, so each of us is definitely a ‘master of some’ trades, but definitely not all of them!

Broadly speaking, a BI Analyst needs some level of mastery in:

Using SQL (and similar tools) to transform data from one shape to another
Visualising data (with Looker, Tableau, Data Studio etc.) in a clear and coherent way
Understanding stakeholders’ business problems from a wide range of contexts, and translating them into something measurable
Exploring data to find trends and anomalies
Communicating insights from data back to stakeholders

Hopefully all of the above has given you a clearer idea of what a Business Intelligence Analyst actually does, and what we specifically do here at FreeAgent. If you have a different insight or perspective into any of the roles or terms I’ve mentioned in this article, please do let me know in the comments!

Transaction Taxonomy: Spending the Summer Studying SVMs

Posted by Michael Wilson on 24 August 2020

A company faces some unavoidably arduous tasks when taking control of their finances. One such task, which currently takes up a lot of time for our users, is explaining bank transactions. This is the process of assigning an accounting category to transactions, which is important both for internal reports generated by FreeAgent and for external submissions, for example to HMRC. At the end of June FreeAgent launched a suite of new automation features. As part of this release we have begun using a machine learning model to automatically explain our users’ bank transactions.

I previously wrote about starting a remote internship at FreeAgent this summer and how I’d be helping the data science team with developments to the above model.

The machine learning model has explained almost 100,000 transactions in the first two months after launching, marking them as ‘for approval’ in the Banking section of the app so that users get a chance to check them over. Over half of the explained transactions have gone through the approval stage with greater than 95% of those left unaltered. We will refer to this estimator of model performance as the precision: when the model makes a prediction, how often is it correct?
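To make the precision estimate concrete, here is a minimal Ruby sketch of the calculation. The `precision` helper and the counts are made up for illustration; they are not taken from the FreeAgent codebase or from real figures.

```ruby
# Precision as described in this post: of the predictions that users have
# reviewed, what fraction did they leave unaltered?
# The method name and the counts below are illustrative assumptions.
def precision(approved_count, unaltered_count)
  return 0.0 if approved_count.zero?

  unaltered_count.to_f / approved_count
end

# Hypothetical numbers: 50,000 reviewed predictions, 48,000 left untouched.
puts precision(50_000, 48_000) # prints 0.96
```

Note that this only measures the predictions the model chose to make; it says nothing about the transactions it declined to explain.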

This is encouraging performance so far but there is room for improvement in terms of the volume of bank transactions the model attempts to explain. In this post I will give an overview of how the model works and how I’ve helped to increase its impact.

Current Model

The process of assigning one out of a given set of accounting categories to a bank transaction is a classification task, which we’ve chosen to tackle using a supervised-learning approach whereby we learn a function which maps from some input features to an output (accounting category) using example input-output pairs. The current model, referred to internally as ‘Banquo’, is a support vector machine (SVM) that takes example bank transaction descriptions and amounts as inputs, alongside the associated target accounting category we hope to learn how to predict. This information is used to find optimal decision boundaries for associating transactions with accounting categories.

An example of a linear SVM trained to separate the blue and red points. Each point has two input features. In this example the resulting optimal decision boundary is the straight line which best separates the two categories. If there were more than two input features then the separating boundary would need to be a hyperplane, but that’s much harder to visualise!

To simplify the problem, due to the high possibility of overlap between some of the 60+ standard accounting categories, the Banquo model in our initial launch to customers only attempts to explain transactions in four thematically distinct accounting categories: Accommodation and Meals, Bank/Finance charges, Insurance and Travel.

As mentioned above, the model inputs are the transaction description and amount for each bank transaction. To be able to feed these into the SVM, we apply preprocessing steps to normalise and extract the sign of the transaction amount and to represent the textual transaction description in a numerical form. This latter step draws on techniques from Natural Language Processing (NLP). We make use of the HashingVectorizer from scikit-learn to efficiently construct a token occurrence matrix from the input transaction descriptions.
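The idea behind a hashing-based token occurrence matrix can be sketched in a few lines of plain Ruby. This is a toy illustration of the hashing trick only, not how scikit-learn's HashingVectorizer is implemented; the bucket count, the tokenisation rule, the CRC32 hash and the helper names are all assumptions made for the example.

```ruby
require "zlib"

# Toy hashing trick: each token is hashed straight to a column index, so no
# vocabulary needs to be stored. Bucket count and tokeniser are assumptions.
N_BUCKETS = 16

def tokenize(description)
  description.downcase.scan(/[a-z0-9]+/)
end

# Returns one row of the token occurrence matrix for a single description.
def hashed_counts(description, n_buckets = N_BUCKETS)
  counts = Array.new(n_buckets, 0)
  tokenize(description).each do |token|
    counts[Zlib.crc32(token) % n_buckets] += 1
  end
  counts
end

# The sign of the amount as a separate feature: -1 money out, 1 money in.
def amount_sign(amount)
  amount <=> 0
end

row = hashed_counts("CARD PAYMENT TO TRAINLINE LONDON")
puts row.sum             # prints 5, one count per token
puts amount_sign(-42.50) # prints -1
```

Two different tokens can land in the same bucket, which is the price paid for not storing a vocabulary; a real vectoriser uses far more buckets to keep collisions rare.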

The next stage of the pipeline is to train the SVM using the preprocessed inputs and associated accounting categories, which we have access to for tens of millions of historical transactions explained by our users. We train an independent binary SVM for each of the four initial accounting categories mentioned above, whereby a boundary is positioned to separate as many transactions belonging to that category from the rest of the transactions as possible. This is known as the one-vs-rest approach to training a multiclass SVM.

The output of the model is a signed score for each of the categories, with the sign indicating the side of the boundary the transaction lies on and the magnitude of the score indicating the distance from the boundary for each respective category. We compare the maximum score with a fixed confidence threshold; if the threshold is exceeded then the transaction is labelled as the corresponding category.
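The decision rule described above can be sketched as follows. The scores and the threshold value are invented for the example (the real threshold is not stated in this post), and `predict_category` is a hypothetical helper name.

```ruby
# Pick the category with the highest one-vs-rest score, but only return it
# when that score clears the confidence threshold; otherwise return nil and
# leave the transaction unexplained. The threshold value is an assumption.
CONFIDENCE_THRESHOLD = 0.5

def predict_category(scores, threshold = CONFIDENCE_THRESHOLD)
  return nil if scores.empty?

  category, score = scores.max_by { |_category, value| value }
  score > threshold ? category : nil
end

confident = {
  "Accommodation and Meals" => -0.2,
  "Bank/Finance charges" => -1.1,
  "Insurance" => -0.8,
  "Travel" => 1.3
}
unsure = {
  "Accommodation and Meals" => 0.1,
  "Bank/Finance charges" => -0.4,
  "Insurance" => -0.3,
  "Travel" => 0.2
}

p predict_category(confident) # prints "Travel"
p predict_category(unsure)    # prints nil
```

Raising the threshold trades coverage for precision: fewer transactions are explained, but those that are should be explained more reliably.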

Adding New Categories

My internship this summer has been centered around improving the performance of Banquo. Given the high precision of the model since launch to customers, the main area for improvement is the volume of transactions Banquo attempts to classify. One of the simplest ways to increase the coverage of the model is to add to the set of four accounting categories Banquo has been trained to make predictions for.

One of my biggest concerns when considering categories to add was introducing a category that could potentially surface conflicting scores with the existing categories – thus negatively impacting the current precision. To study the behaviour of the model when adding codes I split the historic transactions used to train the production model into a training and validation set; the validation set was formed of transactions from March and April 2020 and the training set was formed of transactions between January 2019 and February 2020.
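A date-based split like this one might look roughly as follows in Ruby. The `Transaction` struct, its field names and the sample rows are hypothetical; only the date ranges come from the text.

```ruby
require "date"

# Split transactions into training and validation sets by date, using the
# ranges described above. The struct and its fields are hypothetical.
Transaction = Struct.new(:description, :category, :dated_on)

TRAINING_RANGE   = Date.new(2019, 1, 1)..Date.new(2020, 2, 29)
VALIDATION_RANGE = Date.new(2020, 3, 1)..Date.new(2020, 4, 30)

def split_by_date(transactions)
  training   = transactions.select { |t| TRAINING_RANGE.cover?(t.dated_on) }
  validation = transactions.select { |t| VALIDATION_RANGE.cover?(t.dated_on) }
  [training, validation]
end

sample = [
  Transaction.new("TRAINLINE", "Travel", Date.new(2019, 6, 3)),
  Transaction.new("AVIVA", "Insurance", Date.new(2020, 3, 15)),
  Transaction.new("HOTEL", "Accommodation and Meals", Date.new(2020, 5, 1))
]

training, validation = split_by_date(sample)
puts training.size   # prints 1
puts validation.size # prints 1
```

Validating on a later period than the training data mimics how the model is used in practice, where it only ever sees transactions newer than the ones it was trained on.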

Banquo is only sent transactions that have not already been explained by our hard-coded Guess rules. In order to target the most impactful categories, I checked which transactions in my validation set had already been explained by Guess. The following bar chart shows the top 15 categories which were left unexplained after Guess in March and April 2020.

A bar chart showing the top categories which are left unexplained by hard-coded Guess rules. Notable candidate categories include materials, computer software, accountancy fees and subcontractor costs.

It’s reassuring that three of the current predicted categories are well within the top 10; however, we’d really like to be able to classify some of these other categories.

The goal of my study was to identify some categories which, when added, increase the total number of explained transactions without reducing the overall precision of the model. Some of the categories commonly left unexplained by Guess raised some alarm bells. In particular, Motor Expenses includes transactions with very similar descriptions to Travel, and possibly even Insurance. From the perspective of an end user adding this category would potentially lead to the undesirable scenario that the automation features could appear to be getting worse: “Why is this transaction no longer being explained correctly? It used to work fine!”

I trained a new model for each of a handful of seemingly promising categories; each model predicted the current four accounting categories plus one additional candidate category. One of the candidate categories that looked really promising on the validation data was Accountancy Fees. The number of transactions correctly explained rose by about 8% when we added the category, without the precision dropping.

This was exactly the kind of result I was looking for. Before adding this code to the production model, it was important to monitor the candidate model performance on incoming FreeAgent transactions. The current production model is served on an AWS SageMaker endpoint which is invoked by the FreeAgent app when transactions are imported by the user. We set up a candidate endpoint and sent it the same transactions as the live model in parallel. The resulting predictions are stored for monitoring rather than surfaced to users of the app.

We want the candidate model and current live model to both make the same predictions for the original four codes. In addition to this the candidate model should make precise predictions for transactions belonging to the Accountancy Fees category. Provided that the candidate model fulfills these criteria we will look to promote the candidate model to the live endpoint in the coming weeks, which is a really exciting and impactful addition to have made.
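One way to check the first criterion from the stored predictions is to measure how often the two models agree on the original four codes. The sketch below is an assumption about how such a check could be written; the helper name and sample predictions are hypothetical, and `nil` stands for "no prediction made".

```ruby
ORIGINAL_CATEGORIES = [
  "Accommodation and Meals", "Bank/Finance charges", "Insurance", "Travel"
].freeze

# Fraction of transactions on which the candidate agrees with the live model,
# counting only transactions where the live model predicted an original
# category. Returns nil when there is nothing to compare.
def agreement_rate(live_predictions, candidate_predictions)
  pairs = live_predictions.zip(candidate_predictions)
  relevant = pairs.select { |live, _candidate| ORIGINAL_CATEGORIES.include?(live) }
  return nil if relevant.empty?

  relevant.count { |live, candidate| live == candidate }.to_f / relevant.size
end

live      = ["Travel", "Insurance", nil, "Travel"]
candidate = ["Travel", "Insurance", "Accountancy Fees", "Travel"]

puts agreement_rate(live, candidate) # prints 1.0
```

In this made-up sample the candidate matches the live model everywhere the live model made a prediction, and additionally explains one transaction that was previously left alone.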

Uprooting the Binary Tree

Posted by Pat George on 19 August 2020

You’re a talented software engineer looking for your next position. Maybe you’re eyeing the FAANGs so you start researching how they do their technical interviews and you discover they like to ask algorithm-type questions – Reverse a linked list in place, Print the kth level of Pascal’s Triangle, Does string contain substring – and have you design Dropbox, Twitter, or some other system of their choosing.

Then you apply to FreeAgent. You go through our entire interview process and think “Where was the algorithm question? And why didn’t I have to design Facebook? The big companies are asking these questions, why didn’t you?” It’s true, we don’t ask algorithm questions during our interview process nor do we ask you to design an arbitrary system. Let me explain why.

The Dreaded Algorithm Interview

I’ll be honest with you. I used to ask an algorithm question. Return the longest English word one can generate from a 4×4 grid of letters (the Boggle challenge – feel free to email us your solution). And you know what? Some of us liked the question because, like algorithm questions in general, it can tell you a lot about a candidate in a short amount of time – how they approach a problem, their critical thinking skills, how vocal they are during their thought process, if they understand recursion and time and space complexity, can they code up a simple solution quickly, what they might be like to work with. But these types of questions are stressful for a candidate and can require a lot of studying and practice ahead of time.

At some point we realized one’s ability to solve the Boggle challenge didn’t correspond to the person’s success here. After thinking about why there seemed to be no correlation, we determined that although algorithm-type questions can tell you a lot about a candidate, the things we cared about were slightly different. Additionally, we decided that if none of us liked doing them during our own interviews, why would we subject our future colleagues to them?

While working at FreeAgent you won’t be writing bespoke algorithms or trying to find ways to improve QuickSort, and we’ve never had to reverse a binary tree as part of our day-to-day work. We deal with a complex subject matter (taxes and finance) so being able to model, design, and code an extendable, readable, and maintainable solution is more important to us than if you can remember how to write a depth first search. Taxes are hard enough without throwing spaghetti code into the mix. This means the algorithm questions weren’t a good fit for us or our prospective coworkers.

We give candidates a take-home Ruby challenge instead – write an API given a list of requirements. The challenge shouldn’t take too long but knowing that we all have lives outside of work and hoping to reduce the stress of interviewing, we give candidates as much time as they need to deliver a solution they’re proud of. Once we receive a code submission a group of FreeAgent engineers will evaluate it. Your solution and any follow-up conversations are what help tell us how you approach a problem, how good your coding skills are, and – what we found more important – how you model and test a solution.

The System Design Interview

Many companies have a system design interview where they ask you to design something of their choosing. These are good questions because the interviewer can really dig into specific areas to see your thought process. Do you attempt to simplify the problem before you get started? What assumptions do you make about the requirements? What’s your naive approach? How would you then scale your solution? What are the security risks? These are real questions you should be asking yourself while working on a project so it’s important for a company to get an idea of your abilities in these areas.

Instead of giving you a system to design we change it up a bit in our technical interview by asking you to describe and whiteboard* a system you’ve worked on in the past. We’ll look for you to answer questions like why the system was designed that way, what tradeoffs did you have to make and why you made them. We may challenge you on some of those decisions or offer alternative approaches to see how you go about evaluating them.

We feel we get the same information out of this stage as other companies while using a less contrived situation. The downside is it’s harder on us as interviewers. If I ask you to design a system of my choosing, I’ll likely have a “right” answer in mind already. Our method forces us to get to know your system and then question it. The upside is it helps us see how you approach real-world projects and how you are at explaining a system to someone unfamiliar with it. Both of which are things you may have to do during your actual day-to-day job here.

*Note: If we happen to be in the midst of a global pandemic, instead of having the technical interview in our Edinburgh office, we’ll use video conferencing (but what are the chances of that happening?).

Trade Offs

Now, dear reader, you may be thinking it takes more effort on our part to answer similar questions about a candidate than if we’d used algorithm and contrived system design questions. You’re not wrong. Between getting a group to read through, understand, evaluate, and then chat with you about your code challenge submission, and learning and questioning a system design you’ve chosen we put a lot of time into our hiring process. We’re ok with that. Not only do we want to make the interviews as stress-free as possible for our candidates, we feel this process gives us a more accurate gauge of a candidate’s abilities as they relate to FreeAgent. And, that way, we get candidates that are the best fit for us.

FreeAgent is blessed to be of a size where we can still take the time to evaluate each candidate in this manner. Maybe someday we’ll no longer have the ability to do so but until then put away your copy of Cracking the Coding Interview and instead immerse yourself in the world of good programming practices and system design. We look forward to seeing what you come up with.

The perils of a bad date

Posted by Lorna Noble on 13 August 2020

As you would expect from small business accounting software like FreeAgent, we deal with a lot of dates and times. Ruby on Rails has some really useful helper methods — but there are also a few unexpected quirks in the way that different date and time classes interact. One of those quirks sadly caught me out recently.

Picture the scene

I’ve written tests, the CI pipeline has passed and my pull request has been accepted. The only thing standing between my code and production is some pre-production testing (checks done by another engineer to make sure everything looks good before the code goes live). I’m feeling good!

Then I get a Slack message: Hey Lorna, why am I getting “TypeError: can’t convert Date into an exact number” when I run your code?

Say what? Ruby isn’t supposed to throw type errors! All I’m doing is subtracting one date from another…

To understand what happened, I’ll need to take you on a brief tour of Ruby and Rails and the different ways in which they treat the concept of duration. The problem isn’t unique to Ruby, or Rails, of course — you’ve probably bumped into something similar if you’ve ever tried to work with dates and times in any programming language. And let’s not even mention timezones…

Dates in Ruby

If you’re working with dates in Ruby you have several classes to choose from: Date, Time, and DateTime. Each has advantages and limitations, but one significant difference is the units in which they operate.

Date
Units of days with no concept of hours, minutes or seconds

Time
Units of seconds, counting since the start of the Unix Epoch

DateTime
Units of days, but does include the concept of hours, minutes and seconds
Calendar-based, and understands the concept of calendar reform
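The difference in units is easy to see in plain Ruby (no Rails required) by subtracting two values of the same class. The dates here are arbitrary examples.

```ruby
require "date"

# Date - Date: a Rational number of days.
days = Date.new(2020, 8, 13) - Date.new(2020, 8, 10)
p days     # prints (3/1)

# Time - Time: a Float number of seconds.
seconds = Time.utc(2020, 8, 13) - Time.utc(2020, 8, 10)
p seconds  # prints 259200.0

# DateTime - DateTime: still days, but fractions of a day are possible.
half_day = DateTime.new(2020, 8, 13, 12) - DateTime.new(2020, 8, 13)
p half_day # prints (1/2)
```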

So far, so good. I had chosen to use Date and was subtracting one date from another in order to calculate a number of months:

def self.rounded_up_number_of_months_between(first_date, second_date)
  # Date - Date gives a number of days, which is then scaled to months
  ((second_date - first_date).to_f / 365 * 12).ceil
end

Dates in Rails

Rails ActiveSupport includes a number of helper methods to make working with duration more intuitive. You can use phrases like 1.day or 3.days.ago or 7.days.from_now, or even concepts like beginning_of_quarter.

What caught me out was the difference between date + 3.days and 3.days.from_now.

3.days returns an ActiveSupport::Duration which is compatible with Date, Time & DateTime.

And that’s where it gets interesting.

Adding an ActiveSupport::Duration to an object, or subtracting one from it, returns an object of the same type as the one you started with.

Date.today - 1.day => Date

DateTime.now + 1.minute => DateTime

Time.now - 1.minute => Time

The exception is where you are modifying a Date by a number of hours, minutes or seconds when it will be coerced into ActiveSupport::TimeWithZone (which implements the same interface as Ruby Time).

Date.today - 1.minute => ActiveSupport::TimeWithZone

3.days.from_now returns an ActiveSupport::TimeWithZone which behaves like Time, rather than Date.

This is where my error was coming from.

My colleague had passed in Date.today as the start date and 90.days.from_now, rather than Date.today + 90.days, as the end date for the calculation.

(end_date - start_date) was thus (ActiveSupport::TimeWithZone - Date)

The reason for the error is that Date and Time use different units: days for the Date and seconds for the Time. As a result, Ruby doesn’t know how to interpret what you’ve passed in, and throws a TypeError. In the same way, you can’t subtract a Time from a DateTime.

Date - Time => TypeError: expected numeric

DateTime - Time => TypeError: expected numeric

Time - Date => TypeError: can't convert Date into an exact number
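The Date and Time cases can be reproduced in plain Ruby without Rails. The small `error_class_for` helper below is only there for the demonstration and is not part of any library.

```ruby
require "date"

start_date = Date.new(2020, 8, 13)
end_time   = Time.utc(2020, 11, 11)

# Hypothetical helper: runs the block and reports which error class, if any,
# was raised.
def error_class_for
  yield
  nil
rescue StandardError => e
  e.class
end

p error_class_for { end_time - start_date } # prints TypeError
p error_class_for { start_date - end_time } # prints TypeError
p error_class_for { start_date - start_date } # prints nil
```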

This isn’t the only problem with mixing classes either. Because of the differing units between DateTime, Date and Time, you might end up with a perfectly valid calculation but an answer which makes no sense in the context of your code.

Date - DateTime => Rational number of days

Time - DateTime => Float number of seconds
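One possible guard against both problems is to convert each argument to a Date before subtracting, so the arithmetic is always in days. This is a sketch of that idea in plain Ruby, not necessarily the fix that was shipped; it relies on `to_date`, which `require "date"` adds to Time, and which ActiveSupport::TimeWithZone should also respond to in a Rails app.

```ruby
require "date"

# Normalise both arguments to Date so the subtraction is always in days,
# whichever date-like class the caller passes in.
def rounded_up_number_of_months_between(first_date, second_date)
  days = second_date.to_date - first_date.to_date
  (days.to_f / 365 * 12).ceil
end

# A Date start and a Time end, 90 days apart, no longer raise a TypeError.
puts rounded_up_number_of_months_between(Date.new(2020, 8, 13), Time.utc(2020, 11, 11)) # prints 3
```

The trade-off is that any time-of-day information is discarded, which is fine for a calculation that only cares about whole months.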

TL;DR

In modern Ruby, Time and DateTime are largely indistinguishable. So much so that Ruby offers a helpful rubric for choosing which class to use. In contrast Date and Time are significantly different, and ActiveSupport adds another layer of complexity.

References and Further Reading

https://www.rubyguides.com/2015/12/ruby-time/

https://medium.com/@muesingb/whats-the-difference-between-time-and-datetime-in-ruby-fa3cc844c9d7

https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time

https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/

Grinding Gears

Tales of code crunching from the FreeAgent Engineering team