Clean house: clear mind. Clean data: clear findings.

Posted on July 30, 2018

Soon after settling in at FreeAgent and getting to grips with my role as a data science intern, I got the opportunity to present some of the data I had been working on at a ‘town hall’, a company-wide weekly meeting where everyone gets together to present their work, share news and pitch ideas. The data I presented was attitudinal survey data from accountancy practices that had contracts with FreeAgent. This might sound fairly simple, but don’t be fooled! I’d like to explain what had to happen before I could even think about presenting this information: data cleaning.

Janitorial justification

In 2016, IBM estimated that the US lost $3 trillion in GDP due to poor-quality data, and that one in three business leaders did not trust the data sources they were using to make decisions1. One way to minimise these losses and preserve the best possible data quality is data cleaning.

Sometimes referred to as data cleansing, data scrubbing or data washing, data cleaning is defined as ‘the process of detecting, diagnosing and editing faulty data’2. All three of these steps are equally important; I’ll use the analogy of cleaning a kitchen to explain. Not taking the time to detect errors would be like declaring your kitchen clean when you haven’t checked the bins: just because you don’t look for the mess doesn’t mean it’s not there. Not diagnosing the errors would be like not asking anyone why the bins are full: if you don’t find out what or who is responsible, nothing will change in the future. Not editing the errors would be like checking the bins and seeing they were full, but going for a nap instead of emptying them: acknowledging there is a problem and then ignoring it!

Mopping up can be a daunting task…

Considering the contaminants

The methods by which information is collected, recorded, stored and retrieved can all introduce errors into the data, which means that every dataset has its own data cleaning challenges. Although some datasets (like kitchens) are easier to clean than others, the vast majority contain errors of some kind (even new kitchens have dust!). Errors can take many different forms and arise for many different reasons.

Sometimes data cleaning is not about removing errors but about making data interpretable. In large datasets, free text boxes are usually ignored because they are notoriously difficult to interpret. However, free text boxes are valuable sources of information and can provide clues about the types of error to expect elsewhere in the dataset. Even if the free text boxes themselves are difficult to clean, manually inspecting their contents can be very useful during the data cleaning process.
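
As a quick illustration, here is a minimal sketch in base R of how you might eyeball a free text column. The data frame `survey` and its `comments` column are hypothetical names, not real FreeAgent data:

```r
# A minimal sketch, assuming a data frame `survey` with a
# hypothetical free-text column called `comments`.

# Tabulate the most frequent raw responses: recurring stand-ins
# like "N/A", "n/a" or "none" hint at NA misclassifications.
head(sort(table(survey$comments), decreasing = TRUE), n = 20)

# Read a random sample of responses to get a feel for the content.
sample(na.omit(survey$comments), size = 10)
```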

Although it is impossible to identify every error or discrepancy that might occur, the following errors are common across many different types of dataset (a sketch showing how some of them can be tackled follows the list):

  1. Duplications: where one row contains identical or similar information to another row.
  2. ‘NA’ misclassifications: where empty values are misclassified as known values, or vice versa.
  3. Erroneous answers: where incorrect values are entered, either accidentally or deliberately (a common effect of compulsory questions that cannot be answered).
  4. White space and alphabetical case errors: where values appear to mean the same thing but are classified differently due to white space or alphabetical case.
  5. Spelling mistakes, typing errors or special character errors: where values appear to mean the same thing but are classified differently due to spelling mistakes, typing errors or special characters.
  6. Inconsistent time formats: where dates and times have been entered in several different formats, making time-based calculations impossible until they are standardised.
  7. Data merging ambiguity: where information that could be recorded in different columns is merged into one column, making it difficult to interpret.
  8. Data recording ambiguity: where information is recorded in different ways (e.g. as a range of values rather than a single value).
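
To make this concrete, here is a hedged sketch in R of how a few of these errors might be tackled. The names `survey`, `practice_name` and `start_date` are hypothetical, and lubridate is just one of several packages that can parse mixed date formats:

```r
library(lubridate)  # parse_date_time() handles mixed date formats

# A sketch, not a recipe: `survey`, `practice_name` and
# `start_date` are hypothetical names standing in for your own data.

# 1. Duplications: drop rows that exactly repeat an earlier row.
survey <- survey[!duplicated(survey), ]

# 2. 'NA' misclassifications: recode common stand-ins as true NAs.
na_strings <- c("", "NA", "N/A", "n/a", "none", "-")
survey$practice_name[survey$practice_name %in% na_strings] <- NA

# 4. White space and case errors: trim and standardise case so
# that " FreeAgent " and "freeagent" fall into the same category.
survey$practice_name <- tolower(trimws(survey$practice_name))

# 6. Inconsistent time formats: try several orderings and convert
# everything to a single date-time class.
survey$start_date <- parse_date_time(survey$start_date,
                                     orders = c("dmy", "ymd", "mdy"))
```

Spelling mistakes and merged or ambiguously recorded columns (errors 5, 7 and 8) usually need fuzzy matching or a manual lookup table, so they take a little more work.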

Finally, data cleaning should be undertaken carefully so that unanticipated biases are not introduced into the dataset. It is good practice to analyse the data before and after the data cleaning process to see how this affects the results.
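
For example, a couple of quick sanity checks, assuming hypothetical data frames `survey_raw` (the untouched data) and `survey_clean` (the cleaned copy):

```r
# Compare the data before and after cleaning.
nrow(survey_raw) - nrow(survey_clean)  # how many rows did cleaning drop?

colSums(is.na(survey_raw))    # missingness per column before...
colSums(is.na(survey_clean))  # ...and after recoding NA stand-ins

summary(survey_raw)    # do any column distributions shift
summary(survey_clean)  # suspiciously after cleaning?
```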

Winston always takes care when cleaning…

Next time: why, where, who and what… but how?

We’ve looked at why data cleaning is important (it improves data quality), where the errors exist (in most datasets), who needs to consider data cleaning (everyone who performs data analysis) and what errors exist (a huge variety!), but how do we actually go about cleaning the data? This can be a daunting task if you don’t know any data cleaning tools (just as cleaning a kitchen would be if you didn’t have any cleaning materials!), so in my next blog post I’ll demonstrate some R-based data cleaning techniques for the most common types of error.

References

  1. IBM. 2016. The four V’s of big data. Retrieved from: http://www.ibmbigdatahub.com/infographic/four-vs-big-data. [Accessed 25 July 2018]
  2. Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning: Detecting, diagnosing, and editing data abnormalities. PLoS Medicine, 2(10), 0966–0970. https://doi.org/10.1371/journal.pmed.0020267

Castles, canals and coffee: my first week as a FreeAgent data science intern

Posted on July 19, 2018

My regular job is as a PhD student specialising in veterinary biology, but this summer I have an amazing opportunity to be a data science intern at FreeAgent. If you are wondering why a biologist came to work at an accountancy software company, it is because of the particular branch of biology that I study: ‘epidemiology’. An epidemiologist typically uses data to understand ‘the distribution and determinants of health-related states or events’1, and the statistical techniques they use are often the same ones used by data scientists. For example, at university I use data about dogs to predict their health, and at FreeAgent I will be using data about customers to predict their success with the software. Despite these similarities, I still have so much to learn, and when I arrived at FreeAgent I was really excited to get stuck into my new role.

Inductions and introductions

After a quick chat with Dr. Dave Evans (FreeAgent Analytics Team Lead – hereafter referred to as ‘Dave’), I was greeted by the other members of the data science team: David (a permanent member) and Hannah (another data science intern). I was delighted to find my very own FreeAgent branded hoodie, t-shirt, sweets, pen, notebook and stickers waiting for me. Then, I saw the view from my desk! Edinburgh Castle in the sunshine? What a start!

Incredible scenes!

During the week, I attended various inductions: health and safety, office, sales, people operations, support, communications and the company itself. I must admit, I normally find inductions boring, but these were no average inductions. I got to meet the head of each department, as well as Ed Molyneux, the CEO and founder of FreeAgent. Everyone presented their roles enthusiastically, and the sessions were genuinely interesting and inspiring, and they helped me understand the company.

Winston the FreeAgent mascot!

Dave took me round and introduced me to everyone. Although I am still struggling to remember many names, I’m pretty sure that at least 50% of the male employees are called David (trust me, I’m a data scientist!), which makes it slightly easier. During my travels around the two-floor office I also became familiar with Winston, the animated FreeAgent mascot! As I am guilty of being a crazy cat lady, I was very happy to find I would have a feline friend for the summer (watch out for him in my blog posts).

Delicious delicacies

Food and drink featured heavily during the week: South African wraps, ciabattas and salads from the many local food havens, and various gins, beers and wines after work at the bars nearby. I had my first experience of the data science team’s Wednesday coffee at ‘The Counter’, a canal boat cafe. I don’t really drink coffee, but the view itself was enough to float my boat! On Thursday morning, we had bagels and iced coffee on the balcony while we did a ‘stand up’ meeting. Stand up is a short meeting where everyone literally stands up and says what they achieved the day before and what their goals are for the current day; it is a concept from the agile method commonly used in software development project management. Finally, I can’t forget the fantastic FreeAgent Friday catered lunch: a selection of gourmet salads did not fail to impress.

You can’t beat bagels and iced coffee!

Project progress

When I wasn’t meeting people, eating and drinking, or generally being blown away by the company ethos, I had time to do some background reading, learn more about my project and learn to use some new software and web tools such as Google Drive/Docs/Slides/Sheets, Trello, Slack and Amazon Web Services/Redshift/SageMaker. Towards the end of the week, I was able to extract some customer attitudinal data from Redshift and begin data exploration and cleaning: the process of removing errors from data to ensure data quality. My first week mainly involved getting to grips with the software and data, and planning what I would be doing in the upcoming weeks, which I will share more about in future blog posts.

Town hall talks

My favourite part of the week was without doubt the company-wide ‘town hall’ meeting on Friday afternoon. We grabbed a beer/wine/soft drink from the fridge and listened to presentations from employees in different departments. What struck me most was the great atmosphere: everyone presented their work enthusiastically, listened intently, asked questions respectfully and chatted afterwards as friends. I feel privileged to get the chance to work with such a friendly bunch!

A typical town hall audience

References

  1. WHO. 2018. Epidemiology. Retrieved from: http://www.who.int/topics/epidemiology/en/. [Accessed 11 July 2018]