In the vast expanse of the universe, I, a humble data science intern, set out on a mission to improve a classification model. As I delved deeper into the data, I encountered anomalies and outliers that threatened to disrupt my analysis. But with the guidance of my mentors and the help of advanced data tools, I navigated through the stars and uncovered the hidden patterns that led to breakthrough insights. Join me as I share my experiences and lessons learned in this epic journey of data exploration and discovery.
As I mentioned in my previous blog post, I was a researcher working at the edge of physics and chemistry. My PhD was about the interaction between ultrafast laser beams and potential drug materials. I modelled the interaction, produced datasets and tortured the data statistically to uncover its secrets. Sounds like a data job, doesn't it? After working for years in the pharma industry, I decided to steer my career towards data science and started an MSc degree in Operational Research with Data Science. After months of Python coding, statistics courses and lots of projects, I felt I was ready to get my hands dirty with a real data science project. When I saw the internship advertisement for FreeAgent, I knew this was the opportunity I had been looking for, and I applied. Even before I started my internship, FreeAgent helped me feel like one of the team: as soon as I received my interview result, I was invited to the Christmas party, where I got to meet the team members in person.
From day one of the internship, I felt the advantages of working in a well-established data science team. The structure of the team and the job responsibilities were clear, and if you wanted to know something, there was always someone with deep knowledge of it. People were always willing to help each other to make the system work smoothly.
I am a Windows user, so getting used to a Mac was an adjustment. Working in a terminal was a challenge too, even though I'm old enough to remember the days when there was nothing but MS-DOS.
I also needed to brush up on my SQL skills, as SQL is a must for a data scientist. Along the way, Git and GitHub became good friends of mine, and I ended up writing a blog post about them.
So, what project did I actually work on? The team had been working on a model to classify financial data into different categories based on a range of inputs. My goal was to figure out how the model actually worked. Treating the model as a black box, my job was to open it up and see how it decided to put things into different categories. I was particularly focused on the model's errors: why it was making them and how it could be improved.
I started with an exploratory analysis of the training data. I plotted histograms for each input to the model to understand their distributions. Then I turned to how the model performed on some evaluation data and started to dig deeper. I looked at the performance of the model at different thresholds and for each category it was trying to predict. Starting from the per-category performance, I tried to figure out why some categories did better than others. It turned out the model often confused categories with similar definitions.
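To give an idea of what this looks like in practice, here's a minimal sketch using pandas and scikit-learn; the data and category names are invented stand-ins, not the real FreeAgent inputs:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

rng = np.random.default_rng(0)

# Stand-in for the real training data: a couple of numeric inputs
features = pd.DataFrame({
    "amount": rng.lognormal(mean=4, sigma=1, size=1000),
    "day_of_month": rng.integers(1, 29, size=1000),
})
features.hist(bins=50, figsize=(10, 4))  # one histogram per input
plt.tight_layout()
plt.show()

# Stand-in predictions: per-category precision/recall plus the confusion
# matrix reveal which categories the model mixes up
categories = ["travel", "rent", "software"]
y_true = rng.choice(categories, size=1000)
y_pred = rng.choice(categories, size=1000)
print(classification_report(y_true, y_pred))
ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
plt.show()
```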
Then I realised that some of the values in the data were ridiculous. Some inputs had values in the hundreds of billions of pounds, which made no sense given the context. In data science speak, this means we had some outliers in the data. Outliers can affect the model in various ways depending on exactly how it works. Broadly, though, there are two types of outliers (a quick detection sketch follows the list):
1. Real outliers, where the values are genuine and we need to think about how to accommodate them in our model. This could be through how we transform the data, the type of model we use, and so on.
2. Outliers due to data issues. These don’t represent genuine data that we want to train our model on. Some kind of error has occurred that we need to handle. Given how unrealistic the outliers I found were, it felt like they were of this type.
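To make that concrete, here's a minimal sketch of distribution-based flagging, assuming a pandas DataFrame with a hypothetical `amount` column; the 1.5 × IQR rule is a common rule of thumb, not necessarily the exact cut-off we used:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"amount": rng.lognormal(mean=4, sigma=1, size=1000)})
# Inject a few implausible values of the kind I found in practice
df.loc[:2, "amount"] = [3e11, 5e11, 9e11]

# Flag anything far above the bulk of the distribution
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # a common rule of thumb for an upper fence
outliers = df["amount"] > upper
print(f"Flagged {outliers.sum()} outliers")
df_clean = df[~outliers]
```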
The next step after identifying the problem was to solve it. Based on the distribution of the data, I identified the outliers and removed them from our data set.

Even without any outliers, there was a question of how best to scale the data. It's common to preprocess the inputs to a machine learning model in various ways, and one thing we often do is scale numerical features to a given range or distribution. Lots of algorithms benefit from having data scaled, but there are several options for how to do it, so I tried a couple of scalers! The preliminary results started to look promising.

However, since I was testing lots of options on the evaluation data, I was in danger of overfitting. Overfitting here refers to the model being overly tuned to the evaluation data in a way that wouldn't generalise to new data. To avoid this we use various methods, one of which is cross-validation. Cross-validation avoids evaluating loads of different models on the same evaluation data by instead partitioning the training dataset into multiple training and test splits and training a model on each. I took my new approach back to the training data and applied k-fold cross-validation. Of all the scalers I tried, the best performing was the QuantileTransformer from scikit-learn: a non-linear transformer that maps a feature to a given distribution (either uniform or Gaussian) and is robust to outliers.
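For illustration, here's a minimal sketch of comparing scalers with k-fold cross-validation in scikit-learn; the classifier and the synthetic data are stand-ins, not the real model or dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

# Synthetic stand-in for the real training data
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

scalers = {
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "quantile (uniform)": QuantileTransformer(n_quantiles=100),
    "quantile (normal)": QuantileTransformer(n_quantiles=100, output_distribution="normal"),
}

for name, scaler in scalers.items():
    pipeline = make_pipeline(scaler, LogisticRegression(max_iter=1000))
    # 5-fold cross-validation: each candidate is scored on held-out folds,
    # so no single evaluation set gets overfitted to
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```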
Now it was time to present my results back to the team. The team asked questions and made suggestions. Every question and every suggestion showed me another aspect of my analysis that I had not considered yet. It was an invaluable learning process.
When the team decided my suggested changes were ready to implement in the code base, another challenge began for me. I brushed up my coding skills and Git became my best friend. I celebrated my first pull request on the model codebase. To be sure that my code worked as expected, I added unit tests using pytest and ran them in a Docker container, which initially ended with 13 failures. Going through them one by one and solving the issues was fun. I like a good puzzle.
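To give a flavour of those tests, here's a minimal sketch in pytest; `clip_outliers` is a hypothetical helper, not a function from the real codebase:

```python
import pandas as pd


def clip_outliers(values: pd.Series, upper: float) -> pd.Series:
    """Drop values above an upper bound (hypothetical helper for the sketch)."""
    return values[values <= upper]


def test_clip_outliers_removes_extreme_values():
    # One implausibly large value should be dropped, the rest kept
    values = pd.Series([10.0, 20.0, 3e11])
    result = clip_outliers(values, upper=1e6)
    assert 3e11 not in result.values
    assert len(result) == 2
```

Running `pytest` (inside a Docker container, in my case) picks up tests like this automatically.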
Finally, the code passed all the tests and a new version of the model was trained on AWS, using GitHub Actions to trigger a SageMaker pipeline. It was a complicated process, but knowing that my code was now running in the cloud was another kind of satisfaction.
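For illustration, here's a minimal sketch of the kind of boto3 call a CI job could make to start a SageMaker pipeline execution; the pipeline name, region and parameter are invented:

```python
import boto3

# Start a run of a (hypothetical) SageMaker pipeline from a CI job
client = boto3.client("sagemaker", region_name="eu-west-1")
response = client.start_pipeline_execution(
    PipelineName="classification-model-training",  # made-up name
    PipelineParameters=[
        {"Name": "ModelVersion", "Value": "v2"},   # made-up parameter
    ],
)
print(response["PipelineExecutionArn"])
```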
Then it was time to announce that this brand new model, with all the changes made, could correctly classify around 5% more cases than the previous model! This is a good step for the project, and the model I developed will be launched to test customers soon. Seeing that our work is part of the FreeAgent app, helping real people in their daily lives, is hard to put into words. All the hard work we have done is worth it, and it's addictive. Another question popped into my mind immediately: 'what will we do next?'
Data science is like exploring the final frontier, with endless possibilities and uncharted territories waiting to be discovered. My internship has been an adventure that has allowed me to develop new skills, push my boundaries, and make valuable contributions. As I look back on my journey, I am grateful for the opportunity to be part of the data science crew, exploring new frontiers and discovering hidden insights. As I prepare to boldly go where no intern has gone before, I am confident that this experience will propel me forward as I continue to navigate through the vast universe of data science. Live long and prosper!