The Data Science Internship Chronicles: A Starfleet-worthy Tale of Numeric Exploration

Posted by on March 30, 2023

In the vast expanse of the universe, I, a humble data science intern, set out on a mission to improve a classification model. As I delved deeper into the data, I encountered anomalies and outliers that threatened to disrupt my analysis. But with the guidance of my mentors and the help of advanced data tools, I navigated through the stars and uncovered the hidden patterns that led to breakthrough insights. Join me as I share my experiences and lessons learned in this epic journey of data exploration and discovery.

As I mentioned in my previous blog post, I was a researcher working on the edge of physics and chemistry. My PhD was about the interaction between ultrafast laser beams and potential drug materials. I modelled the interaction, produced datasets and tortured the data statistically to uncover the secrets. Sounds like a data job, doesn’t it? After working for years in the pharma industry, I decided to steer my career towards data science and started an MSc degree in Operational Research with Data Science. After days and months of Python coding, statistics courses and lots of projects, I felt that I was ready to get my hands dirty with a real data science project.

When I saw the internship advertisement for FreeAgent, I thought this was the opportunity I was looking for and I applied. Even before I started my internship, FreeAgent helped me feel like I was one of them. As soon as I received my interview result, I was invited to the Christmas party and had the opportunity to meet the team members there in person.

From day one of the internship, I felt the advantages of working in a well-established data science team. The structure of the team and the job responsibilities were so clear, and if you wanted to know something, there was always a person who had a deep knowledge of it. People were always willing to help each other to make the system work smoothly.

I am a Windows user, so getting used to using a Mac was a thing. Moreover, working in a terminal was a challenge, even though I’m old enough to know the times when there was only MS-DOS. 

I needed to brush up my SQL skills too, as SQL is a must for a data scientist. Along the way during my internship, Git and GitHub became my good friends and I ended up writing a blog post about them.

So, what project did I actually work on? The team had been working on a model to classify financial data into different categories based on a range of inputs. My goal was to figure out how the model actually works. Thinking of the model as a black box, I was to open this black box and see how the model decides to put things in different categories. I was particularly focused on the model errors, why it was making them and how it could be improved.

I started with an exploratory analysis of the training data. I plotted histograms for each input to the model to understand their distribution. Then I turned to how the model performed on some evaluation data and started to dig deeper. I looked at the performance of the model for different thresholds and for each category it was trying to predict. Starting from the performance of each category, I tried to figure out why the model did better on some categories than on others. It often confuses some categories because they have similar definitions.

Then I realised that some of the values in the data were ridiculous. Some inputs had values in the hundreds of billions of pounds when that didn’t make sense given the context. In data science speak, this means we had some outliers in the data. Outliers can affect the model in various ways depending on exactly how it works. However, there are broadly two types of outliers:

1. Real outliers where the values are genuine and we need to think about how to accommodate them in our model. This could be through how we transform the data, the type of model we use etc.

2. Outliers due to data issues. These don’t represent genuine data that we want to train our model on. Some kind of error has occurred that we need to handle. Given how unrealistic the outliers I found were, it felt like they were of this type.

The next step after realising the problem was to solve it. Based on the distribution of the data, I identified the outliers and removed them from our data set.

Even without any outliers, there was a question about how best to scale the data. It’s common to preprocess the inputs to a machine learning model in various ways. One thing we often do is scale numerical features to a given range or distribution. Lots of algorithms benefit from having data scaled, but there are several options for how to do it, so I tried a couple of scalers! The preliminary results started to look promising.

However, since I was testing lots of options on the evaluation data, I was in danger of overfitting. Overfitting here refers to the model being overly tuned to the evaluation data in a way that wouldn’t generalise to new data. The model would then perform worse on any other data. To avoid this, we use different methods and one of them is cross-validation. Cross-validation avoids evaluating loads of different models on the same evaluation data by instead partitioning the training dataset into multiple training and test sets and training a model for each partition. I took my new approach back to the training data and applied k-fold cross-validation. Among all the scalers I tried, the one that worked best was the QuantileTransformer (QT) from scikit-learn. QT is a non-linear transformer which maps a feature to a given distribution (either uniform or Gaussian) and is robust to outliers.

Now it was time to present my results back to the team. The team asked questions and made suggestions. Every question and every suggestion showed me another aspect of my analysis that I had not considered yet. It was an invaluable learning process.

When the team decided my suggested changes were ready to implement in the code base, another challenge started for me. I brushed up my coding skills and Git became my best friend. I celebrated my first pull request on the model codebase. To be sure that my code worked as expected, I added unit tests using pytest and ran them in a Docker container, which ended up with 13 failures in the beginning. Going one by one and solving the issues was fun. I like good puzzles.

Finally, the code passed all tests and a new version of the model was trained on AWS, using GitHub Actions to trigger a SageMaker pipeline for the model. It was a complicated process, but knowing that my code was now in the cloud was another kind of satisfaction.

Finally, it was time to announce that this brand new model with all the changes made could correctly classify around 5% more cases than the previous model! This is a good step for the project, and the model I developed will be launched to test customers soon. Seeing that our work is a part of the FreeAgent app and is helping real people in their lives cannot be described with words. All the hard work we have done is worth it, and it’s addictive. Another question popped up immediately in my mind: ‘what will we do next?’

Data science is like exploring the final frontier, with endless possibilities and uncharted territories waiting to be discovered. My internship has been an adventure that has allowed me to develop new skills, push my boundaries, and make valuable contributions. As I look back on my journey, I am grateful for the opportunity to be part of the data science crew, exploring new frontiers and discovering hidden insights. As I prepare to boldly go where no intern has gone before, I am confident that this experience will propel me forward as I continue to navigate through the vast universe of data science. Live long and prosper!

Mindfulness with GitHub

Posted by on March 20, 2023

I was a researcher in chemistry in my previous career, so I have a habit of labelling everything. It is important in chemistry to be organised; you don’t want to mix unknown liquids in unlabeled beakers. Can you guess why? BOOM!

I apply this habit in every area of my life. Now, everything has a place and is clearly labelled. I have a place for vertically striped socks and a separate one for horizontally striped ones. (No, don’t call the ambulance yet, this is just a joke.) And the freezer… I label each item according to its baking date, whichever one was baked first should be eaten first. I have colour codes for who did the baking too (maybe we can call the ambulance now).

However, when you’re looking through my computer, it’s a mess. First drafts, version 10 of first drafts, corrections from advisor version 12, suggestions from colleague version 1001, etc. I was always stressed when I sat at the computer and started working. Which document should I work on? Which person made what contributions to which version of the draft? When I got more involved in coding projects, this got even messier and it became impossible to track and organise code changes. I wanted to cry…

Then, there was light. I discovered GitHub. I had always heard of it, but I thought it was just a place where you can save and share your documents. No! Not that simple. Git lets you track the version changes and people’s contributions in detail. It’s a must-use tool for someone working on a project in collaboration with a team. Typically, this would be a software development project, but it could also be a book, an academic paper or any collection of files; in short, anything that changes in time and needs to be tracked. What GitHub does is help you keep organised and track the changes in the project.

Want to know some basics?

Let’s start with the ‘Git’ in GitHub. They are not the same thing. Git is an open-source version control tool. GitHub provides hosting for your code or documents and integrates with Git. In other words: Git is necessary for using GitHub, but technically you could use Git without GitHub. GitHub is not the only web-based hosting platform; other popular options are GitLab and Bitbucket.

Git is the system that GitHub uses to track changes in your files. What GitHub provides is lots of useful ways to share and collaborate on projects with people. So what is a project? A Git project is a set of files and directories, called a repository or repo. Git records the history of all the changes to your project and stores this information in a .git directory, which you should not delete or edit.

To make a brand new project repo, you can type:

git init directory_name

Let’s make an apple pie as a project. You type git init applepie to start your project. This command creates a new directory named ‘applepie’ with an empty repo inside it. Move into it with cd applepie, then add a recipe in a text file called ‘recipe.txt’ containing “pickles, apples, nuts, flour and butter”. If you would like to check the status of the project:

git status

What does ‘status of a project’ mean? It shows which files have changed in the project. This command returns a list of the files that have been modified since the changes were last saved. It also shows you which files are in the staging area, and which files have changes that haven’t yet been put there.

Staging area?

A staging area is where Git stores the files with changes you may want to commit (where commit means saving onto the repo), but have not yet committed. 

Let’s add our recipe.txt to it.

git add recipe.txt
git commit -m "starting recipe"
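If you would like to watch a file move through the staging area, here is a minimal sketch you can paste into a terminal. It works in a throwaway directory, and a few things in it are my own assumptions rather than part of the walkthrough: the name and email are placeholders, one ingredient per line is just a convenient layout, and git status --short is the compact form of git status.

```shell
# Work in a throwaway directory so nothing touches a real project.
cd "$(mktemp -d)"
git init -q applepie
cd applepie
# Placeholder identity, set for this repo only (Git needs one to commit).
git config user.name "Pie Baker"
git config user.email "baker@example.com"

printf 'pickles\napples\nnuts\nflour\nbutter\n' > recipe.txt

before_add=$(git status --short)    # untracked: not yet in the staging area
git add recipe.txt
after_add=$(git status --short)     # staged: waiting to be committed
git commit -q -m "starting recipe"
after_commit=$(git status --short)  # empty: nothing left to report

printf 'before add:   %s\nafter add:    %s\nafter commit: %s\n' \
  "$before_add" "$after_add" "${after_commit:-(clean)}"
```

In the short format, ?? marks a file Git is not tracking yet and A marks a new file sitting in the staging area.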

Now it’s time to try your recipe. Go bake an apple pie using recipe.txt. If you decide you didn’t like the apple pie with pickles, you can remove pickles from your recipe.txt file and type:

git add recipe.txt
git commit -m "pickles removed"

The first command stages your file; the second takes the files from the staging area and saves them in your repo.

You experiment with different spices, such as cinnamon. You go off and make your pie and taste it and want to update the recipe.txt. You may also want a new file called ‘tasting_notes.txt’.

git add tasting_notes.txt
git commit -m "taste notes for cinnamon added"

When you try git status again, you see that recipe.txt is not staged yet after adding cinnamon. You should type:

git add recipe.txt
git commit -m "cinnamon added to recipe"

From the start of the project, we did three commits for recipe.txt. If you want to compare two commits, then type:

git diff ID1..ID2

When you make a commit, Git assigns it a unique ID. This ID is known as a hash. You can use git log to see the hashes and commit messages for the commits in your project, as shown below. The long string of letters and numbers after the word ‘commit’ is the hash.

commit 2242bd761bbeafb9fc82e33aa5dad966adfe5409
Date: Sun Mar 19 11:07:32 2023 -0400
  cinnamon added to recipe
commit ab8883e8a6bfa873d44616a0f356125dbaccd9ea
Date: Sun Mar 19 11:07:32 2023 -0400
  pickles removed

git show ab8883e shows the changes that were made in that particular commit. The first few characters of the hash are enough for Git to find it.
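To see these history commands on a real repo, here is a small sketch that rebuilds the three recipe commits in a throwaway directory and then inspects them. The identity is a placeholder, the ingredient list is shortened, and looking the hashes up with git rev-list and git rev-parse (instead of copying them by hand from git log) is my own shortcut.

```shell
cd "$(mktemp -d)"
git init -q applepie
cd applepie
git config user.name "Pie Baker"        # placeholder identity
git config user.email "baker@example.com"

# Three commits that mirror the recipe's history.
printf 'pickles\napples\nnuts\n' > recipe.txt
git add recipe.txt
git commit -q -m "starting recipe"
printf 'apples\nnuts\n' > recipe.txt
git add recipe.txt
git commit -q -m "pickles removed"
printf 'apples\nnuts\ncinnamon\n' > recipe.txt
git add recipe.txt
git commit -q -m "cinnamon added to recipe"

# One line per commit, newest first: abbreviated hash plus message.
git log --oneline

# Look up the hashes of the oldest and the newest commit.
first=$(git rev-list --max-parents=0 HEAD)
latest=$(git rev-parse HEAD)

# Everything that changed between those two commits.
changes=$(git diff "$first".."$latest")
echo "$changes"

# A summary of what the latest commit alone changed.
git show --stat "$latest"
```

In the diff, lines starting with - were removed and lines starting with + were added between the two commits.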

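If you accidentally stage a file and want to take it back:

git reset HEAD filename

This command removes the file from the staging area but keeps your edits in the file. If you also want to throw away the edits and go back to the staged (or last committed) version of the file, you can use:

git checkout -- filename

Be careful with this one: the discarded edits cannot be recovered.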
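The difference between taking a file out of the staging area and throwing its edits away is easy to mix up, so here is a small sketch of both in a throwaway repo. The identity and the ketchup are placeholders of mine.

```shell
cd "$(mktemp -d)"
git init -q applepie
cd applepie
git config user.name "Pie Baker"        # placeholder identity
git config user.email "baker@example.com"

printf 'apples\nflour\nbutter\n' > recipe.txt
git add recipe.txt
git commit -q -m "starting recipe"

# Make an edit and stage it by accident.
echo 'ketchup' >> recipe.txt
git add recipe.txt

# Unstage: the edit leaves the staging area but stays in the file.
git reset -q HEAD recipe.txt
staged_after_reset=$(git diff --cached --name-only)
git status --short

# Discard: the file goes back to the last committed version.
git checkout -- recipe.txt
cat recipe.txt
```

After the reset, git status still lists recipe.txt as modified; after the checkout, the ketchup is gone for good.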

What is a branch?

As I mentioned before, when you’re working on a project, you always end up with different versions of your work. Git allows you to create branches to track the different versions of your project. 

So far you’ve been making all your changes to the main branch, but you might want to try different versions of your apple pie recipe at the same time or keep a ‘clean’ version on the main branch you’re definitely happy with. For the next iteration of the recipe, you can create a branch.

Let’s try a new recipe for a friend who has a nut allergy.

git checkout -b nut-free-pie

This command creates a new branch and switches to it. Every repo has a default branch, usually called main. If you want to see the other branches, the command you type is:

git branch

The one with a star next to it is where you are now. To jump from one branch to another:

git checkout nut-free-pie

If you want to compare the differences between the two branches, type:

git diff main..nut-free-pie

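If you think it’s OK to merge the branches, switch to the branch you want to merge into and then merge the other one:

git checkout main
git merge nut-free-pie

If Git needs to create a merge commit, it lets you write a message to explain why you’re doing this.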
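Here is a sketch of the whole branch round trip in a throwaway repo: create the branch, change the recipe there, compare, and merge back. The identity is a placeholder, git branch -M main is only there to make sure the default branch is called main, and --no-ff with -m is my own choice so that a merge commit with a message is created without opening an editor.

```shell
cd "$(mktemp -d)"
git init -q applepie
cd applepie
git config user.name "Pie Baker"        # placeholder identity
git config user.email "baker@example.com"

printf 'apples\nnuts\nflour\nbutter\n' > recipe.txt
git add recipe.txt
git commit -q -m "starting recipe"
git branch -M main   # make sure the default branch is called main

# Create the branch, switch to it, and drop the nuts there.
git checkout -q -b nut-free-pie
printf 'apples\nflour\nbutter\n' > recipe.txt
git add recipe.txt
git commit -q -m "nuts removed"

# Back on main the recipe still has nuts.
git checkout -q main
git branch
branch_diff=$(git diff main..nut-free-pie)
echo "$branch_diff"

# Merge into the branch you are on; -m supplies the merge message up front.
git merge -q --no-ff -m "merge nut-free recipe" nut-free-pie
cat recipe.txt
```

Without --no-ff, Git would simply move main forward to the branch's latest commit here, because main had no new commits of its own.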

If you want to collaborate with your friends on the greatest apple pie recipe of all time, this is where GitHub comes in handy. Create an empty repo on GitHub.com, then link your local repo to it and upload your main branch:

git remote add origin URL_of_greatest_applepie
git push -u origin main
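You can rehearse linking a repo to a remote without a GitHub account. In this sketch a bare repo on your own disk stands in for the one you would create on GitHub.com; the directory names and the identity are placeholders, and git ls-remote is only used to check what the remote holds.

```shell
cd "$(mktemp -d)"
# Stand-in for the empty repo you would create on GitHub.com.
git init -q --bare greatest_applepie.git
remote_url="$PWD/greatest_applepie.git"

git init -q applepie
cd applepie
git config user.name "Pie Baker"        # placeholder identity
git config user.email "baker@example.com"
printf 'apples\nflour\nbutter\n' > recipe.txt
git add recipe.txt
git commit -q -m "starting recipe"
git branch -M main

# Link the local repo to the remote under the conventional name origin.
git remote add origin "$remote_url"
git remote -v
# Upload main and remember origin/main as its upstream.
git push -q -u origin main

# The remote now holds the same commit as the local main branch.
remote_head=$(git ls-remote origin refs/heads/main | cut -f1)
echo "$remote_head"
```

With a real GitHub repo the only difference is the URL you pass to git remote add.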

You and your friends can create your own branches and work on them separately. If you ever want to see which changes in a file were made by you and which by your friends, you can type:

git annotate filename

If you’ve made a change to the recipe but you want your friend to review it before it gets merged into the main branch and becomes part of the final recipe, you can submit your changes through a pull request.

To do this, you can use GitHub.com. After pushing your branch, open the main page of your repo and choose the branch that you’re working on. Click the ‘pull request’ button. Fill in the title and description clearly so that all collaborators understand the changes you have made. When your changes are ready for review, just press ‘Create Pull Request’.

Your collaborators will then be notified about your changes. That way one of your friends can check the changes before adding them to the final recipe.

And if your friend has a branch called `add-nuts` on the shared repo, then you can run git pull origin add-nuts to merge their changes into your current branch and try out their recipe for yourself.

Finally, your friends say they have created a perfect recipe for a peach pie. If you want to try it, you can get a copy of their repo with:

git clone URL_of_peachpie
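The sharing commands can also be tried offline. In this sketch a local bare repo plays the part of GitHub, one directory is your copy and another is your friend's clone; all directory names and identities are placeholders, and the git symbolic-ref line only makes sure the stand-in repo uses main as its default branch.

```shell
cd "$(mktemp -d)"
root=$PWD
# Stand-in for the shared repo on GitHub, with main as its default branch.
git init -q --bare shared.git
git -C shared.git symbolic-ref HEAD refs/heads/main

# Your copy: one commit, pushed to the shared repo.
git init -q mine
cd mine
git config user.name "Pie Baker"        # placeholder identity
git config user.email "baker@example.com"
printf 'apples\nflour\nbutter\n' > recipe.txt
git add recipe.txt
git commit -q -m "starting recipe"
git branch -M main
git remote add origin "$root/shared.git"
git push -q -u origin main

# Your friend clones the repo: files plus the full history.
cd "$root"
git clone -q "$root/shared.git" friend
cd friend
git config user.name "Nut Fan"          # placeholder identity
git config user.email "friend@example.com"
git checkout -q -b add-nuts
echo 'nuts' >> recipe.txt
git add recipe.txt
git commit -q -m "nuts added"
git push -q origin add-nuts

# Back in your copy: merge the friend's branch into the branch you are on.
cd "$root/mine"
git pull -q origin add-nuts
cat recipe.txt
```

After the pull, your main branch contains your friend's commit as well, so the recipe now ends with nuts.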

These are the basic commands that will help you survive. 

So now, we can relax. The project files are no longer a mess and our brains are at ease. Use GitHub for a healthy mind!

Take a breath and say git pull, exhale git push.