Trading the lab coat for the computer – my journey to data science

Posted on February 8, 2022

I became a data scientist just over two years ago. It’s not that long since I traded my lab coat for a computer job, and a few people have asked me how I made the transition, whether I could help them get into data, or simply what it’s like to work in the field. So I figured I would put it all together in writing…

Introspection

I should start by saying that the key to changing careers successfully is to find the method that will work for you. What I describe next worked for me, but your experience may differ 😁. 

It all started in my old job where I worked as a lab manager. I have a chemical engineering degree, a PhD in organic chemistry and I had a successful career in the biotechnology industry that lasted for 12 years. I loved the people I worked with, but in the end I took on too much, to the point where stress and anxiety were impacting my life.

I realised that I was not going to get better until I knew there was an end to it, so I resigned. I didn’t know what I was going to do next; I only knew I had my three months’ notice to work and that I was doing the right thing for my wellbeing.

I am a woman who loves a plan (and a list). For the first time in a while I could do some thinking, and I thought about my job. I had been there a while and there was plenty I loved about my career as a scientist. Two themes jumped to the top of the list: 

  • Problem solving

When a customer came with an idea, we had to spend time translating it into a question we could answer and come up with a project plan. I enjoyed spending time discussing what they were trying to do, the aim of the project and figuring out the best approach to answer their question. 

  • Analysing and presenting results

I could spend hours on a spreadsheet. I loved putting together data for customers, finding the best way to present my results and communicating these results in a way that would make sense to them. 

Looking at this, the answer was obvious: I needed to change careers and to go into data. 

Practicalities

Here is how I made the transition into data and got started. Leaving a career to start something new is a daunting experience and a rollercoaster of emotions. Everyone who has changed careers will have a different story, but I hope there are some useful starting points here for anyone looking to do something similar.

Some research

You can’t take the researcher out of the scientist. 

A job can sound a lot more glamorous on paper than it is in real life, so I started by researching what a data job actually involved.

Looking at my connections on LinkedIn, I realised I knew a few data scientists, and I started by asking them a lot of questions. I wanted to have an idea of what the day-to-day of a data scientist was. Every job has some tasks that are more interesting than others, and I wanted to know if the reality of the job lined up with what I thought the job entailed.

I also wanted to find out about people’s journeys into their current roles. Very few data scientists come from a computer science or data science background, and it was very interesting to see how many different paths one could take into the field.

I then spent time figuring out which skills I was missing. Problem solving, analytics and communication were skills I had been fortunate enough to develop during my career as a scientist. The gap was obvious: all the analysis I had done was on small datasets, and all of it was in Excel. To get into a technical role I needed the technical skills, which meant finding a way to learn how to code.

Learning to code

My journey to getting these technical skills started online. There is a lot of introductory material available, and I started with The Data Science Course 2019: Complete Data Science Bootcamp on Udemy. It gives a very good overview of the various practical concepts of data science and it is hands-on. I combined it with the Complete 2021 Web Development Bootcamp (also on Udemy) to start working on my own projects. I thought of a question that I would be interested to answer, worked on finding the data and put some results together. 

This approach worked for the first couple of weeks, but I found what I was doing was too superficial. I decided to sign up for CodeClan’s intensive data analysis course to learn how to use code in data analytics. It was tough but I thoroughly enjoyed it.

I learned how to work with code to explore data, carry out statistical analysis and present data. I also started working with machine learning algorithms; I only scratched the surface, but I was hooked. I learned a lot in the space of 14 weeks and felt ready to start my new career.

After Retraining

The first step of finding a job in data is to filter through the various job adverts. The same role can have a different meaning for different organisations. There are also so many buzzwords in data that figuring out which employers need a data scientist, or know what skills they need, can be quite challenging. 

It’s also important to find out how much data awareness there is in the company by asking questions during an interview. What problems are they trying to solve? Who will take action on the results? What is their current data infrastructure? These are key questions that will help to find out if your skills are a good match for their business.

I feel very fortunate to be in my current role. I work in a collaborative environment where I can learn from everyone in the data team. The work we do is valued and key to the business. I am lucky to work with cutting-edge technology but in an organisation that also gives me time to work on my personal development. My soft skills are also valued. I have already been able to share some ideas about processes at FreeAgent with the skills I gained as a senior manager in my former career. 

Conclusions

I would say it took me a year from the moment I decided to change my career to the moment I got my first job in data. I did most of the preliminary research into what I wanted to do and what a data job looked like while I was still employed, but retraining and learning new technical skills was a full-time job. 

Starting something new was hard and scary but I enjoyed it all the way through. Two years on and I still think it is the best decision I have ever made. I now work in a field that I am passionate about. I have been lucky to find a job with a company that puts people first and has values very close to mine. 

My job is to learn new things every day – it really doesn’t get any better than this!

Training Hugging Face models in parallel with SageMaker

Posted on February 2, 2022

In a previous post we described how to fine-tune a BERT model for bank transaction categorisation using Amazon SageMaker and the Hugging Face transformers library. We’ve come along a fair bit since then and are preparing to get this model into production. One of the steps for doing this was to train the model on a much larger dataset than we used during initial experimentation.

Training time scales linearly with the size of the training dataset so we were expecting ~30 hours to train our model on an ml.p3.2xlarge instance (single GPU). To get this down to something more reasonable we wanted to use an instance with multiple GPUs running our training in parallel, hopefully keeping costs roughly constant. A quick search brought us to this distributed training section in the Hugging Face docs.

Of the two distributed training strategies described we decided to take the data parallel approach. Model parallel training, which divides layers of a single copy of the model between GPUs, seemed conceptually a bit trickier and more suited to the case where the model has a very large number of parameters and is therefore difficult to fit into memory. 

Since we are using the Trainer API, enabling data parallel training was as easy as adding the following distribution parameter to our Hugging Face Estimator:

distribution = {'smdistributed': {'dataparallel': {'enabled': True}}}

and choosing one of the supported instance types. For our availability zone this is currently one of ml.p3.16xlarge, ml.p3dn.24xlarge or ml.p4d.24xlarge, all with 8 GPUs per instance. This gives the following Estimator:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='s3://bucket/scripts/',
    image_uri=TRAIN_IMAGE,
    instance_type='ml.p3.16xlarge',
    instance_count=1,
    role=role,
    hyperparameters=hyperparameters,
    distribution=distribution,
)


When this distribution parameter is added, the Trainer will make use of the SageMaker Distributed Data Parallel Library.

The data parallel approach creates an identical copy of the model and training script on each GPU and distributes the data in each training batch between them. These ‘batch shards’ are processed in parallel to produce separate sets of gradients, which are averaged using an AllReduce algorithm so that the same parameter update is applied to every model copy. This can be extended to multiple nodes (instances), each with multiple GPUs, to build up a data parallel cluster.
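To make the mechanism concrete, here is a toy, single-process sketch of one data parallel training step. It is not the SageMaker library itself — the model (a single weight fitting y = 2x), the learning rate and the shard layout are all invented for illustration — but it shows the shard → per-copy gradient → AllReduce-average → identical update cycle described above.

```python
# Toy illustration of one data parallel training step.
# Model: a single weight w fitting y = 2x with squared-error loss.

def gradient(w, shard):
    """Mean gradient of (w*x - y)^2 over one batch shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for AllReduce: average one gradient per GPU."""
    return sum(values) / len(values)

def data_parallel_step(w, batch, n_gpus, lr=0.01):
    # Split the global batch into one shard per GPU.
    shards = [batch[i::n_gpus] for i in range(n_gpus)]
    # Each model copy computes a gradient on its own shard
    # (in parallel on real hardware; serially here).
    grads = [gradient(w, shard) for shard in shards]
    # Average the gradients so every copy applies the same update.
    return w - lr * all_reduce_mean(grads)

data = [(x, 2 * x) for x in range(1, 9)]  # points on y = 2x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, n_gpus=4)
print(round(w, 3))  # converges towards 2.0
```

Because every copy applies the same averaged update, the copies stay in lockstep — which is why a data parallel cluster behaves like one model trained on the full batch.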

In practice there is some overhead associated with the intercommunication between nodes/GPUs, but the reduction in training time scales almost linearly with the number of available GPUs. In our case we cut that 30 hour estimate down to just over 5 hours, meaning it cost us roughly the same amount to train our model in a fraction of the time.
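The figures above imply a scaling efficiency that is easy to sanity-check. Using the 30-hour single-GPU estimate and the roughly 5-hour observed run on 8 GPUs from this post:

```python
# Back-of-the-envelope check of the scaling reported above.
single_gpu_hours = 30   # estimated time on ml.p3.2xlarge (1 GPU)
multi_gpu_hours = 5     # observed time on ml.p3.16xlarge (8 GPUs)
n_gpus = 8

speedup = single_gpu_hours / multi_gpu_hours   # 6.0x faster
efficiency = speedup / n_gpus                  # 0.75, i.e. 75%
print(f"{speedup:.1f}x speedup, {efficiency:.0%} scaling efficiency")
```

A 6x speedup on 8 GPUs (75% efficiency) is consistent with "almost linear" scaling minus the communication overhead between GPUs.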