What a data science degree doesn’t teach you

Posted on July 28, 2022

When I enrolled on my data science master’s degree I had limited statistical and coding knowledge. This course was designed to teach these skills from the bottom up. Having now worked as a software engineering intern, I have come to realise a lot of things were missed.

Moving beyond ‘if it works… it works!’

Learning to code can seem very daunting. There are so many resources and even languages. Where do you even start? 

When working on assignments at university the overwhelming attitude was ‘if it works… it works’. In our lectures and workshops we were shown brief tutorials on how to approach certain problems and were encouraged to apply this knowledge to make creative solutions to our assignments. 

However, we were never encouraged to think about efficiency. In fact, I cannot recall a single lecture that discussed this and, on reflection, I do not know why. If prompted to think about efficiency, students would have thought conscientiously about their work and produced clearer, faster solutions. It would also mean examiners would have to spend less time marking each assignment. Failing to teach efficiency is inefficient in itself. 

Unfortunately, with university assignments there is also very little opportunity to work on feedback. You hand in an assignment, receive a mark and probably never think about it again. A data science degree did not teach me how to work with feedback to improve solutions I create. 

While at FreeAgent I have been given many opportunities to share my work, explain what I have been doing, receive feedback and then work on this. In a working environment my mindset has moved from my university approach of “if it works… it works” towards “it works… but what if we did this instead?”

Test?! But I can see it works

The second thing a data science degree does not teach you is the value of testing. Before starting at FreeAgent I would approach an assignment head-on and figure out my solution as I was working. I would spend hours working on a large project, fixing errors as they occurred.

A different approach is encouraged at FreeAgent. When a new problem arises, the first step is to decide what your desired outcome is and capture it in a test. The solution is then built to achieve that result. Writing the test first forces you to put more thought into what you want to achieve before you write any of the solution. At university, by contrast, I was used to creating a solution and stopping at an endpoint that was “good enough” to satisfy the problem at hand.
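As a rough illustration of this test-first habit, here is a minimal sketch. It is not FreeAgent’s actual workflow: the function name average_mark and the plain assert style are assumptions made for the example. The test is written before the solution and states the desired outcome.

```python
def test_average_mark():
    # Written first: pins down the desired outcome before any solution exists.
    assert average_mark([60, 70, 80]) == 70
    assert average_mark([]) == 0  # decide up front what an empty input should give


def average_mark(marks):
    """Return the mean of a list of marks, or 0 for an empty list."""
    if not marks:
        return 0
    return sum(marks) / len(marks)


test_average_mark()  # passes silently; a failed assert would raise AssertionError
```

Deciding what an empty list should return is the kind of question that tends to surface only when the test comes first.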

I’ll just add some comments 

Throughout my master’s degree, I remember handing in many assignments thinking “good luck making sense of that”. I would submit work that was long, inefficient and unorganised. Ultimately I was not producing readable code. My solution to this usually included excessive commenting and there were probably only a few lines of code that were free of a lengthy description.

This worked at university and I never received any feedback about the readability of my code. Thinking about this now, maybe this was because my work would only ever be seen by one examiner. Realistically they had an abundance of assignments to mark and a restricted time to do this in. Maybe this meant all they had time to do was run the code, check it worked, give it a quick skim and assign a mark. Again the attitude of “if it works… it works” was rewarded.

Across the last couple of months at FreeAgent I have had many opportunities to discuss my work with other members of the team and have collaborated on problems in pairing sessions. Looking at other people’s work and discussing my own has taught me about coding conventions, classes and packages that I had never used before. If time did not go into making code readable, sharing resources would be incredibly difficult and it would take hours to begin to understand what was going on before you could even get started.

Please don’t ask me to explain what this does

When learning to code it is easy to Google what you want to do, or to copy and paste an error message into Google. This usually works pretty well and most of the time you get the solution or answer you want, because someone has almost certainly encountered the same problem before. However, as a problem-solving approach this does not really teach you anything. For me, this meant I would hand in assignments and submit code that I did not really understand. Equally, if I came across the same problem again I probably would not know how to fix it and would be straight back on Google.

While at university, if a peer asked me a question about my work sometimes I would say: “I don’t know how it works, but it works”. I got away with this for so long because it was very rare that I had to work with someone else or in a group. Most work was independent. 

At FreeAgent I have to ask someone for help with something nearly every day, and to do this I need to be able to explain the work I have done so far. This constant communication means I receive consistent feedback to progress my work, but I am also able to ask questions if I come across something that I don’t understand.

On reflection, I learnt a lot from my master’s degree and it was an excellent start in learning how to code. However, it is undeniable that I have learnt so much more when given the opportunity to collaborate with others, discuss my work and view what others have been working on.

Getting started with Jupyter Notebook

Posted on July 12, 2022

Jupyter Notebook is a development environment that runs in your web browser and can be used with several languages, including R and Python. In this blog post, we’ll look at some of the benefits of using Jupyter Notebook and how to start using it with Python. 

Benefits of Jupyter Notebook

Chunking code into cells

Instead of having to write code in large flat files, developers can use Jupyter Notebook to chunk documents up into cells that can contain either code or formatted text. Running these cells separately allows for quicker iterations and more targeted edits, without the need for complex dependencies between separate files. 

Referencing variables between cells

Being able to reference variables between cells means that developers can run a computationally heavy cell only once. They then have that result to use in any other cell, and can quickly make edits and test different approaches in those cheaper cells.
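A minimal sketch of that pattern, with the cell boundaries marked as comments. The function slow_total and its workload are hypothetical stand-ins for a genuinely expensive step.

```python
# Cell 1: the computationally heavy step, run once.
def slow_total(n):
    """Stand-in for an expensive computation (hypothetical)."""
    return sum(i * i for i in range(n))

total = slow_total(1_000_000)

# Cell 2: cheap to edit and re-run, because it reuses `total` from cell 1.
print(total % 1000)

# Cell 3: another quick experiment on the same stored result.
print(len(str(total)))
```

Editing cell 2 or cell 3 and re-running it does not repeat the work done in cell 1.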

Making results easy to share and understand

The ability to intersperse nicely formatted text cells between code cells allows developers to use Jupyter Notebook to make good-looking documents for sharing. It also allows them to put forward clearer and more readable explanations than inline comments alone would support. This makes Jupyter Notebook great for tutorials; in fact, a university tutorial is where I first encountered it. 

Stepping through script one segment at a time

Jupyter Notebook’s cell structure naturally segments code and the output is displayed directly beneath each cell, allowing developers to display intermediate results easily. It also makes cause-and-effect clearer to see than it would be in one large output from a long chunk of code.

Setting up Jupyter Notebook

You can try out Jupyter Notebook from your web browser without any setup process required. However, I would say that the best experience comes with setting up Jupyter Notebook on your machine. As long as you have both Python and a package manager installed, it’s quick to do and requires only a few commands. 

To follow these instructions, you will need to be working on a Linux or Mac machine, and have both Python and the ‘Poetry’ package manager installed. Find out more about using Poetry to manage Python dependencies across multiple projects.

  1. In your terminal, navigate to the folder that you want to contain your Python project
  2. Run poetry new <PROJECT NAME>. This will set up the project folder. Then navigate to it using cd <PROJECT NAME>.
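  3. Run poetry add -D jupyter to download Jupyter and everything required to run it. On Poetry 1.2 or later, poetry add --group dev jupyter is the preferred form.
  4. If you want to use outside packages in the environment you have created, run poetry add <PACKAGE NAME>. I recommend numpy, pandas, matplotlib and scikit-learn (which is imported in code as sklearn).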
  5. Run poetry run jupyter notebook. This will open Jupyter in your preferred browser automatically and may take a few moments to complete.
  6. Jupyter should now be up and running. To get a notebook up, navigate to ‘New -> Python 3 (ipykernel)’.
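Once a notebook is open, a first cell along these lines can confirm that the packages you added are visible to the kernel. This is only a sketch: the helper missing_packages is a hypothetical name, and it relies on the standard library’s importlib.util.find_spec rather than anything specific to Poetry or Jupyter.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names for the recommended packages; scikit-learn is imported as `sklearn`.
recommended = ["numpy", "pandas", "matplotlib", "sklearn"]
print("Missing:", missing_packages(recommended))
```

An empty list means everything was found. Anything listed can be added with poetry add and checked again after restarting the notebook.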

Reading & using a Jupyter Notebook

How to read a notebook

A Jupyter notebook is split into cells and each cell can contain text, code or images. The ‘Example Heading’ cell in the example below, for instance, is a Markdown cell that employs some basic text formatting.

Beneath the ‘Example Heading’ cell is an example code cell. It contains some standard Python, which performs some arithmetic and prints the result. When you run the cell, the output from the code (7 in this example) appears directly beneath it. The blue outline in the example above signifies that a cell is currently selected. If a cell is currently being edited, a green outline will be visible instead.

Below this is another code cell. Notice that I can reference a variable defined in an earlier cell. However, for this to work the cell that defines the variable must be run at least once before the cell that references it. The number next to the cell is its execution count, which shows the order in which cells were most recently run. You can also tell when a cell has finished running because its marker changes from [*] to a number.
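The two code cells described above might look something like this sketch. The exact arithmetic and the variable names are assumptions; only the output of 7 comes from the description.

```python
# Cell 1: some arithmetic, with the result printed beneath the cell.
total = 3 + 4
print(total)  # 7

# Cell 2: references `total`, so cell 1 must have been run first.
# In a fresh kernel, running this cell on its own raises NameError.
doubled = total * 2
print(doubled)  # 14
```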

How to use a notebook

Click on the grey part of a code cell to enter ‘edit’ mode. Write the code you want to run and press ‘ctrl + ENTER’ to run that cell. I typically use ‘shift + ENTER’ as it not only runs the cell, but also selects the next cell down or creates a new one if there is not one already beneath it. 

Markdown cells are run the same way as code cells. To change a code cell to a Markdown cell, select it in the command mode (signified by the blue outline) by clicking on the cell – but outside the grey area – then press ‘M’. 

To delete a cell, select it in command mode and double tap ‘D’.

To force-halt a cell you can press the button signified by the ‘Stop’ icon at the top of the page.

It’s a good idea to take a look at the other options in the notebook. You’ll find examples of both interesting and typical commands under the headers at the top of the page. For example, saving is found under ‘File’.

Tips for using Jupyter Notebook

Over the years, I’ve learned a lot about using Jupyter Notebook, including how to avoid some common pitfalls. Here are my top tips based on everything I’ve learned so far:

  • Running <PACKAGE NAME>? in a code cell brings up a description of the package. This also works with class names.
  • Try to keep your cells in the order in which you want them to be run. This way, when you reopen the notebook you can just fire through the cells using ‘shift + ENTER’.
  • If something is behaving weirdly, try resetting your notebook. You can then run each cell in order and see the result. It can be easy to get lost in the variable environment you’ve built up, as it may contain strange assignments and accidental reassignments. In these cases, nothing beats switching it off and on again.
  • You can also reset a notebook by stopping Jupyter in the terminal with ‘ctrl + C’. However, you will then have to re-run poetry run jupyter notebook.
  • Be aware that closing the tab does not stop the notebook running. To stop the notebook you can navigate to the initial browser tab opened by poetry run jupyter notebook, select the notebook that’s running and then press ‘shutdown’.
  • Be careful about which variable names you use, as you might accidentally rewrite a value assigned in an earlier cell. If you go back and run cells that depend on that value, you might get unexpected results.
  • This is a specific matplotlib consideration, but when you are setting axis titles make sure to run plt.xlabel('xlabel') and not plt.xlabel = 'xlabel'. The assignment replaces the xlabel function with a string, and you’ll have to reset the notebook to get it back to normal.
  • When you want to restart a notebook, make sure you’re shutting it down properly. Remember that closing and reopening the tab does not reset the notebook.
  • When you make a change to a function or class definition, make sure that you re-run the cell and that it completes without any errors. Otherwise, the other cells will be unaware of the change and will still work from the old definition.
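The matplotlib tip above comes down to the difference between calling a function and assigning over it. The sketch below shows the mechanism using a tiny stand-in module built with types.ModuleType, so that it runs without matplotlib installed. The stand-in and its labels dictionary are hypothetical and are not how matplotlib works internally, but with the real plt.xlabel the same kind of TypeError appears.

```python
import types

# Hypothetical stand-in for matplotlib.pyplot, so the sketch has no dependencies.
plt = types.ModuleType("fake_pyplot")
labels = {}


def xlabel(text):
    """Record the x-axis label, loosely mimicking what plt.xlabel is for."""
    labels["x"] = text


plt.xlabel = xlabel

# Correct: call the function.
plt.xlabel("time (s)")

# Mistake: this assignment replaces the function with a string for the whole session.
plt.xlabel = "time (s)"

try:
    plt.xlabel("time (s)")
except TypeError as err:
    error_message = str(err)
    print("Calling plt.xlabel now fails:", error_message)
```

Because the notebook keeps the module in memory between cells, the broken attribute persists until the notebook is reset.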

That’s it for this introduction to getting started with Jupyter Notebook. I hope I’ve interested you in trying it out and that you can have fun using it!