Factories: don’t stop production!

Posted on August 29, 2023

Why this post?

Have you ever come across a situation where you need to write a test that uses some model objects, but found that those have endless dependencies on the existence of other objects, from the same model or otherwise? Have you ever come across a test where you only care about a specific attribute of a model object, but you find yourself having to populate every single one of the object’s attributes for that test object to be valid? Chances are, you have (or, at least, I hope so, or it might mean you’re not testing at all!). It’s always a pain to have to create 5, 10, 20 objects, and have to specify every single attribute for all those objects, each time you want to test some small new feature or bug fix.

If you ever came across this problem, you may have been lucky enough – after a quick, hungry search landing you on StackOverflow (for example) – to find out about fixtures. Oh, they are amazing things brought to you directly by Rails! They allow you to define and provide some fixed test data for your application, with which your database can be pre-populated before each test. This way, you do not have to constantly create your test data in-place, in an explicit way. You can read more about fixtures in the official Fixtures API documentation.

As you have probably guessed, this makes things a lot easier for us when writing tests. However, fixtures have one important problem: they are too… well, fixed. A single fixture that we define will always create (pretty much) the exact same object. For example, if we create a blog post fixture, it will always have the same title, content and published_at date (for example, two days ago). They don’t give us much flexibility. What if you want a blog post that has a different published_at (for example, 1 month from now)? You can create the object from one of the existing fixtures and update the instance afterwards. Luckily for you, this is a simple example. What if you wanted to change the author of the blog post? You would have to create another user and then update the instance. With more complex models, this can easily turn into one of two situations:

  • complicated tests that modify fixture-generated objects, which defeats the purpose of fixtures in the first place.
  • an unsustainable number of fixtures to account for all permutations and combinations of attributes that we might need.
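
For a sense of what the first situation looks like, here is a hypothetical sketch of the fixture-tweaking workaround described above (blog_posts(:welcome), users(:other_author) and the attribute names are made-up examples, not part of the project used later in this post):

post = blog_posts(:welcome)
post.update!(
  published_at: 1.month.from_now,   # override only the attribute this test cares about
  author: users(:other_author)      # changing the author means yet another fixture must exist
)

Every attribute we want to change means another line of setup, and every new association means another fixture.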

So, you decide there has to be something better out in the world, so you hop back onto StackOverflow and discover: factory_bot!

So, what is factory_bot?

What is factory_bot, and how is it different from fixtures? From the official factory_bot repo, it is defined like this:

factory_bot is a fixtures replacement with a straightforward definition syntax, support for multiple build strategies (saved instances, unsaved instances, attribute hashes, and stubbed objects), and support for multiple factories for the same class (user, admin_user, and so on), including factory inheritance.

factory_bot allows us to create data factories, which are a sort of blueprint for creating an object (or a set of objects) with predefined attributes. Similarly to fixtures, factories give us quick and easy access to the data we need to run our tests, but with one fundamental difference (among many others): fixtures define fixed objects that populate the database before every test, whereas we can use factories to generate specific objects flexibly whenever we need them for a particular test. They achieve this by having a default state in which they are created, but which can be changed in many ways that we will discuss further ahead. 

Another massively useful feature that factory_bot has is its support for different build strategies. These are a great help in improving test performance and speed in cases where interaction with the database is not strictly required, and we will also discuss them in coming sections. The main idea is that we have the ability to create objects that are not persisted to the database, but stored in memory as objects or as collections of attributes instead.

So, let’s hop right to it!

Basic project to get started

In this post, we will be using a basic todo app as an example, so we can write tests for it and show the power of factories. If you want to learn how this project was set up, you can check out this document. Otherwise, you can directly clone the project and go to the point where all the basic setup was completed:

git clone https://github.com/bfrangi/rails-todo.git
cd rails-todo
git reset --hard c396394

Feel free to follow along with a project of your own if you prefer!

Let’s write some tests!

Ok, so let’s write some tests for our project. We will start by creating some controller specs for our TasksController in the spec/controllers/tasks_controller_spec.rb file. Before we do that, we need to install another useful gem that will give us some methods to help us write our tests. To do this, add gem 'rails-controller-testing' to your Gemfile, in the development and test group:

# Gemfile

group :development, :test do
  ...
  gem 'rails-controller-testing'
end

And run bundler to install the gem:

bundle install

Also, to be able to perform sign-ins in our controller specs, we need to create a quick helper for our tests:

# spec/support/authentication_spec_helper.rb

module AuthenticationSpecHelper
  def sign_in(user)
    if user.nil?
      allow(request.env['warden']).to receive(:authenticate!)
      .and_throw(:warden, { scope: :user })
      allow(controller).to receive(:current_user).and_return(nil)
    else
      allow(request.env['warden']).to receive(:authenticate!)
      .and_return(user)
      allow(controller).to receive(:current_user).and_return(user)
    end
  end
end

RSpec.configure do |config|
  config.include AuthenticationSpecHelper, type: :controller
end

Awesome! Let’s write a few tests for the index page. 

To start off, we can check that when the user is logged in, only their own tasks are shown in the index view. We can also check that the right template is rendered. When the user is not logged in, we can check that they are redirected to the sign in page:

# spec/controllers/tasks_controller_spec.rb

require "rails_helper"
require "support/authentication_spec_helper"

describe TasksController do
  context "when logged in" do
    before :each do
      @logged_in_user = User.create(
        email: "some.email@some.domain",
        password: "some_password"
      )
      @other_user = User.create(
        email: "some.other.email@some.domain",
        password: "some_other_password"
      )

      sign_in @logged_in_user
    end

    describe "GET index" do
      it "assigns @tasks correctly" do
        task = @logged_in_user.tasks.create(content: "some content")
        task_by_other_user = @other_user.tasks.create(content: "some different content")
        get :index

        expect(assigns(:tasks)).to eq([task])
      end

      it "renders the index template" do
        get :index

        expect(response).to render_template("index")
      end
    end
  end

  context "when not logged in" do
    describe "GET index" do
      it "redirects to the sign_in template" do
        get :index

        expect(response).to redirect_to("/users/sign_in")
      end
    end
  end
end

And we can run the tests with:

bundle exec rspec -fd ./spec/controllers/tasks_controller_spec.rb

A couple of things to note here:

  • First of all, we are requiring the support/authentication_spec_helper at the top, which is giving us access to the sign_in helper to be able to log our user in.
  • assigns comes from the rails-controller-testing gem we just installed.
  • We are not using factories yet! We’ve written the test without use of factories so we can see the difference when we do introduce them. Let’s do that in the next section!

How can I create a Factory?

Right, so let’s begin by writing a User factory. In spec/factories/users.rb, we need to add some attributes to the default User factory so that the user object is initialised correctly. In this case, the model requires an email and a password, which we can define like this:

# spec/factories/users.rb

FactoryBot.define do
  factory :user do
    email { "some.email@some.domain" }
    password { "some_password" }
  end
end

Cool, let’s modify our existing tests to use the User factory. We can do that by changing the user definitions in the before block of our controller specs: 

# spec/controllers/tasks_controller_spec.rb

    ...
    before :each do
      @logged_in_user = create(:user)
      @other_user = create(:user)

      sign_in @logged_in_user
    end

The create(:user) statements call the User factory so that it generates a User object. However, when we run the specs with the same command as before, we will get a validation error telling us that the email has already been taken. Why does this happen? Well, the way the User factory is currently defined means that every time we call it, a new user will be created with the email some.email@some.domain. So when we try to create the second user, there will already be a user with that email address, which causes the error.

How do we fix this? (drumroll 🥁) I present to you (drumroll getting louder 🥁 🥁) sequences!

What are sequences?

Sequences are a great way to ensure that we comply with uniqueness constraints for factory fields that have them. In our example, there cannot be more than one user with a specific email address. We can ensure this condition is met by using a sequence like this:

# spec/factories/users.rb

FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "some.email#{n}@some.domain" }
    password { "some_password" }
  end
end

The way sequences work is that they create a counter for that specific field, which increments with every instance of that factory that we create. In the case of our email example, n will initially be 1, so our first user will have an email equal to some.email1@some.domain. When we create our second user, its email will be some.email2@some.domain.
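
To make that concrete, here is a rough sketch of what we would expect in a fresh test run (the exact counter values depend on how many users have already been generated by other specs or factories):

first_user = create(:user)
second_user = create(:user)

first_user.email   # => "some.email1@some.domain"
second_user.email  # => "some.email2@some.domain"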

Note that, for our second it block inside the first describe, we are executing the before block again, so we are creating two new users different from those used in the first it block. This means that the users will now have the emails some.email3@some.domain and some.email4@some.domain.

If we now run our specs again, we will see that the validation error from before is now gone!

Factories with associations

So, we’ve created a User factory and used it to improve our specs. However, we can still improve them further by creating a Task factory. In our factories directory, we already have a Task factory with some default content, but we still need to specify a user. The difference between the user field and the other fields we’ve seen so far is that this is an association. We need an actual user object to be able to populate this. How do we define associations? Like this:

# spec/factories/tasks.rb

FactoryBot.define do
  factory :task do
    content { "MyText" }
    association :user, factory: :user
  end
end

There is also a simplification we can make when the factory is named the same as the association:

# spec/factories/tasks.rb

FactoryBot.define do
  factory :task do
    content { "MyText" }
    user
  end
end

Every time we create a task using this factory, a new User object will also be created and associated with the task. This means that we can simplify our tests even further by not defining the second user explicitly:

# spec/controllers/tasks_controller_spec.rb

    ...
    before :each do
      @user = create(:user)

      sign_in @user
    end

    describe "GET index" do
      it "assigns @tasks correctly" do
        task = create(:task, user: @user)
        task_by_other_user = create(:task)
        get :index

        expect(assigns(:tasks)).to eq([task])
      end
      ...

Ok, this is great, and our tests are already looking much easier to read. But, there is one more thing about factories which will enable us to take our specs to the next level: nested factories! Let’s have a look at those in the next section.

What are nested factories?

We have a User factory, which is great. But, most times, we are going to need a user that has some tasks. This is where nested factories come in. A nested factory is a factory that inherits from and extends another factory. We can use this to create a :user_with_tasks factory that creates a User object with some tasks:

# spec/factories/users.rb

FactoryBot.define do
  factory :user do
    sequence(:email) { |n| "some.email#{n}@some.domain" }
    password { "some_password" }

    factory :user_with_tasks do
      transient do
        tasks_count { 5 }
      end

      after(:create) do |user, evaluator|
        create_list(:task, evaluator.tasks_count, user: user)
      end
    end
  end
end

Woah, wait a minute, what do we have here? Let’s break it down:

  • First, we have another factory definition within the :user factory. Inside that, we have a transient block. A transient block allows us to pass certain parameters to the factory that are not attributes on the model. This lets us pass in the tasks_count parameter to specify how many tasks we want that user to have; the default is set to 5, and we can override it per call, as shown in the sketch after this list. Technically, we wouldn’t need to have this block, and we could simply hard-code the number of tasks. But, hey, factories are supposed to be flexible!
  • Lastly, we have an after(:create) callback. This is a block of code that runs only after the User object has been created and saved. The tasks cannot hold a foreign key to a user that does not exist yet, which is why we need to create the user first and only then add the tasks to it.
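
As a quick illustration of the transient attribute in action, here is a sketch assuming the :user_with_tasks factory defined above:

# Uses the default of 5 tasks
user = create(:user_with_tasks)
user.tasks.count       # => 5

# Override the transient attribute for this one call
busy_user = create(:user_with_tasks, tasks_count: 10)
busy_user.tasks.count  # => 10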

Nice, so what would our tests look like now?

# spec/controllers/tasks_controller_spec.rb

    ...
    before :each do
      @user = create(:user_with_tasks)
      @other_user = create(:user_with_tasks)

      sign_in @user
    end

    describe "GET index" do
      it "assigns @tasks correctly" do
        get :index
               
        expect(assigns(:tasks)).to eq(@user.tasks)
      end
      ...

Amazing, this is so much easier to understand and shorter too! Ok, let’s complicate things just a little more.

Let’s complicate things a little more!

Ok, so normally a task can either be completed or not. So far, we have nothing in our app to set that state. To record whether a task is completed or not, we need a new column in our Task table:

./bin/rails g migration add_completed_to_tasks completed:datetime

This will generate a new migration in our db/migrate folder, adding a new datetime column that will either be nil – when the task is not completed – or hold the date and time at which the task was completed. If we open the migrate folder, we can edit the migration to make the field explicitly nullable and give it a default of nil:

# db/migrate/<some_timestamp>_add_completed_to_tasks.rb

class AddCompletedToTasks < ActiveRecord::Migration[7.0]
  def change
    add_column :tasks, :completed, :datetime, null: true, default: nil
  end
end

And we can now migrate:

./bin/rails db:migrate

To enable our users to mark the tasks as completed, we can add the following in app/views/tasks/_form.html.erb:

# app/views/tasks/_form.html.erb

    ...
    <%= form.text_area :content %>
    <%= form.label :completed, style: "display: block" %>
    <%= form.datetime_field :completed %>
  </div>
  ...

This gives the following result in the create task form:

And we need to permit the :completed parameter to be set by the user in the controller:

# app/controllers/tasks_controller.rb

  ...
  def task_params
    params.require(:task).permit(:content, :completed)
  end
  ...

We can also add a couple of scopes to our Task model to make it easier to get all completed/incomplete tasks:

# app/models/task.rb

class Task < ApplicationRecord
  scope :completed, -> { where.not(completed: nil) }
  scope :not_completed, -> { where(completed: nil) }

  validates :content, presence: true
  belongs_to :user
end

Ok, so we can now write some specs for this new feature. For example, we could test the scopes to see if they return the correct tasks:

# spec/models/task_spec.rb

require 'rails_helper'

RSpec.describe Task, type: :model do
  describe "scopes" do
    describe ".completed" do
      it "returns only completed tasks" do
        completed_task = create(:task, completed: Time.now)
        not_completed_task = create(:task, completed: nil)
        
        expect(Task.completed).to eq([completed_task])
      end
    end

    describe ".not_completed" do
      it "returns only not completed tasks" do
        completed_task = create(:task, completed: Time.now)
        not_completed_task = create(:task, completed: nil)

        expect(Task.not_completed).to eq([not_completed_task])
      end
    end
  end
end

And we can run them with the command:

bundle exec rspec -fd ./spec/models/task_spec.rb

Let’s see how traits can help us to simplify this spec!

What are traits?

Traits are extra labels that we can pass to our factory to change the way it builds the object. They are useful when we have small-ish characteristics (or sets of attribute values) that we want our object to have, which may define a particular state, for example. They save us from having to create a whole new factory when there is only a small difference from the “basic” factory, avoiding duplication. In our example, we can create a :completed trait that sets the completed attribute to a time in the past:

# spec/factories/tasks.rb

FactoryBot.define do
  factory :task do
    content { "MyText" }
    user
    completed { nil }

    trait :completed do
      completed { 1.day.ago }
    end
  end
end

Then our tests become:

# spec/models/task_spec.rb

require 'rails_helper'

RSpec.describe Task, type: :model do
  describe "scopes" do
    describe ".completed" do
      it "returns only completed tasks" do
        completed_task = create(:task, :completed)
        not_completed_task = create(:task)

        expect(Task.completed).to eq([completed_task])
      end
    end

    describe ".not_completed" do
      it "returns only not completed tasks" do
        completed_task = create(:task, :completed)
        not_completed_task = create(:task)

        expect(Task.not_completed).to eq([not_completed_task])
      end
    end
  end
end

Well done, you now master the basics of factory_bot! There are so many more things you can do with this amazing library, but these are some of the most commonly used. In the following section, we will see a short note on build strategies and how you can use them to speed up your tests.

Build strategies

Build strategies are different ways that you can construct objects from a factory. For example, some of the most commonly used strategies are:

  • create – This strategy creates an object and saves it to the database. This strategy invokes, in order, the after_build, before_create, to_create, and after_create hooks.
  • build – This strategy creates an object as you would with .new, but does not save it to the database. As a result, all database-related things such as IDs are not provided for the built object. This strategy invokes the after_build hook.
  • build_stubbed – This strategy creates a fake ActiveRecord object. In other words, it is an object that pretends to be persisted to the database (has ID, created_at, updated_at, etc.) but is not actually saved to the database at all.

These strategies each have their advantages and disadvantages. For example, the build and build_stubbed strategies do not require interaction with the database, which means they are much faster. Tests using these strategies instead of the create strategy will take less time to run. However, sometimes we cannot get away with not persisting objects, for example when we want to test scopes.
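
To make the differences a little more tangible, here is a rough sketch using the :task factory from earlier (the id values shown are purely illustrative):

task = create(:task)                 # persisted to the database
task.persisted?                      # => true
task.id                              # => 1 (a real database id)

built_task = build(:task)            # like Task.new with the factory's attributes
built_task.persisted?                # => false
built_task.id                        # => nil

stubbed_task = build_stubbed(:task)  # pretends to have been persisted
stubbed_task.persisted?              # => true
stubbed_task.id                      # => 1001 (a fake id assigned by factory_bot)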

One case in which we can use the build strategy in our tests is to test the completed/not completed state of our tasks. At the moment, we consider any task to be completed if it has a completed date different from nil. However, it would not make sense to consider a task to be completed if its completed date is in the future. We can add a simple method in our Task model to determine if the task is really completed or not (we also need to update our scopes):

# app/models/task.rb

class Task < ApplicationRecord
  scope :completed, lambda {
    where('completed <= ?', Time.now)
  }
  scope :not_completed, lambda {
    where('completed IS NULL OR completed > ?', Time.now)
  }

  validates :content, presence: true
  belongs_to :user
    
  def completed?
    !completed.nil? && completed <= Time.now
  end
end

We could now update the specs for the scopes, but we will focus on adding new specs for the completed? method so we can see some of the build strategies in action:

# spec/models/task_spec.rb

  ...    
  describe "methods" do
    describe ".completed?" do
      it "returns false when completed is set to nil" do
        not_completed_task = build(:task)

        expect(not_completed_task.completed?).to eq(false)
      end

      it "returns false when completed is a date in the future" do
        not_completed_task = build(:task, completed: 1.day.from_now)

        expect(not_completed_task.completed?).to eq(false)
      end

      it "returns true when completed is a date in the past" do
        completed_task = build(:task, :completed)

        expect(completed_task.completed?).to eq(true)
      end
    end
  end
end

If we now run and time this block of specs 10 times with the build strategy and another 10 times with the create strategy:

for i in {1..10}; do bundle exec rspec -fd ./spec/models/task_spec.rb:24; done

We see that the average time taken with build is noticeably smaller than that taken with create (take into account that these times include loading the file). With more complex tests, imagine how important that may be! In my case:

  • Average time taken with create: 2164 ms
  • Average time taken with build: 2030 ms

That’s a difference of 0.134 seconds for just three simple tests!

Note: if you want to try timing the commands yourself, this might be useful for that.

Conclusion

So, in this post you’ve learnt how to use factories to make your specs shorter, more concise and more readable: the things (model attributes, states, etc.) that matter to a specific test case are made obvious by the code itself, while everything that is necessary but not relevant is handled behind the scenes by factory_bot. You’ve also learnt to use factories to speed up your tests by not persisting model objects to the database when that is not needed. As a bonus, you’ve now started a cool tasks project that will help you keep much more organised in the future! Feel free to extend and improve it in any way you can think of (and don’t forget to keep up the coverage 😁). In case you want the complete project, check out the GitHub repository here.

References

Creating a simple to do app using rails

Adding associations to models (StackOverflow)

Authenticating pages (GoRails)

Getting started with factory_bot (Thoughtbot GitHub)

Simulating a login with Rspec (StackOverflow)

Getting Sequential: A look into a Factory Bot Anti-pattern (Thoughtbot)

Transient blocks in factory_bot factories (StackOverflow)

Build strategies (Thoughtbot)

How to survive imposter syndrome in your software engineering internship

Posted on August 22, 2023

Hi there! My name is Fiona, and I am an intern at FreeAgent. I got my offer and dodged returning to the monotony of bar work that I endured last summer. However, I was only in my second year of studying Computer Science. How was I chosen from what must have been a sea of applicants? They must have overlooked someone! Had I lucked out? In hindsight, this was a classic case of imposter syndrome. Fortunately, my internship experience here has helped me lose the misconception that I was not a ‘good programmer’. Rather, my idea of what makes a good programmer has changed from what I thought at university.

Together over alone

University-wise, programming was often a solo venture. The drill was you were given an independent coding problem and a due date to get to the bottom of it. How do you go about solving it? For me, this meant searching everything down to the bare bones, from Stack Overflow threads to the rabbit hole of course readings. I could spend hours twiddling with one line of code only to realise the problem lay elsewhere! Or, after nitpicking the documentation my lecturer recommended, I would realise that I was looking in the wrong place for my answer. That’s why after all the hardship of figuring out the problem, I came away with an answer that felt right. There was, however, the grading step. These grades sometimes felt like a punch to the gut. You could make one minor mistake and that would be a mark off. I would kick myself over small things. “This was so obvious, how could you not notice?” Any pride I once had in my work disappeared. This grade was attached to me, to my worth as a programmer. Could I aspire to go into programming if a handful of people in my class got higher scores? 

My internship is different. Here, my team will look over my code before it can be shipped off. This is what we call code review. Your code is slotted in like a brick, checked to see whether it works in tandem with the rest of the building. And I found this very daunting at first. It was like waiting for a grade. At university, when I got things wrong, I felt less justified in calling myself a programmer. Throughout my internship, I have got things wrong. The difference is that, rather than being pushed on to the next problem, I get to go back.

I think it’s when you get that wiggle room to ‘fail’ that you start learning. It’s not like returning to an old exam paper and saying, “I still have no idea what this question was asking, but I can’t go back now”. Code review allows me to pore over comments and wonder what is going on with my code. How can I write this in fewer lines? What in the documentation can I use to make this simpler? Is there a function that does this? And when you find the answer, you come away happy that you learned something.

It is also good to see coding as a matter of style, with everyone tackling the problem differently. In code reviews, it is okay to not understand – or even disagree with – the way you are being told to do something. I could try to fix an issue the way I was told, only to find out that there is a better way. It showed me it is okay if some approaches do not work. It is questioning and experimenting that allows you to learn a new and better way of doing things. This felt unlike university, where there would be one agreed way of answering the problem and no debate around it. To solve problems here, I have to uncover the best approach. Would it be one that my team had shown me before, or would I have to solve it with a novel spin? I get to talk to my team about how to approach the problem. Being heard makes me feel validated and capable as a programmer.

Slack over stack

The image I had of a ‘good programmer’ began to change. In my mind, I saw the ‘better programmers’ in my course as quietly working away, never piping up with a question since they already had that wealth of knowledge in their heads. University is competitive. It can be more daunting to go to people who know more than you as it feels like a resignation. Even if you get over that hurdle, if your classmate gives an explanation you don’t understand, the idea that you ‘Don’t get it’ is reinforced. They are better programmers than you.

One thing a team member said to me that stuck was that a lot of software engineering is not about knowing everything, but rather knowing who to ask for help when you are stuck. At university, having questions could make me feel dumb, especially when there are people who get it, just like that. It made me feel like asking those burning questions in my mind would expose my incompetence to the class, so I would usually let them go and read up on it later. 

The truth is that if it was really that simple for people to just read up on every concept they didn’t get, I would probably not get the chance at this internship! The tech industry knows that with the world of programming constantly expanding, it is unrealistic to think that there will be someone out there who knows it all. That’s why there are teams of programmers. The rock star developer who knows everything does not exist. 

I think that’s the reason why daily standups keep teams all on the same page. Standups are short and snappy. In a way, standups can humble you – they show you that it is okay not to know everything. This invalidates the voice telling me, “If you ask questions, you will sound dumb”. Every one of my team members has asked for help, even the senior engineers. I think this makes sense – my team all had the same goal. We were brought together because these problems did not have an easy solution. We all have different coding styles and while one person’s style might work for one problem, it might not work for another. So, instead of struggling alone, you can ask to work on the problem with someone else whose strengths lie in that area.

When you learn anything, you have to delve into areas that are not your forte. I thought that it wouldn’t be okay to admit to a weakness – as I knew there were people out there who could navigate these areas with no problem. If they could do it quickly, surely it was a ‘me problem’ that I had to work on. However, when I spent so much time on these weaknesses and didn’t get it, I got frustrated. What was I missing? 

What I was missing was realising that it would be okay to go to the people who know more. When you get to hear from someone who’s in the know about your problem, it can show you a new way of thinking. Not only will this new approach help you solve your current problem, but it can also be used as an insight for those future hurdles! 

I also came to realise that my team wanted me to code as much as I did. In my internship, my team gave me a buddy. My buddy is like my mentor, happy to be the first port of call when I need help. My buddy is extremely good at stepping back from the coding side of a problem to focus on my understanding of it. They also know when to bring in someone else for input. This encourages me to ask for help A LOT. I know my team would be there to give me the answer I wanted, not one convoluted with technical jargon or with zero lead-ons (sorry, Stack Overflow)! I feel proud of all the code I have shipped off because I understood what I had done.

Letting go to learn

Now, if you were in the same situation as me before I started my internship and are reading this blog post, you might be thinking, “That’s good and all, but I still don’t know if I can do this!”. I get it. It may sound counterintuitive to say you should put yourself out there – after all, what if you mess up? I won’t deny that possibility. You will mess up – but that is how you will get better. Every time I have messed up, my team has been there to pick me back up. Although it did feel humbling the first few times, I soon realised that I had people around me who wanted me to grow. It was unlike the competition I was used to at university. I was able to let go. The number of messages and Google Meets I’ve been in has gone through the roof! And it was in that letting go that I was able to learn. That’s how I started to feel like a programmer.

My internship project: from ideation to implementation

Posted on August 21, 2023

When I started my internship at FreeAgent, I had no idea what to expect. I read previous blog posts and saw that interns get up to a wide range of things! I spent my first weeks here doing onboarding tasks and fixing small bugs. Once I’d started to get familiar with the codebase and the way FreeAgent works, it was time to start my project. 😮

My team manages Sudo, FreeAgent’s admin system where we can manage the accountancy practices and companies who use our app. One of Sudo’s most important pages, used by our support and sales teams to see information about accountancy practices, was in a state of disarray. A lot of information was squashed into a single page, making it difficult for our staff to use. My task was simple: redesign the page to make life easier for our incredible staff!

But where was I going to start? I hadn’t used this page myself, so I had no idea who used it and why. So I thought I’d start by talking to the people who use it most. 

A screenshot of a page containing information about an example accountancy practice. There is a lot of information crammed onto this very long page.
A very zoomed-out view of the original page, shown for a small dummy practice – for our larger practices, the page was many times the length of this!

Interviews

Firstly, I had to figure out who in the company actually used this page. My team advised me to search the logs: this required me to learn where our logs are stored and how to query them, but I got there eventually! I found that the actions on the page were mainly being performed by members of a few specific teams, so I messaged those teams to arrange interviews.

A view of FreeAgent's logs, showing a list of records with timestamps. There is also a bar chart which shows when each of these records has occurred: as we'd expect, most of the records occur between the hours of 9-5.
Searching the logs to show me all the times that someone has used the page to edit an accountancy practice. For each record in the bottom right, I can see more information including the person who performed the action.

I started each of my interviews by asking the person to talk through how they used the page. Not only did this help me with identifying requirements, but it also gave me a glimpse into the workings of different teams within FreeAgent. I then asked what features on the existing page they used the most and least, and if they had any suggestions for improvements. Everyone I interviewed was really cooperative and brought lots of suggestions! Sure, it was a little nerve-wracking talking to lots of new people, but my buddy was there to support me and helped provide context on anything that I didn’t understand. After five interviews, I wrote up my findings. 📝

A screenshot of an Excel table. Each column shows the features used by a different team at FreeAgent: Practice Support, Sales, Finance, Sales Operations and Sales Implementation. The records of the table describe different features from the page: for example, Invoices, Notes, and Edit Practice. These are categorised into 'Use every time', 'Use a lot', 'Rarely use' and 'Never use'.
A table summarising my interview findings

Spec

Next up was the tech spec. A technical specification document is where you lay out your motivations and plans for the project – Brad wrote a great post on the topic last year. Luckily, my team had done a similar page redesign the previous year, so I was able to take some inspiration from their spec. I had to answer the following questions:

  • Why am I doing this project?
  • What are my main goals?
  • What have I discovered from my research?
  • How am I implementing the project?
  • What could go wrong?
  • What questions do I still have to answer?
  • How will I know if the project has been successful?

At this early stage, I couldn’t fill out all the details. I didn’t know exactly the structure of my new page, but I had a rough idea – the finer details would resolve themselves during implementation.

Making a plan with Trello

Trello is a Kanban board tool that we use a lot here at FreeAgent. Kanban is a popular method used in Lean software development that uses lists to visualise and manage work. It’s a great way to organise our work as a team – hence why we use it so heavily – but it was also helpful for tracking the progress of my solo project. I reflected on all of the features I wanted to implement in my project and split them into cards. I assigned an urgency rating to each card, based on what I’d heard in my interviews – features that would help everyone, or would really significantly help one team were marked as most urgent. I also assigned a complexity score via a story scoring meeting – I’ll get to that in the next section. 😉

In my ‘Up Next’ list, I sorted the cards by prioritising ones with higher urgency and lower complexity. There were also a few requested features that I wanted to work on, but I knew that I had limited time, so I created a ‘Larger Changes, If Possible’ column for these. If these are left over once I’ve finished my project, I’ll add them to my team’s Trello board so someone will get round to them – teamwork makes the dream work! 🚀

Overall, this tool was a game-changer – I’m definitely going to use Trello to organise my work once I’m back in uni!

A screenshot of a Kanban board with four columns: Up Next; In Progress; Done; and Larger Changes, If Possible. Each of these columns contains many cards for different features: for example 'Add search feature to companies table' and 'Create overview tab'. Each of the cards has coloured badges showing its difficulty and urgency.
My Trello board for the project

Story scoring

While I was putting together my Trello board, I was also writing user stories. Each user story is a description of how an end user would use our new feature, along with a list of acceptance criteria. After I’d written these, I got my team together for a story scoring meeting. Story scoring is a technique used in Agile to estimate the complexity of user stories, which helps us with planning and time allocation. As the meeting host, I explained each story I’d written and answered any questions the rest of the team had. Then we all scored them based on how long we think they’re going to take.

Now, here’s the interesting part: we use a Fibonacci scale for scoring (1, 2, 3, 5, 8, 13, etc). You might be wondering why Fibonacci: I was too! I had a quick Google and found an interesting article that explains how using the Fibonacci sequence allows us to better perceive the differences between different scores. Each number in the sequence is roughly 60% larger than the previous one, so the jump from 5 to 8 feels like it carries the same weight as the jump from 8 to 13. Pretty cool, right? 

When the time comes to reveal our scores, we play ‘planning poker’. Everybody decides on their scores separately, and then we all reveal them at once. This avoids the team members influencing each other’s decisions. Then comes the most useful part, where everyone discusses why they gave the score they did. My team members all have much more experience than I do, so hearing their opinions is incredibly valuable. There were always points brought up that I hadn’t considered at all – sometimes I would assume a story would be easy, but then somebody would raise a point that complicated everything! On the other hand, I might rate a story highly because I’m clueless about how to tackle it, and somebody else would tell me that it’s already implemented elsewhere in the app. It’s all about getting everyone’s perspective and combining knowledge to decide on the best score. ✊

Continuous feedback

After I had scored my stories and set out my plan, I was ready to dive into the code! But of course, this didn’t mean that the details of the project were set in stone. The last thing I wanted to happen was to work on the code for months, finally release it, and then find out that I’d made the page worse for my users. 😟

To avoid this, I decided to keep in contact with the people I interviewed and seek their feedback while I was making changes. The logistics of this were more complicated than I thought: I didn’t want to deploy a half-finished project to everybody, but I wanted a few people to be able to see the changes I was making. Luckily, we have these nifty things called feature switches! This means that my new version of the page was hidden behind a switch, and I could choose who I wanted to turn it on for. Everyone agreed to test out the new page, so I turned the switch on for them and can now use their feedback to improve my project.

So, that’s where I am now! I have a few weeks left of my project to work on making the more difficult changes, and to make this page the best it can be. I’ve learned so much throughout the course of this project – it’s been a great summer of learning and growth surrounded by the best people. 👫👭👬

Fireside chats about tech careers and automation paranoia


I read a book called Coders at Work by Peter Seibel before starting university. Inspired by the format of Jessica Livingston’s Founders at Work, each chapter features an interview with an accomplished programmer. The interview style of writing feels like a personal conversation, offering a rare look into the thought processes of some very impressive people.

I’m currently in my second internship at FreeAgent. Last year, I wrote a blog post reflecting on the engineering lessons I learned from co-founding a startup.

When I was looking around the FreeAgent office, a copy of Coders at Work caught my eye. While Seibel took two years to conduct 15 interviews, I thought I’d try something similar, albeit on a smaller scale. I managed three interviews: an intriguing chat with a physicist turned CTO, another with a former engineer who transitioned into management, and a third with a data scientist who was involved in experiments at CERN.

Hopefully people find this interesting. I want to say a massive thank you to Graeme, Dave and James for taking the time to speak with me. 

Interviews

TL;DR:

Graeme, Dave and James share their unique pathways into the tech industry. Their backgrounds and experiences are diverse and interesting. Despite the threat of artificial intelligence, they all stress the importance of human judgement and problem-solving. Their insights highlight that while tools and methods might change, the core ability to understand, define and address real-world problems will always be central to what we do.

Graeme Boyd

Angus: How did you learn to program?

Graeme: My parents bought a computer when I was about five years old. I remember typing something in BASIC with my dad. I was always fascinated with writing programs and making the computer perform tasks. I’ve used many programming languages now.

When the web started emerging while I was at university, I started building websites. The first programming task I was ever paid for involved JavaScript. But to be honest, working with JavaScript in those days was quite painful.

Interestingly, I didn’t study computer science; I’m actually a physicist. But I’ve been programming for a significant portion of my life.

Angus: Was your transition into leadership intentional? Did you always aspire to be a CTO?

Graeme: Since my academic background is in physics, I wasn’t sure if a career only in programming was for me. But I wanted to build things and solve problems. 

I actually had a different career in medical research. I developed microscopes, CT systems and the accompanying 3D software. After around six years doing this, I spent a few months creating pharmacy software using Ruby in the developing world.

When I returned, I joined a little startup called FreeAgent. 

Then, as FreeAgent grew quickly, we reached a point where Olly (FreeAgent’s co-founder and first CTO) was managing too many people. So the engineering team was split in half, and I effectively became FreeAgent’s first engineering manager.

Later, I joined another fintech company in Edinburgh as CTO. Since this company was a lot smaller than FreeAgent, my responsibilities were similar to that of lead developer and CTO. This experience taught me a lot about what the role of a CTO looks like.

After around three years of being CTO there, I returned to FreeAgent as director of engineering before becoming CTO of FreeAgent.

Angus: Do you feel having a varied career benefited you?

Graeme: Definitely. When I wanted to be a researcher or a scientist I couldn’t really see beyond that. University often paints a picture that your life’s going to go in one direction. But I’ve now had three careers – a medical researcher, a software developer and now as a CTO. All of these experiences have given me different perspectives. What I do now (management) is very different from programming.

Angus: Do you think programming might be automated soon?

Graeme: [Gestures to the copy of Coders at Work sitting on the table between us.] I don’t think programming is going to go away, but it could look very different. Just like the people in that book were using punch cards and writing machine code by hand; that way of working feels very old-fashioned now. Programming has generally evolved over time to become more and more high level and I see current AI tools as a continuation of that trend.

I think what we’re seeing is, knowing the syntax inside out is less important than it used to be. When you wanted to do something in a language or framework, you used to go away and read manuals or textbooks. This process became faster when Google search results started returning sites like Stack Overflow.

The systems that currently exist, like GitHub Copilot, are impressive but they’re still quite limited. That’s not to say these systems won’t improve, but I think there will always be some type of programming needed.

Angus: It was interesting earlier that you said management is very different from programming. How do you approach major technical decisions, especially when the stakes are high?

Graeme: We often talk about “one-way doors” and “two-way doors”, which essentially means irreversible and reversible decisions. When facing an irreversible decision, you need to think very carefully about it. 

Sometimes you look back on situations and think “we made the wrong decision there”, but often you made it for the right reasons. It’s quite hard to learn from decisions without making any, though.

Angus: What are your thoughts on AI-assisted decision making?

Graeme: One of the issues with AI at the moment is you can’t determine how a model’s arrived at a decision, which makes it a bit like a black box.

To be able to trust something like that I feel you need to get to a point where you can ask: “How did you come to that decision? What’s your logic?” Maybe that fits in more with me and how I like to reach the basis of understanding something. Much of my job these days involves challenging people’s assumptions or asking them to walk me through what they’ve decided. Sometimes in doing that, we pick up new understandings and things we’ve missed through our own intuition.

It’s rare as CTO that it’s me coming up with the final solution to something. You need to build trust in people and I think the same would go for anything involving AI-assisted decision making. You would need to see it has clearly made correct decisions on similar things in the past.

Angus: With fear of automation, are there any skills you feel will still remain relevant in years to come?

Graeme: Why are we writing software? Usually to solve some problem or make someone’s life better. The hard part of software engineering has never really been about the code, but rather solving the problem.

If I go back to 20 years ago when I was freelancing, the hardest thing was understanding what people wanted. The same still rings true today. Often, users give you the solution first, like “I want a system that does this”. Getting them to explain what the problem is for me is the difficult bit. Why do they need this in the first place?

No artificial intelligence is ever going to be able to replace that process.

Dave Evans

Angus: What’s your journey been like to date?

Dave: I studied physics as an undergrad then did a PhD in particle physics. As part of my PhD I moved to Geneva to work at CERN on the CMS experiment.

After that, I wanted to get a postdoc and went to UCSD. I was in San Diego for around six months before another four years or so at CERN.

It was a really interesting time. Part of the appeal was getting to travel. I got to participate in some of the first measurements at the LHC (Large Hadron Collider). I also contributed to the search for the Higgs boson. It was a lot of fun but plenty of late nights and stressful times.

When my postdoc was finishing up, I was thinking about leaving the field to do something different. The field of data science was starting to become more popular.

I spoke to a few different companies but I was impressed with what FreeAgent was doing and I liked the idea of living in Edinburgh. When I joined FreeAgent I was the only person doing data science. I’ve been glad to be able to build up the data function since then.

I got to lead the introduction of some interesting technologies, like the machine learning models that we’re currently using in production.

Angus: That’s fascinating. Did your background in physics help your transition to data science? It sounds like it was a natural progression.

Dave: Absolutely. I think there are a few different dimensions to that. When I was considering a move to the commercial sector, which is often referred to as “industry” in the academic world, a friend mentioned that there’s a stereotype that physicists just relax and drink tea all day. That’s far from the truth. Success in the academic world is hard work.

The problems you’re trying to solve in physics are typically very difficult. I was working with around 3,000 colleagues, and a very large organisation of people has its own internal structure and politics. You need to get to know people and how things are done.

On the technical side of things, experimental physics revolves around data – recording, analysing and using it to deduce information about our physical world. It involves skills like statistics, programming, data analysis and visualisation, which are essential in data science. I was also working on problems that involved machine learning; although the jargon changed and we didn’t call it machine learning back then. The same skills I gained working on physics problems carried over to my work here at FreeAgent.

Physics is all about seeing the simplest, quickest way to get to an answer in a very difficult situation. Then using that new understanding to infer something about the physical world around us.

Angus: As someone primarily focused on software engineering, I’m curious: what does a typical day look like for a data scientist?

Dave: Drinking cups of tea? No, I think it varies. The title “data scientist” is broad. Its meaning varies from one company or even one team to another. 

At FreeAgent, a data scientist typically works on building and optimising machine learning models and deploying them in production. This role involves machine learning, building infrastructure as code, and a deep understanding of our customers and their needs.

Tasks could involve selecting the right training data or choosing the appropriate technology for the desired prediction accuracy. Then building monitoring, deployment and inference infrastructures to measure the model’s performance.

Angus: Something I keep seeing when researching AI is model explainability. Why can explaining a model’s output be so challenging?

Dave: Explainability depends on the context and the model’s intended use. For example, predicting insurance premiums based on the price of a house requires a simple model that’s easy to explain. But understanding the exact reasoning behind a highly sophisticated model’s response can be much more challenging.

It’s part of a data scientist’s job to assess the importance of explainability against other factors like model accuracy, evaluation speed, and regulatory or ethical requirements. Sometimes, it might be as simple as drawing a line through data points, or at other times, as complicated as seeking answers from massive models. The key is to determine what’s most suitable for the problem at hand.

Something like linear regression, which has been used in the financial sector for decades, is straightforward to explain.

But on the other end of the spectrum, you have things like LLMs (Large Language Models) and generative AI. Now you’re talking about something with billions of parameters, trained on an enormous amount of data that’s maybe not that well understood.

I think it’s unfair to treat all techniques the same. Some models will be easy to explain and others not so.

Angus: With new technologies, there’s often an initial surge of interest, followed by a reality check regarding the actual costs vs benefits.

Dave: Technology often follows a pattern of hype cycles. When I was stepping into the data science world, “big data” was a massive buzzword. The industry was grappling with its definition, and for many, it ended up meaning an expensive engagement with a consultancy for a tech solution they didn’t truly need.

Not every problem requires an elaborate solution, and it’s important to find the most straightforward, cost-effective method for the task at hand. While there’s undeniable excitement around LLMs right now, they’re not a one-size-fits-all answer.

Just like with “big data”, or the rise of blockchain and cryptocurrency, every few years, a new technology comes along that’s touted as the next big thing. While they may have some lasting impact, it’s often not in the way we initially envisaged.

Angus: I’m starting to accept that AI could start automating some tasks in software engineering soon. But I think challenges in search of solutions will stay. If AI can help us get through the tedious parts faster, that’s a win.

Dave: There’s a fallacy in economics which I think is called the lump of labour fallacy. To say artificial intelligence is going to take over all jobs assumes there’s a fixed amount of work. New technology changes the landscape of jobs; while some may vanish, others emerge. The key is net productivity increase, creating more opportunities and jobs. 

With the introduction of machines during the industrial revolution, there was a predicted dramatic drop in working hours that never materialised. 

The important thing is how we decide to use AI – whether we use it to benefit society or cause unforeseen harms, like the negative impacts of social media on mental health. The dangers lie not in AI itself but in how we use and shape it in a societal context.

James Bell

Angus: What was your path to becoming an engineering manager?

James: I went to university to study computer engineering. It was a mix of high-level electronics, digital electronics and low-level programming. It was the kind of degree designed to teach you about building a mobile phone and the operating system the phone runs on.

Throughout my career, I’ve always had an interest in system-level thinking. Rather than focusing on individual projects from start to finish, I’ve always liked the bigger picture – understanding how things interconnect and how people collaborate within these systems.

I spent time at Yahoo! and a few smaller companies. When I first joined FreeAgent, I worked as an engineer for about five years. Transitioning into a management role was initially daunting for me. Despite having a decade of engineering experience, I felt like a beginner again when I first became a manager.

Angus: What initially drew you to software engineering? Was it passion?

James: I’ve always liked messing about with computers. I remember building Linux machines and assembling hardware in 1996/97. I wasn’t the typical tinkerer but I enjoyed the problem-solving element, much like writing. Can I solve this problem in a unique way and express it somehow?

I also realised I have a passion for helping others. It’s not just about being grand and noble; it’s simply about enjoying the act of helping people.

Angus: Have you witnessed significant shifts or advancements in the industry?

James: While there’s always noise about new trends, a lot are just repetitions of past concepts. 

There’s been a massive shift in how people work and collaborate. The number of quality developer-friendly tools has been a big change.

Things like distributed version control. But also something like Rails, right? There’s a lot of stuff Rails does for you. You should eventually learn how this stuff works under the hood. But you can get quite a long way without worrying about it.

Angus: Do you think software engineers will be automated?

James: It might be the tools are different and the places they go to are different. But problem-solving and the ability to translate real-world processes into digital ones will always be essential.

Angus: As an engineering manager, are there things you do that you’d like to see automated?

James: While I think much of the coordination work I do is essential and hard to replace, some tasks like scheduling aren’t significant concerns. Better alerting and monitoring systems that actively work with you could be useful. Also generating reports, especially ones I find not inherently valuable to construct but useful to analyse. This would free up more time for more valuable tasks.

Angus: What about automation? Do you think we should be worried?

James: I think it’s good to be cautious, but many of the widespread fears about the death of certain professions due to automation are likely exaggerated.

I think the fear isn’t really about automation but more about capitalism. People worry their jobs might be replaced by cheaper, possibly lower-quality solutions. Given the track record of the companies currently driving automation, this makes sense. There’s certainly cause for concern. However, automation could also have a lot of positives.

Angus: Does this remind you of anything you’ve seen before?

James: AI does feel a bit different. It seems more widespread, potentially affecting various sectors and professions in one large sweep.

When the second iteration of Apple’s iOS devices introduced the App Store, there was a genuine fear among web developers that mobile apps would render websites obsolete. Over time, both platforms found their balance and continued to grow.

A few years back, people were talking about cryptocurrency. People believed it was going to revolutionise databases, enable zero-trust environments and reshape our financial systems. While it made waves, I would say it’s yet to have the same impact people anticipated.

When FreeAgent launched, many accountants felt threatened. They believed we were automating their jobs. We always emphasised our goal was to remove the tedious aspects of their roles, not replace the expertise they offer.

Combining data from different sources with SageMaker pipelines

Posted by on August 2, 2023

Generating datasets for machine learning

Preparing data and generating datasets is a crucial step in training a machine learning model. If you are lucky, your data might come from a single .csv file. In most cases, however, pulling together the input features to train your model will require combining datasets from different sources, and doing this manually can be a time-consuming, error-prone process.

At FreeAgent, we know that regularly retraining our machine learning models with additional, more recent data is key to maintaining model performance. With this in mind, we decided to automate the generation of the files used to train and evaluate the model.

In this post we will describe how we can use a SageMaker pipeline to combine files from different sources. We will show how to use a SageMaker processing job to query data in Athena and Redshift and combine the query outputs with other files saved in S3 to generate training and evaluation datasets.

When we worked on this project we found it difficult to find documentation to help us build this pipeline. We hope the following example will be useful to someone else; please do not hesitate to let us know if you have any comments or questions.

SageMaker pipelines

SageMaker pipelines are a feature of Amazon SageMaker that allows you to create end-to-end machine learning workflows composed of steps.

We had already used a SageMaker pipeline to train our model, so it made sense to use a similar approach to create our datasets.

Workflow

The purpose of this pipeline is outlined in the workflow below. In our case we needed to build a proof of concept model by combining data from three different sources:

  • Customer attributes stored in CSV format in S3
  • Standard FreeAgent feature usage data from our Redshift data warehouse
  • Highly nested data returned by a third party, queried using Athena

The outputs of the queries are saved in S3. A Python script then combines the query outputs with the additional CSV files already in S3. This script runs in a processing step, a type of step that can be used in a SageMaker pipeline workflow to run a data processing job. The combined data is then split into training and evaluation datasets, which are saved to a specific location in S3.

Setting up the query strings

When selecting specific features for a machine learning problem, queries can be very long and impractical to hard-code in the SageMaker pipeline. An alternative is to save the Athena and Redshift queries in your project, in a query.athena.sql and query.redshift.sql file respectively, and use the following function to read them into a variable.

def read_query_string_from_file(file_path):
    # Read the SQL query saved in the given file into a single string.
    with open(file_path, "r") as sql_file:
        query = sql_file.read()
    return query

Each query string can then be read with the appropriate formatting. Note that the formatting of the Redshift query was particularly sensitive to single and double quotes; if your query is not formatted correctly, your pipeline will fail. If possible, we would recommend testing your query directly in the Redshift query editor before testing your pipeline.

athena_query_string = read_query_string_from_file("query.athena.sql")

redshift_query_string = read_query_string_from_file(
    "query.redshift.sql"
).replace("'", "\\'")

Configuring the DatasetDefinition

We mentioned before that we will process the data in a processing step. Each input to the processing step is either a file loaded from S3 (S3Input) or a DatasetDefinition. The DatasetDefinition types support data sources that can be queried via Athena and Redshift.

AthenaDatasetDefinition

We can now use the query string above in an AthenaDatasetDefinition input to the pipeline. The AthenaDatasetDefinition has the following required attributes:

  • Catalog: the name of the catalog used in the Athena query execution
  • Database: the name of the database used in the Athena query execution
  • OutputFormat: the data storage format for the Athena query results
  • OutputS3Uri: the location in Amazon S3 where the Athena query results will be stored
  • QueryString: the SQL query statement to be executed

The Athena dataset definition does not support CSV as an output format. In this example we chose the `TEXTFILE` OutputFormat, where the outputs are saved as compressed Unicode-encoded files.

from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition

def athena_query_output_dataset():
    return AthenaDatasetDefinition(
        catalog="data_catalog",
        database="database_name",
        query_string=athena_query_string,
        output_s3_uri="s3_uri_query_output_location",
        output_format="TEXTFILE",
    )

RedshiftDatasetDefinition

We can also use the query string above in a RedshiftDatasetDefinition input. The RedshiftDatasetDefinition has the following required attributes:

  • ClusterId: the Redshift cluster identifier
  • ClusterRoleArn: the IAM role attached to your Redshift cluster that Amazon SageMaker uses to generate datasets
  • Database: the name of the Redshift database used in the Redshift query execution
  • DbUser: the database user name used in the Redshift query execution
  • OutputFormat: the data storage format for the Redshift query results
  • OutputS3Uri: the location in Amazon S3 where the Redshift query results will be stored
  • QueryString: the SQL query statement to be executed

You also need to make sure that your db_user has permission to query the table in Redshift.

from sagemaker.dataset_definition.inputs import RedshiftDatasetDefinition

def redshift_query_output_dataset():
    return RedshiftDatasetDefinition(
        cluster_id="cluster_id",
        cluster_role_arn="cluster_role_arn",
        database="database_name",
        db_user="user",
        query_string=redshift_query_string,
        output_s3_uri="s3_uri_query_output_location",
        output_format="CSV",
    )

Putting everything together in the processing step

Once we have configured the Athena and Redshift DatasetDefinitions we can create a processing step. To create a processing step we need a processor, which defines the environment our processing script runs in (such as the container and the type of instance). In the example below we use an SKLearnProcessor, which allows you to create a processing job with scikit-learn and its dependencies available. You can also customise your processor with your own Docker image (with your specific dependencies) using the ScriptProcessor from sagemaker.processing.

from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1",
    role="role_arn",
    instance_type="ml.t3.medium",
    instance_count=1,
    command=["python"],
    base_job_name="generate-datasets",  # SageMaker job names only allow alphanumerics and hyphens
)
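
For comparison, here is a minimal sketch of the ScriptProcessor alternative mentioned above, assuming you have already built and pushed a custom Docker image to ECR (the image URI, role and instance settings below are placeholders).

from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(
    image_uri="image_uri",  # custom image containing your specific dependencies
    command=["python3"],
    role="role_arn",
    instance_type="ml.t3.medium",
    instance_count=1,
    base_job_name="generate-datasets",
)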

We can then define the inputs and outputs of our processing job, as well as the Python code that will be used to combine the inputs and split them into the training and evaluation dataset outputs.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.dataset_definition.inputs import DatasetDefinition
from sagemaker.workflow.steps import ProcessingStep

def generate_training_and_evaluation_datasets():
    return ProcessingStep(
        name="generate-datasets",
        processor=sklearn_processor,
        inputs=[
            ProcessingInput(
                source="file_s3_uri",
                destination="location_on_container",
                input_name="file_input",
            ),
            ProcessingInput(
                input_name="athena_input",
                dataset_definition=DatasetDefinition(
                    local_path="location_on_container", 
                    athena_dataset_definition=athena_query_output_dataset(),
                ),
                destination="location_on_container",
            ),
            ProcessingInput(
                input_name="redshift_input",
                dataset_definition=DatasetDefinition(
                    local_path="location_on_container",
                    redshift_dataset_definition=redshift_query_output_dataset(),
                ),
                destination="location_on_container",
            ),
        ],
        outputs=[
            ProcessingOutput(
                source="location_on_container",
                destination="s3_uri",
                output_name="training_data",
            ),
            ProcessingOutput(
                source="location_on_container",
                destination="s3_uri",
                output_name="evaluation_data",
            ),
        ],
        code="generate_training_and_evaluation_datasets.py",
    )

The query outputs are generated as multiple files (compressed in the case of the Athena query) and without column headers. We added functionality to the Python script to combine all the outputs in a given location into a single pandas DataFrame, with column names matching the feature fields extracted in each query.
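
As a rough illustration of that combining step, here is a minimal sketch. The directory layout, delimiter and column names are assumptions and will depend on how your query results are written; pandas is left to infer the compression from the file extension.

import glob

import pandas as pd

def combine_query_outputs(directory, column_names, sep=","):
    # Read every part file written by the query and concatenate them into a
    # single DataFrame with the expected column names.
    part_files = sorted(glob.glob(f"{directory}/*"))
    frames = [
        pd.read_csv(path, header=None, names=column_names, sep=sep, compression="infer")
        for path in part_files
    ]
    return pd.concat(frames, ignore_index=True)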

The rest of the Python script joined the various DataFrames and split the combined data into the training and evaluation datasets.
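
That joining and splitting logic might look something like the sketch below. The join key, DataFrame names and split ratio are hypothetical; since the job runs in the SKLearnProcessor container, scikit-learn’s train_test_split is available.

from sklearn.model_selection import train_test_split

def build_datasets(customer_df, redshift_df, athena_df):
    # Join the three sources on a shared identifier (hypothetical key name),
    # then split the combined data into training and evaluation sets.
    combined = customer_df.merge(redshift_df, on="company_id").merge(
        athena_df, on="company_id"
    )
    return train_test_split(combined, test_size=0.2, random_state=42)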

Running the pipeline

All that remained was to build the pipeline itself to run the processing step.

from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig

def build_pipeline():
    return Pipeline(
        name="data-pipeline",
        pipeline_experiment_config=PipelineExperimentConfig(
            "experiment_name",
            "trial_name",
        ),
        steps=[generate_training_and_evaluation_datasets()],
    )
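
To actually run it, the pipeline definition can then be registered and started. A minimal sketch (the role ARN is a placeholder):

pipeline = build_pipeline()
pipeline.upsert(role_arn="role_arn")
execution = pipeline.start()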

Summary

We successfully implemented a SageMaker pipeline which automatically ran queries in Redshift and Athena and combined the query outputs with files located in S3 to generate training and evaluation datasets for our classification model. 

We now have the flexibility to add functionality to run the data generation pipeline and our model training pipeline back to back. 

The pipeline currently comprises only a single step and could arguably be replaced by a standalone processing job. However, we chose to use a SageMaker pipeline so we would have the flexibility to add other steps to our workflow in the future, such as data quality checks on the generated datasets.

As mentioned in the introduction, when we worked on this project we found it difficult to find documentation to help us build this pipeline. We hope this post will be useful to someone else; please do not hesitate to let us know if you have any comments or questions.