Micro-batching Event Data Into Amazon Redshift

Posted on March 25, 2019

Data is at the heart of our business. We use data to make business-critical decisions on a daily basis, so it is important that this data is not only accurate but also available when required. Traditionally, reports would be generated on a set schedule, which made it difficult to decide on next steps in a timely fashion. Newer technologies such as Amazon Kinesis Data Streams enable us to generate these reports in near real time.

In this blog post we describe the software system responsible for processing our event data and making it available for visualisations and reports in our business intelligence tool. We will also look at how AWS managed services can lower the burden on our operations team.

Architecture

Overview

To efficiently write event data into Redshift we have built a number of services, each with a very specific task:

  • Producer – This application receives event data via HTTP and writes the data to an Amazon Kinesis data stream using the Kinesis Producer Library (KPL).
  • Consumer – The consumer application is a stream processing service that turns events into smaller batches that can be imported independently.
  • Importer – The importer is responsible for copying batches of events from S3 into Redshift.

A fault in any one of these services must not impact the entire pipeline. A number of supporting AWS services and external services monitor the health of this system:

  • CloudWatch Logs – Our logs are stored in CloudWatch and can easily be accessed by our engineers.
  • CloudWatch Metrics – Application performance and overall health of the system is tracked using CloudWatch.
  • CloudWatch Alarms – These metrics are constantly monitored; a breached threshold or unusual variance triggers an appropriate action to make us aware of the change (e.g. a Slack or PagerDuty notification).
  • Rollbar – Application errors are all tracked in a central place using Rollbar.

Below is a diagram that outlines the architecture.

Architecture Overview Diagram

Receiving Events

The main entry point of this software system is our Producer. The producer is a Java application that runs a Vert.x web server to receive webhooks from both external and internal systems. Vert.x has shown strong performance in benchmarks such as the TechEmpower Framework Benchmarks.

Incoming events are first validated and then handed over to the Kinesis Producer Library (KPL) provided by Amazon Web Services (AWS). The KPL solves a number of the challenging problems that come with writing to data streams, and to Amazon Kinesis in particular (a short sketch of how events are handed to it follows the list below). Some of the benefits of using the KPL are:

  • Record aggregation for better performance and cost optimisations
  • Publishing of metrics for monitoring
  • Efficient batching of records with retry logic
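
To make this concrete, below is a minimal sketch (not our production code) of how a validated event might be handed from a Vert.x route to the KPL. The endpoint path, stream name, region and choice of partition key are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;

import io.vertx.core.Vertx;
import io.vertx.core.json.DecodeException;
import io.vertx.core.json.JsonObject;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.handler.BodyHandler;

public class ProducerServer {

    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();
        Router router = Router.router(vertx);
        router.route().handler(BodyHandler.create());

        // The KPL aggregates, batches and retries records in the background.
        KinesisProducer kinesis = new KinesisProducer(
                new KinesisProducerConfiguration().setRegion("eu-west-1"));

        // Hypothetical webhook endpoint; the real service exposes its own routes.
        router.post("/events").handler(ctx -> {
            final JsonObject event;
            try {
                event = ctx.getBodyAsJson();
            } catch (DecodeException e) {
                ctx.response().setStatusCode(400).end("malformed JSON");
                return;
            }
            if (event == null || !event.containsKey("type")) {
                ctx.response().setStatusCode(400).end("invalid event");
                return;
            }
            // Partitioning by event type is an illustrative choice only.
            // addUserRecord returns a future; production code would attach a
            // callback to track delivery failures.
            kinesis.addUserRecord(
                    "event-stream",          // assumed stream name
                    event.getString("type"), // partition key
                    ByteBuffer.wrap(event.encode().getBytes(StandardCharsets.UTF_8)));
            ctx.response().setStatusCode(202).end();
        });

        vertx.createHttpServer().requestHandler(router).listen(8080);
    }
}
```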

We decided to keep all our events in JSON format for readability and flexibility when it comes to publishing events. In the future we would like to explore some alternatives that could enforce a schema and perhaps reduce the size of our events to optimise for cost.

We monitor the request rate, error rate and latency as some of the key indicators for application performance.

Receiving Events Diagram

Generating Micro Batches

In the Consumer application we make use of another library offered by AWS, the Kinesis Client Library (KCL). Like the KPL, the KCL takes care of the complex and difficult tasks such as record deaggregation, metric publication and load balancing across workers when a stream has multiple shards.

It is worth noting that this application is also written in Java. While the KCL can be used from other languages, we found that the support and performance of using it directly from Java are far superior to going through the MultiLang Daemon.
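
As a rough outline, using the KCL directly from Java means implementing a record processor. The sketch below uses the KCL 1.x interfaces and simplifies heavily; the real consumer adds error handling and metrics, and the batching logic lives in its own class.

```java
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

// Simplified outline of a KCL record processor, one instance per shard lease.
public class EventRecordProcessor implements IRecordProcessor {

    @Override
    public void initialize(InitializationInput input) {
        // Called once when this worker takes a lease on a shard.
    }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        // The KCL has already deaggregated KPL records at this point.
        for (Record record : input.getRecords()) {
            String json = StandardCharsets.UTF_8.decode(record.getData()).toString();
            // Convert the JSON event into fact and dimension rows and buffer
            // them as CSV (see the flush/checkpoint sketch further down).
        }
    }

    @Override
    public void shutdown(ShutdownInput input) {
        // Flush any remaining buffers and checkpoint when the lease is lost
        // or the shard ends.
    }
}
```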

Events received from the Kinesis stream are converted into facts and dimensions that can easily be queried from Redshift. These facts and dimensions are buffered into individual CSV file chunks, as shown in the image below.

Generating Micro Batches Diagram

We write this data to S3 along with some metadata (the target table and the columns included in the file) on a set schedule or before the buffers get too big to fit in memory. We checkpoint our position in the stream when we have successfully flushed the buffers. This is important to avoid duplicate records in our data warehouse.
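
A condensed sketch of that flush step is shown below: buffered CSV rows are written to S3 with the target table and column list attached as object metadata, and the stream position is checkpointed only once every buffer has been flushed. The bucket name, key layout and metadata keys are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

public class MicroBatchFlusher {

    private static final String BUCKET = "event-micro-batches"; // assumed bucket name

    private final AmazonS3 s3;

    public MicroBatchFlusher(AmazonS3 s3) {
        this.s3 = s3;
    }

    /** Flush every buffered table to S3, then checkpoint so the batch is not reprocessed. */
    public void flushAndCheckpoint(Map<String, List<String>> csvRowsByTable,
                                   Map<String, String> columnsByTable,
                                   IRecordProcessorCheckpointer checkpointer) throws Exception {
        for (Map.Entry<String, List<String>> entry : csvRowsByTable.entrySet()) {
            String table = entry.getKey();
            byte[] csv = String.join("\n", entry.getValue()).getBytes(StandardCharsets.UTF_8);

            // The importer reads these metadata values to build the COPY command.
            ObjectMetadata meta = new ObjectMetadata();
            meta.setContentLength(csv.length);
            meta.addUserMetadata("target-table", table);                // assumed metadata key
            meta.addUserMetadata("columns", columnsByTable.get(table)); // assumed metadata key

            String key = "loads/" + table + "/" + UUID.randomUUID() + ".csv";
            s3.putObject(new PutObjectRequest(BUCKET, key, new ByteArrayInputStream(csv), meta));
        }
        // Checkpoint only after every buffer has landed in S3.
        checkpointer.checkpoint();
        csvRowsByTable.clear();
    }
}
```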

If you have used Kinesis Data Firehose before, you will notice that this concept sounds very similar, which raises the question: why have we written an application that does exactly that? One drawback we found is that a Firehose delivery stream can only target a single Redshift table. That makes it very inflexible for our use case, as we would like to support hundreds of event types in the future, all with very different payloads and therefore columns. Complex record transformations are also easier to deal with in a general-purpose language like Java.

At this point our event throughput is low enough that we don’t need to compress these CSV files. As our throughput increases we would like to explore the different compression algorithms which are supported by Redshift and see how this impacts copy performance and upload latency.

The health of the application can be monitored by checking the processing lag. Fortunately, the KCL publishes application-level metrics such as MillisBehindLatest, which give a good indication of whether the pipeline is slowing down or has stopped entirely.
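
As an illustration, a CloudWatch alarm on that metric could be created along the following lines. The namespace (the KCL publishes metrics under the consumer application's name by default), threshold, evaluation window and SNS topic are all assumptions.

```java
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class ConsumerLagAlarm {

    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        // Alarm if the consumer is more than five minutes behind the tip of the
        // stream for three consecutive minutes. All values are illustrative.
        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("event-consumer-falling-behind")
                .withNamespace("event-consumer") // assumed KCL application name
                .withMetricName("MillisBehindLatest")
                .withStatistic(Statistic.Maximum)
                .withPeriod(60)
                .withEvaluationPeriods(3)
                .withThreshold(300_000.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:eu-west-1:123456789012:pipeline-alerts")); // placeholder topic
    }
}
```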

Importing Data

Once a buffer is successfully written to S3, we use S3 bucket notifications in conjunction with SQS to trigger the final import of the data into our Redshift cluster. We have agreed on a convention where all data loads are written to the loads/ prefix, and all data unloads (using the UNLOAD command) are written to the unloads/ prefix.
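
For reference, the notification wiring can be set up roughly as follows; the bucket name, queue ARN and configuration name are placeholders, and in practice this kind of configuration may well live in infrastructure-as-code rather than application code.

```java
import java.util.EnumSet;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketNotificationConfiguration;
import com.amazonaws.services.s3.model.Filter;
import com.amazonaws.services.s3.model.QueueConfiguration;
import com.amazonaws.services.s3.model.S3Event;
import com.amazonaws.services.s3.model.S3KeyFilter;

public class LoadsBucketNotification {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Send a message to the import queue whenever an object is created under loads/.
        QueueConfiguration loadsToImporter = new QueueConfiguration(
                "arn:aws:sqs:eu-west-1:123456789012:redshift-import-queue", // placeholder ARN
                EnumSet.of(S3Event.ObjectCreated));
        loadsToImporter.setFilter(new Filter().withS3KeyFilter(new S3KeyFilter()
                .withFilterRules(S3KeyFilter.FilterRuleName.Prefix.newRule("loads/"))));

        BucketNotificationConfiguration config = new BucketNotificationConfiguration()
                .addConfiguration("loads-to-importer", loadsToImporter);

        s3.setBucketNotificationConfiguration("event-micro-batches", config); // placeholder bucket
    }
}
```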

The Importer continuously receives messages from the SQS queue with information about new files to be imported. The attached metadata is used to determine the correct Redshift table and which columns are included in the CSV file. The service is configured to run as exactly one task at a time on our ECS cluster. During application startup we perform any necessary schema changes, which are checked into our source control system and tested in a staging environment before they are applied in production. We also create regular backups of our data in Redshift to mitigate the impact of a cluster failure.

We issue a COPY command to load each batch, which is the recommended and most efficient way to import data into Redshift.
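
A stripped-down sketch of that step is shown below: the target table and column list come from the S3 object metadata, and a COPY statement is built and executed over JDBC. The connection details, IAM role and metadata-driven parameters are placeholders, and the real service adds validation around them as well as the retry behaviour described below.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchImporter {

    // Placeholder connection details and IAM role.
    private static final String JDBC_URL =
            "jdbc:redshift://example-cluster.eu-west-1.redshift.amazonaws.com:5439/analytics";
    private static final String COPY_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role";

    /** Import one micro-batch, identified by its S3 location and the metadata attached to it. */
    public void importBatch(String bucket, String key,
                            String targetTable, String columnList) throws Exception {
        // e.g. targetTable = "fact_invoice_created", columnList = "event_id,occurred_at,amount"
        // (a GZIP option could be added here once the CSV files are compressed).
        String copy = String.format(
                "COPY %s (%s) FROM 's3://%s/%s' IAM_ROLE '%s' FORMAT AS CSV TIMEFORMAT 'auto'",
                targetTable, columnList, bucket, key, COPY_ROLE);

        try (Connection conn = DriverManager.getConnection(JDBC_URL, "importer", "secret");
             Statement stmt = conn.createStatement()) {
            stmt.execute(copy);
        }
    }
}
```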

Importing Data Diagram

On the rare occasion that an import fails, due to intermittent connectivity issues or during cluster maintenance, we use an exponential backoff strategy in our application to retry the failed attempt. Our queue is configured to send any message that couldn't be processed successfully after a certain number of attempts to a dead letter queue, which is monitored via CloudWatch. This prevents a corrupt file from blocking the pipeline, and thanks to the configured monitoring we can investigate failed imports without losing the message.
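
In essence, the retry logic is a bounded exponential backoff around the import attempt, along the lines of the sketch below; the delays and attempt limit are illustrative values, and the SQS redrive policy takes care of moving a message to the dead letter queue once it has been received too many times.

```java
public final class Retry {

    /** Run an import attempt, doubling the wait between retries until giving up. */
    public static void withBackoff(ThrowingRunnable attempt) throws Exception {
        int maxAttempts = 5;      // illustrative limit
        long delayMillis = 1_000; // first retry after one second
        for (int i = 1; ; i++) {
            try {
                attempt.run();
                return;
            } catch (Exception e) {
                if (i == maxAttempts) {
                    throw e; // give up; SQS will eventually redrive the message to the DLQ
                }
                Thread.sleep(delayMillis);
                delayMillis *= 2; // exponential backoff
            }
        }
    }

    @FunctionalInterface
    public interface ThrowingRunnable {
        void run() throws Exception;
    }
}
```

An import call could then be wrapped as, for example, `Retry.withBackoff(() -> importer.importBatch(bucket, key, table, columns));` (both classes here are hypothetical helpers, not our actual code).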

Visualisations & Reports

With the data available in Redshift, we can start to build models to represent this information. We use Looker to generate rich visualisations and reports from our event data. These reports can be generated and retrieved in a self-service fashion, which has been very positively received by the rest of the business.

We will publish a blog post in the near future about how we use Looker to generate visualisations and reports for our business needs. Watch this space!

Summary

The solution described above allows us to efficiently move event data into our data warehouse with minimal latency (the time between an event being published and it becoming available in Redshift). With at most one minute of latency we can generate rich visualisations and reports based on live data. The solution can also cope with most faults that we can anticipate at this point without losing any data, and the use of managed services, where appropriate, keeps the maintenance burden of this system low.


If these sorts of challenges interest you and you are passionate about building data-centric applications then check out our careers page. We are hiring into our Data Platform team.

Surviving your first winter as a remote worker

Posted on March 22, 2019

Many of FreeAgent’s engineering team work remotely, spread all around the UK, generally visiting our Edinburgh office once a quarter.

FreeAgent Remote Engineer map

(FreeAgent’s remote engineering team – blue arrows show full-time staff, yellow show contractors.)

I joined FreeAgent in April 2018 and this was my first experience of being a remote worker. Even though I’d sometimes been able to do one day a week working from home in previous roles, the reality of full-time remote was quite different from what I expected.

Being a remote worker is amazing. I certainly don't miss the four hours a day I used to spend commuting to London and back, but some aspects of being in a physical office alongside colleagues were hard to leave behind, and I felt like I needed to replicate them in order to be happy and productive. I assumed that because I had a brilliant (company-supplied) office chair, a decent working space and a desk with great IT kit, this was all I needed. Wrong! There were some surprising other things that I gradually discovered and experimented with, some of which have really made a difference to how I work, to my mood and doubtless to my wellbeing too.

Now that we are (finally!) entering spring, I've taken some time to reflect on the tactics that helped me survive, and thrive, during my first winter as a remote worker.

Eat healthily during the day

My activity level initially dropped more than I expected when I became a remote worker. Working in a big office meant walking up flights of stairs and a couple of 10-minute walks to and from the station, or from car or motorbike parking, and all of that stopped when I started working from home. After three months, I had definitely gained a few pounds! I had more time, but I didn't want to spend much of it preparing healthy meals, which I assumed would be boring. I soon learned that lower-carb, lower-sugar eating helped me feel more positive, especially at lunchtime.

Here is one lunch that takes little preparation time, yet tastes great.

Salmon with roasted vegetables

  • Heat oven to 180C (fan)
  • Place a salmon fillet on a sheet of baking foil that will be large enough to wrap around the salmon and make a parcel with a bit of air space
  • Pour over a couple of tablespoons of lemon juice and a sprinkling of mixed herbs
  • Wrap silver foil around the salmon to loosely seal it, and place on oven-safe tray or dish
  • Cover a baking sheet with silver foil
  • Place a few shoots of asparagus or tenderstem broccoli (or both) on the silver foil
  • Add a couple of sprays of low-calorie olive oil or other oil
  • Add some lazy garlic or some chilli seeds for optional extra excitement!

Place salmon in oven for 6 mins, then add the vegetables and cook for a further 12 mins (so 18 mins in total). Done!

Salmon and roasted vegetables

I tend to pick up fruit and vegetables 2-3 times a week from my local supermarket, or the town market (on the day it runs). This is a chance to exercise, go outside, and chat to real people face-to-face rather than via Google Meet!

We also have a Slack channel at FreeAgent that discusses food and healthy eating, so there are colleagues to discuss menu options, recipes and suggestions with. It’s certainly added to my recipe collection. One of my colleagues even runs his own recipe blog – take a look!

Have your workspace at a comfortable temperature

Not so difficult in the summer months, but I found that in winter the room I use as an office took ages to get properly warm, even with domestic central heating. Your working space may need to be warmer during the working day than the rest of your home – getting cold while working at a desk doesn't feel productive, it feels miserable! The answer? A basic heater with a thermostat. This lets you get the room up to 20-21 degrees (or your temperature of choice) and keep it there.

Heater control panel

Have the right lighting in your workspace

This was something several of my colleagues helped with, via our internal #remote Slack channel. As autumn evenings started drawing in, I noticed I could do with more decent light in my home office space.

I use a LIFX light as my main light and I generally have it set to a daylight colour temperature, which my eyes and mood prefer. The LIFX has a massive range of adjustments for colour temperature and brightness, as well as special effects. Being able to control it via Siri is hardly essential, but it does provide some entertainment! Occasionally there are special offers, either from LIFX directly or via Amazon.

Colour temperature setting

I use an LED desk light which, again, has selectable colour temperature and is dimmable.

Some folks have put strip lights behind their monitors for bias lighting but I haven’t felt the need to do this. The combination of my desk light and decent main light has been great so far.

I love having good music around me (I'm a musician and former sound engineer), and since I'm not in a shared office, I don't have to confine myself to headphone listening. The major features of my office audio setup are:

  • Yamaha HS7 speakers – active studio monitors. Each speaker has two separate amplifiers built in, one for the bass speaker and one for the treble. The philosophy is that you hear exactly what the recording engineer intended.
  • Speakers are fed by an AudioQuest DragonFly USB Digital-to-Analog converter.
  • The majority of my music comes from Spotify at present, but I’m looking to trial one of the higher-resolution streaming services, namely Qobuz Studio or Tidal during the coming months.

All in all, this is what it adds up to:

Home desk setup

Oh, by the way… having your chair set up properly and your monitor at the right height is also amazingly important. More luxurious brands of monitor stand are available!

Not only is this a comfortable space for me to work from, it gets a thumbs-up from the cat also!

Cat snoozing on my desk

Include regular exercise in your day

I try to go on at least two walks every working day, to keep active and avoid “never going out of the house” syndrome. I’ve found two things helpful in keeping me motivated to do this.

Firstly I have a smartwatch that tracks my exercise. I find it encouraging to see I’ve done at least 30 minutes of exercise a day. Some folks at FreeAgent buddy up around exercise, even the remote staff. At the moment, my buddy for exercise is my watch, but perhaps later in the year I’ll branch out here.

Secondly, having a decent coffee shop almost 15 mins walk away is a great excuse for some exercise! Dual incentive, though I must try to avoid the amazing cakes!

my local coffee shop

Everything above has come about because of my new role at FreeAgent, the opportunity to work remotely, and the amazing colleagues who are not only interested in cutting code and developing product features but are also happy to share ideas and discuss suggestions to make your whole life, not just your work life, happier and healthier.

If this sounds like the sort of work/life balance that you’d love to have, I should probably point out that we’re hiring!

Separating job applicants in multiple dimensions

Posted on March 19, 2019

The team I work in at FreeAgent is achieving great things – from rolling out a new business intelligence tool, to working on machine learning models to improve our product. With so many ideas but not enough time to action them, we recently advertised a number of roles to expand our team.

FreeAgent is a superb place to work, and the roles are a real opportunity for someone to achieve a lot personally and professionally, so I probably should not have been surprised by the volume of applications we received – but I was! We had applicants from all around the world, and the vast majority of them had a compelling set of skills and experience on their CV that made them viable candidates – but I knew that recruiting the right candidate was incredibly important.

Estimates vary as to how much a poor hire can cost a business in financial terms, but it is clear that bad hires have an adverse impact on finances, as well as negative repercussions for employee morale, productivity and engagement. They also carry a greater risk of causing reputational damage to the business, so I was determined to get it right.

Having written the job spec, I knew what the role entailed and which skills would be needed. But I still found it hard to assess many candidates: one might have strong analysis skills on their CV but less experience of stakeholder management, while another might have less analytical experience but appear very technically proficient. In short, they all had skills that fitted aspects of the role, so how could I fairly and rationally progress some applications but not others?

After a few sessions of trying to review applicants, I hit upon a method that seemed to work: I chose the two most important aspects of the role and plotted them on axes to make a quadrant chart. I could then mark areas on the chart that would indicate suitability for the role.

An example of the recruitment quadrant

Thinking about where each candidate fell on these axes helped me to assess the applications in a timely manner and, more importantly, to progress the applicants with the best chance of succeeding in the role. If a candidate was hard to place, or fell outside the progress/do-not-progress zones, the chart was a useful tool to frame the discussion. It definitely helped to clarify things during the review process, and it's an approach I will consider using again the next time I am bogged down in CVs.