Fine-tuning a DistilBERT classifier with numerical and text inputs
Text classification is often done by fine-tuning a pretrained foundation model on domain-specific data. At FreeAgent we use transformer-based models to automatically classify incoming bank transactions. Specifically, we use a DistilBERT model fine-tuned on hundreds of millions of bank transactions with customer-labelled accounting categories.
The model inputs are currently text-based, built from a combination of bank transaction descriptions and amounts.
In this post we describe an approach to fine-tuning the DistilBERT model and training the classifier, including the numerical amount feature, as a single network. Continue reading
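As a rough illustration of what a text-based input built from a description and an amount could look like, here is a minimal sketch. The function name `build_model_input` and the `| amount:` layout are hypothetical choices made for this example; the excerpt does not say how FreeAgent actually formats its inputs.

```python
def build_model_input(description: str, amount: float) -> str:
    """Combine a transaction description and amount into one text input.

    Hypothetical helper: the separator and number format are assumptions,
    not the format used in the original post.
    """
    # Collapse runs of whitespace so the tokenizer sees a tidy description.
    clean_description = " ".join(description.split())
    # Render the amount with an explicit sign and two decimal places.
    return f"{clean_description} | amount: {amount:+.2f}"


if __name__ == "__main__":
    print(build_model_input("  CARD PAYMENT   TESCO  ", -12.5))
```

A string built this way would then be tokenized like any other text, which is the kind of text-only setup the post contrasts with feeding the amount in as a separate numerical feature.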
Structured outputs with Pydantic AI
One of the challenges of working with LLMs is getting them to respond with a consistent format, such as a given JSON schema. Anyone who has tried to solve this issue with prompt engineering knows how frustrating it can be. You add a ‘MUST’ here and an ‘always return JSON’ there, but still the output doesn't reliably parse. Maybe you're about to add a try-except block to handle parsing errors… Continue reading
How we Use Dagster Automations in our Data Pipeline
The heart of a reliable data platform is its robust, automated data pipelines. As our team migrates our data pipelines to Dagster, re-architecting our automation logic is a crucial task. Dagster offers condition-based approaches to creating or updating a data asset (table or file), moving us toward a modern, asset-centric view of data. This post details the three primary automation strategies we considered and implemented (so far) in Dagster: Schedules,… Continue reading
Streamlining DBT Macro Testing: A Unit Test Approach with Pytest and Jinja
Data Engineering at FreeAgent has a mission to ensure our colleagues and customers have reliable, accurate, and secure access to the data they need. Our migration to Dagster, DBT, and DLT is a key part of that. However, it has raised numerous questions, including how we test DBT Macros. This post dives into how we're tackling this by leveraging pytest and Jinja for more efficient unit testing of DBT… Continue reading
Creating re-usable descriptions in dbt with Jinja docs
If you're working with dbt and find yourself copying the same column descriptions across multiple models, this post is for you. We'll show you how to eliminate that repetition using the Jinja doc function! Continue reading
Decoding Data Orchestration Tools: Comparing Prefect, Dagster, Airflow, and Mage
Data is exploding, and so are the tools to manage it. From generating and collecting to cleaning and analyzing, these tools help create valuable products for customers and give stakeholders decisive insights. As Data Engineering at FreeAgent continues to evolve, we're focusing on providing more reliability and quality in our data products. For data pipeline building, we've started to move from a no-code approach toward a software engineering focused… Continue reading
The 5 rules for migrating data pipelines successfully
This post covers the 5 essential rules for navigating a large-scale data tooling transition smoothly and with minimal disruption. Continue reading
Introducing Analytics Engineering
Over the last few years we’ve evolved the way our analytics team works to enable easy access to accurate and reliable data for faster, better decision-making. Recently we made one more change—our Business Intelligence Analysts are now Analytics Engineers! Continue reading
Combining text with numerical and categorical features for classification
Classification with transformer models: A common approach for classification tasks with text data is to fine-tune a pre-trained transformer model on domain-specific data. At FreeAgent we apply this approach to automatically categorise bank transactions, using raw inputs that are a mixture of text, numerical and categorical data types. The current approach is to concatenate the input features for each transaction into a single string before passing it to the model. For… Continue reading
Restructuring our analytics team
In late 2022, we restructured our analytics team by aligning each analyst to a different area of the business. In this blog post I’ll talk about what we changed, why we changed it, and how we feel the changes have gone so far. If you’ve been through a similar process (or even the opposite process!), are considering it, or have any other thoughts, we’d love to chat! Please drop us… Continue reading