Text classification is often done by fine-tuning a pretrained foundation model on domain-specific data. At FreeAgent we use transformer-based models to automatically classify incoming bank transactions. Specifically, we use a DistilBERT model fine-tuned on hundreds of millions of bank transactions with customer-labelled accounting categories.
The model inputs are currently text-based, built from a combination of bank transaction descriptions and amounts.
For example, we would use:
| Name | Amount |
| --- | --- |
| TESCO PAY AT PUMP 1234 GB | £-96.51 |
The description and amount are combined into a single string, TESCO PAY AT PUMP 1234 GB [SEP] -96.51, separated by a dedicated [SEP] token. Using the amount as a string is a simple approach that works well in practice. However, we are curious whether adding the amount as an additional numeric feature, after encoding the text, could lead to performance benefits.
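As a quick illustration, the combined input string can be built as follows (a minimal sketch, not our production pipeline code):

# Illustrative only: build the combined text input from a description and amount
description = "TESCO PAY AT PUMP 1234 GB"
amount = -96.51
model_input = f"{description} [SEP] {amount}"
# -> "TESCO PAY AT PUMP 1234 GB [SEP] -96.51"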
We have discussed a similar idea in a previous blog post, where we trained a separate LightGBM classifier that took the vector representation from the transformer model and the numerical feature as inputs. In this post we describe an approach in which the DistilBERT model and a classification head that includes the numerical amount feature are fine-tuned together as a single network.
Kaggle wine reviews dataset
In this post we experiment with the Kaggle wine reviews dataset, which contains text, categorical and numerical features for just under 130,000 wines reviewed on Wine Enthusiast. The wine points are binned to create the target rating feature. Points can range from 1 to 100; however, this dataset only contains wines with points above 80. We categorised these into three groups: 80–86 as neutral, 87–93 as good, and 94–100 as excellent. For simplicity, we limit our input to two features: the description (text) and the price (numerical).
import pandas as pd
import numpy as np

wine_df = pd.read_csv("archive/winemag-data-130k-v2.csv", index_col=0)

bins = [0, 87, 94, 100]
TARGET_CATEGORIES = ["neutral", "good", "excellent"]
wine_df["rating"] = pd.cut(wine_df["points"], bins, labels=TARGET_CATEGORIES)

df = pd.DataFrame({
    "description": wine_df['description'],
    "rating": wine_df['rating'],
    "price": wine_df['price'],
})
Preprocessing
Now that the dataframe is set up, let’s discuss the preprocessing step to get our data ready for training.
The initial preprocessing involves converting the categorical rating classes into numerical labels, dropping null rows and creating training, validation and test sets.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
le = LabelEncoder().fit(TARGET_CATEGORIES)
df["labels"] = le.transform(df['rating'])
df = df.dropna(subset=["description", "labels", "price"], ignore_index=True)
train_val_df, test_df = train_test_split(df, test_size=0.1, random_state=123)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, random_state=123)
test_df.to_csv("test.csv", index=False)
Rescaling numerical data
We rescale the price feature using the StandardScaler from scikit-learn. The scaler is fitted on the training data and then applied to both the training and validation data; it is also saved for use on the test set when evaluating the model.
from sklearn.preprocessing import StandardScaler
import pickle

def scale_numeric_feature(series, scaler):
    return scaler.transform(series)

scaler = StandardScaler()
scaler.fit(train_df[["price"]])

# save the fitted scaler for reuse at inference time (the scaler/ directory must already exist)
with open("scaler/scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

train_df["price"] = scale_numeric_feature(train_df[["price"]], scaler)
val_df["price"] = scale_numeric_feature(val_df[["price"]], scaler)
The model code, detailed in the sections below, expects any features used in addition to the text data to be stored under the column named additional_features.
Since we are only utilising one numeric feature, it must be contained within a list to be properly processed during the model’s forward pass. For users incorporating multiple additional features, ensure these are formatted as a 1D array (e.g., [feature_1, feature_2, ..., feature_N]) for every data point.
def add_additional_features(df, feature_columns):
    features = []
    for col in feature_columns:
        features.append(df[col].values)
    df["additional_features"] = list(np.column_stack(features))
    df = df.drop(columns=feature_columns)
    return df

train_df = add_additional_features(train_df, ["price"])
val_df = add_additional_features(val_df, ["price"])
Tokenizing the description inputs
For this experiment we use DistilBERT as our base model and aim to train this using the Trainer class from the transformers library. We tokenise the description, returning a Dataset object with the following attributes: ['labels', 'additional_features', 'input_ids', 'attention_mask']
from transformers import AutoTokenizer
from datasets import Dataset

huggingface_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
batch_size = 128

def tokenize_inputs(df, tokenizer):
    dataset = Dataset.from_pandas(df)
    tokenized_inputs = dataset.map(
        lambda x: tokenizer(x["description"], padding=True, truncation=True),
        batched=True,
        batch_size=batch_size,
    )
    tokenized_dataset = tokenized_inputs.remove_columns(["description"])
    return tokenized_dataset

train_tokenized_dataset = tokenize_inputs(train_df, tokenizer)
val_tokenized_dataset = tokenize_inputs(val_df, tokenizer)
Modifying the model and classification head
The classification head and sequence classification pipeline need some small modifications to be able to train with our additional features. The code listed in this section is adapted from this Google Colab notebook.
In the first instance we define a new ClassificationHead class with updated dimensionality. The classification head takes as input the configuration of the backbone transformer model, including the number of extra dimensions contributed by the additional features. In our case there is a single extra dimension, giving a linear layer size of 769 instead of DistilBERT's hidden size of 768.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        num_extra_dims = config.num_extra_dims
        total_dims = config.hidden_size + num_extra_dims
        self.dense = nn.Linear(total_dims, total_dims)
        self.dropout = nn.Dropout(config.dropout)
        self.out_proj = nn.Linear(total_dims, config.num_labels)

    def forward(self, features, **kwargs):
        x = self.dropout(features)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
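As a quick check of the dimensionality described above, the head can be instantiated with a DistilBERT configuration extended with one extra dimension. This is an illustrative sketch, not part of the training code:

from transformers import AutoConfig

# one extra numeric feature gives an input size of 768 + 1 = 769
config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=3)
config.num_extra_dims = 1
head = ClassificationHead(config)
print(head.dense)  # Linear(in_features=769, out_features=769, bias=True)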
Next, a custom sequence classification pipeline is defined as a subclass of DistilBertForSequenceClassification. The key change is in the forward method: the text features pass through the backbone transformer model to produce the [CLS] embedding, and the additional features are concatenated to it before classification. For example, if we start with a [CLS] embedding of size 768 and add two numeric features, they are combined to form a single array of length 770. Categorical cross-entropy loss is used, as we are predicting a single label for each record.
from transformers import DistilBertModel, DistilBertForSequenceClassification
from transformers.modeling_outputs import SequenceClassifierOutput

class CustomForSequenceClassification(DistilBertForSequenceClassification):
    def __init__(self, config, num_extra_dims):
        super().__init__(config)
        self.num_labels = config.num_labels
        config.num_extra_dims = num_extra_dims
        config.task = "text-classification"
        self.config = config
        # Add the DistilBertModel with the classifier
        self.distilbert = DistilBertModel(config)
        self.classifier = ClassificationHead(config)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: torch.LongTensor | None = None,
        attention_mask: torch.FloatTensor | None = None,
        additional_features: torch.FloatTensor | None = None,
        token_type_ids: torch.LongTensor | None = None,
        position_ids: torch.LongTensor | None = None,
        head_mask: torch.FloatTensor | None = None,
        inputs_embeds: torch.FloatTensor | None = None,
        labels: torch.LongTensor | None = None,
        output_attentions: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
    ) -> tuple | SequenceClassifierOutput:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.distilbert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # sequence_output will be (batch_size, seq_length, hidden_size)
        sequence_output = outputs[0]
        # additional data should be (batch_size, num_extra_dims)
        cls_embedding = sequence_output[:, 0, :]
        # add the additional features to the output of the DistilBERT
        output = torch.cat((cls_embedding, additional_features), dim=-1)
        logits = self.classifier(output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
The modified architecture processes two types of input. Specifically, the text feature is first transformed into the [CLS] embedding via the DistilBERT model. This embedding is then concatenated with the numerical feature, and the resulting vector is used as the input for the classifier. This overall flow is visually represented in the diagram below.
![Architecture diagram showing the two paths of the inputs: a wine description passes through DistilBERT to create a 768-dimensional [CLS] embedding, while a price passes through as a 1-dimensional numerical feature. Both are then concatenated to form a 769-dimensional vector that feeds into the final Classifier head.](https://engineering.freeagent.com/wp-content/uploads/2026/04/diagram_numerical_distilbert_blogpost-2-724x1024.jpg)
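As a quick sanity check of the shapes involved in this concatenation (illustrative only, using random tensors in place of real embeddings and prices):

import torch

cls_embedding = torch.randn(4, 768)       # a batch of four [CLS] embeddings
additional_features = torch.randn(4, 1)   # one scaled numeric feature per example
combined = torch.cat((cls_embedding, additional_features), dim=-1)
print(combined.shape)  # torch.Size([4, 769])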
Training the modified model
The above changes were made in this way to keep the model compatible with the transformers Trainer class. We use the from_pretrained method of the custom sequence classification pipeline to define the model, which is then passed into the Trainer with the training and evaluation datasets.
from transformers import AutoTokenizer

huggingface_model = "distilbert-base-uncased"
model = CustomForSequenceClassification.from_pretrained(
    huggingface_model,
    num_labels=3,
    num_extra_dims=1,
)
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(num_train_epochs=1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=val_tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
eval_result = trainer.evaluate(eval_dataset=val_tokenized_dataset)

model_dir = "custom_model"
trainer.save_model(model_dir)
The model was trained for one epoch and then evaluated on the test set, achieving an accuracy of 85%. For reference, a simple majority vote prediction has an accuracy of 59%.
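For completeness, the majority-class baseline can be computed directly from the label distribution of the test split; a quick sketch, assuming the test_df created earlier is still available:

# accuracy of always predicting the most frequent class
majority_baseline = test_df["labels"].value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")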
Inference on the trained model
To get predictions from our trained model we again modify a class from the transformers package. We define a CustomTextClassificationPipeline which inherits from the TextClassificationPipeline class. Within this class we modify the preprocess method to make sure that only the text inputs are tokenised and that the additional features are added as tensors to the tokenised inputs.
from transformers import TextClassificationPipeline

class CustomTextClassificationPipeline(TextClassificationPipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def preprocess(self, inputs, **tokenizer_kwargs):
        '''We overwrite the preprocess method of the TextClassificationPipeline to include
        the additional features after tokenization of the text inputs.'''
        # Call the original preprocess method to get the tokenized inputs
        tokenized_inputs = super().preprocess(inputs["description"], **tokenizer_kwargs)
        # Put the additional features back into the tokenized inputs
        tokenized_inputs["additional_features"] = torch.tensor([inputs["additional_features"]])
        return tokenized_inputs
During inference the trained model is loaded and used to instantiate a transformers pipeline object, specifying the CustomTextClassificationPipeline defined above. Predictions are made by calling this pipeline on new data.
from transformers import pipeline
from transformers import AutoConfig, AutoTokenizer
import os
import pickle

model_dir = "custom_model"
config = AutoConfig.from_pretrained(model_dir)
model = CustomForSequenceClassification.from_pretrained(
    model_dir,
    config=config,
    num_extra_dims=config.num_extra_dims,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

with open("scaler/scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

test_df = pd.read_csv("test.csv")
test_df["price"] = scale_numeric_feature(test_df[["price"]], scaler)
preprocessed_df = add_additional_features(test_df, ["price"])
inference_ds = Dataset.from_pandas(preprocessed_df[["additional_features", "description"]])

task = "text-classification"
# Recommended to set the task as an env variable "HF_TASK"
os.environ["HF_TASK"] = task

hf_pipeline = pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    pipeline_class=CustomTextClassificationPipeline,
    top_k=None,
    batch_size=batch_size,
)

label_scores = hf_pipeline(
    list(inference_ds),
    padding=True,
    truncation=True,
)
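For each input the pipeline returns a list of label/score dictionaries. A minimal sketch of turning these into predicted class ids and a test-set accuracy, assuming the default LABEL_<id> names (which correspond to the LabelEncoder ids used during preprocessing):

# take the highest-scoring label per example and map "LABEL_<id>" back to an integer id
predicted_ids = np.array([
    int(max(scores, key=lambda s: s["score"])["label"].split("_")[-1])
    for scores in label_scores
])
test_accuracy = (predicted_ids == preprocessed_df["labels"].values).mean()
print(f"Test accuracy: {test_accuracy:.2f}")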
Model Comparison
We compare our method with the fully text-based transformer model. Here we use the AutoModelForSequenceClassification and AutoTokenizer classes from the transformers library, with the same DistilBERT base model, to predict the neutral, good and excellent wine classes.
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
huggingface_model = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    huggingface_model,
    num_labels=3,
)
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
We combine the features into one string input using both the description and the wine price, separated by the special [SEP] token. For preprocessing we use the same steps as discussed above, apart from the rescaling of the numerical feature, as the price is now treated as text and thus included in the tokenization.
The text-based model is trained and evaluated using the same dataset and hyperparameters. It also achieves an accuracy of 85%; the two methods gave the same accuracy across multiple runs, with similar evaluation loss values. We trained both models for a single epoch on a training set of about 104,000 samples. Training the numerical model was slightly faster than training the text-based model, but only by a matter of minutes in this particular experiment.
df = pd.DataFrame({
    "description": wine_df['description'] + " [SEP] " + wine_df['price'].astype('str'),
    "labels": wine_df['rating']
})
TARGET_CATEGORIES = ["neutral", "good", "excellent"]
le = LabelEncoder().fit(TARGET_CATEGORIES)
df["labels"] = le.transform(df['labels'])
df = df.dropna(subset=["description", "labels"]).copy()
train_val_df, test_df = train_test_split(df, test_size=0.1, random_state=123)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, random_state=123)
train_tokenized_dataset = tokenize_inputs(train_df, tokenizer)
val_tokenized_dataset = tokenize_inputs(val_df, tokenizer)
test_tokenized_dataset = tokenize_inputs(test_df, tokenizer)
training_args = TrainingArguments(num_train_epochs=1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=val_tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
eval_result = trainer.evaluate(eval_dataset=val_tokenized_dataset)
task = "text-classification"
# Recommended to set the task as an env variable "HF_TASK"
os.environ["HF_TASK"] = task

hf_pipeline = pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    top_k=1,
    batch_size=batch_size,
)

label_scores = hf_pipeline(
    list(test_df["description"]),
    padding=True,
    truncation=True,
)
Conclusion
We used two methods to train a transformer model to predict wine categories from text and numeric inputs. In this example, training with the price as a text feature and including it as a numeric feature in the network gave an equivalent categorisation accuracy of 85%. Both outperform the separate LightGBM classifier approach from our previous blog post, which achieved an accuracy of 82%.
So when might it be worth adding additional non-text features to the network? Adding the features as text increases the token length of the inputs, which can in turn increase latency if the maximum token length has to be raised, bearing in mind that DistilBERT has a maximum token length of 512. Training time could also be a factor: in our experiment the numerical model was slightly faster to train than the text-based model. Another consideration is whether the model should see the exact amount at all, as this may encourage overfitting or memorisation of exact amounts, leading to poorer generalisation.
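To illustrate the token-length point, one could compare tokenised lengths with and without the amount appended; a quick sketch using the same DistilBERT tokenizer (numeric strings are typically split into several sub-word tokens):

text_only = "TESCO PAY AT PUMP 1234 GB"
with_amount = text_only + " [SEP] -96.51"
print(len(tokenizer(text_only)["input_ids"]))    # tokens for the description alone
print(len(tokenizer(with_amount)["input_ids"]))  # extra tokens added by the [SEP] and the amount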
While there are many sources on training models that combine text and numerical inputs, we found little documentation on inference and on loading the saved model back from a directory so that it can be integrated seamlessly into a pipeline. We hope this post will be useful to someone else, and please do not hesitate to let us know of any comments or questions.
Useful links
- Read about machine learning-driven transaction categorisation in FreeAgent in this blog post
- Blog on transformer model fine-tuning: Fine-tuning Bert for multiclass classification
- The Google Colab notebook that inspired this approach, used for the CustomForSequenceClassification and the ClassificationHead
- YouTube video by Chris McCormick on Mixing BERT with Categorical and Numerical Features