Text classification is often done by fine-tuning a pretrained foundation model on domain-specific data. At FreeAgent we use transformer-based models to automatically classify incoming bank transactions. Specifically, we use a DistilBERT model fine-tuned on hundreds of millions of bank transactions with customer-labelled accounting categories.
The model inputs are currently text-based, built from a combination of bank transaction descriptions and amounts.
For example, we would use:
| Name | Amount |
| --- | --- |
| TESCO PAY AT PUMP 1234 GB | £-96.51 |
The description and amount are combined into a single string, TESCO PAY AT PUMP 1234 GB [SEP] -96.51, separated by a dedicated [SEP] token. Using the amount as a string is a simple approach that works well in practice. However, we are curious whether adding the amount as an additional numeric feature, after encoding the text, could lead to performance benefits.
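As a quick illustration, the combined input string can be built as follows (a minimal sketch, not our production pipeline code):

# Illustrative only: build the combined text input from a description and amount
description = "TESCO PAY AT PUMP 1234 GB"
amount = -96.51
model_input = f"{description} [SEP] {amount}"
# -> "TESCO PAY AT PUMP 1234 GB [SEP] -96.51"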
We have discussed a similar idea in a previous blog post, where we trained a separate LightGBM classifier that took the vector representation from the transformer model and the numerical feature as inputs. In this post we describe an approach in which the DistilBERT model and a classification head that includes the numerical amount feature are fine-tuned together as a single network.
Kaggle wine reviews dataset
In this post we experiment with the Kaggle wine reviews dataset, which contains text, categorical and numerical features for just under 130,000 wines reviewed on Wine Enthusiast. The wine points are binned to create the target rating feature. Points can range from 1 to 100; however, this dataset only contains wines with points above 80. We categorised these into three groups: 80–86 as neutral, 87–93 as good, and 94–100 as excellent. For simplicity, we limit our input to two features: the description (text) and the price (numerical).
import pandas as pd
import numpy as np

wine_df = pd.read_csv("archive/winemag-data-130k-v2.csv", index_col=0)

bins = [0, 87, 94, 100]
TARGET_CATEGORIES = ["neutral", "good", "excellent"]
wine_df["rating"] = pd.cut(wine_df["points"], bins, labels=TARGET_CATEGORIES)

df = pd.DataFrame({
    "description": wine_df['description'],
    "rating": wine_df['rating'],
    "price": wine_df['price'],
})
Preprocessing
Now that the dataframe is set up, let’s discuss the preprocessing step to get our data ready for training.
The initial preprocessing involves converting the categorical rating classes into numerical labels, dropping null rows and creating training, validation and test sets.
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
le = LabelEncoder().fit(TARGET_CATEGORIES)
df["labels"] = le.transform(df['rating'])
df = df.dropna(subset=["description", "labels", "price"], ignore_index=True)
train_val_df, test_df = train_test_split(df, test_size=0.1, random_state=123)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, random_state=123)
test_df.to_csv("test.csv", index=False)
Rescaling numerical data
We rescale the price feature using the StandardScaler from scikit-learn. The scaler is fitted on the training data and then applied to both the training and validation data; it is also saved for use on the test set when evaluating the model.
from sklearn.preprocessing import StandardScaler
import pickle

def scale_numeric_feature(series, scaler):
    return scaler.transform(series)

scaler = StandardScaler()
scaler.fit(train_df[["price"]])

# save the fitted scaler for reuse at inference time (the scaler/ directory must already exist)
with open("scaler/scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

train_df["price"] = scale_numeric_feature(train_df[["price"]], scaler)
val_df["price"] = scale_numeric_feature(val_df[["price"]], scaler)
The model code, detailed in the sections below, expects any features used in addition to the text data to be stored under the column named additional_features.
Since we are only utilising one numeric feature, it must be contained within a list to be properly processed during the model’s forward pass. For users incorporating multiple additional features, ensure these are formatted as a 1D array (e.g., [feature_1, feature_2, ..., feature_N]) for every data point.
def add_additional_features(df, feature_columns):
    features = []
    for col in feature_columns:
        features.append(df[col].values)
    df["additional_features"] = list(np.column_stack(features))
    df = df.drop(columns=feature_columns)
    return df

train_df = add_additional_features(train_df, ["price"])
val_df = add_additional_features(val_df, ["price"])
Tokenizing the description inputs
For this experiment we use DistilBERT as our base model and aim to train this using the Trainer class from the transformers library. We tokenise the description, returning a Dataset object with the following attributes: ['labels', 'additional_features', 'input_ids', 'attention_mask']
from transformers import AutoTokenizer
from datasets import Dataset

huggingface_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
batch_size = 128

def tokenize_inputs(df, tokenizer):
    dataset = Dataset.from_pandas(df)
    tokenized_inputs = dataset.map(
        lambda x: tokenizer(x["description"], padding=True, truncation=True),
        batched=True,
        batch_size=batch_size,
    )
    tokenized_dataset = tokenized_inputs.remove_columns(["description"])
    return tokenized_dataset

train_tokenized_dataset = tokenize_inputs(train_df, tokenizer)
val_tokenized_dataset = tokenize_inputs(val_df, tokenizer)
Modifying the model and classification head
The classification head and sequence classification pipeline need some small modifications to be able to train with our additional features. The code listed in this section is adapted from this Google Colab notebook.
In the first instance we define a new ClassificationHead class with updated dimensionality. The classification head takes as input the configuration of the backbone transformer model, including the number of extra dimensions contributed by the additional features. In our case there is a single extra dimension, giving a linear layer size of 769 instead of DistilBERT's hidden size of 768.
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        num_extra_dims = config.num_extra_dims
        total_dims = config.hidden_size + num_extra_dims
        self.dense = nn.Linear(total_dims, total_dims)
        self.dropout = nn.Dropout(config.dropout)
        self.out_proj = nn.Linear(total_dims, config.num_labels)

    def forward(self, features, **kwargs):
        x = self.dropout(features)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
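As a quick check of the dimensionality described above, the head can be instantiated with a DistilBERT configuration extended with one extra dimension. This is an illustrative sketch, not part of the training code:

from transformers import AutoConfig

# one extra numeric feature gives an input size of 768 + 1 = 769
config = AutoConfig.from_pretrained("distilbert-base-uncased", num_labels=3)
config.num_extra_dims = 1
head = ClassificationHead(config)
print(head.dense)  # Linear(in_features=769, out_features=769, bias=True)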
Next, a custom sequence classification pipeline is defined as a subclass of DistilBertForSequenceClassification. The key change is in the forward method: the text features pass through the backbone transformer model to produce the [CLS] embedding, and the additional features are concatenated to it before classification. For example, if we start with a [CLS] embedding of size 768 and add two numeric features, they are combined to form a single array of length 770. Categorical cross-entropy loss is used, as we are predicting a single label for each record.
from transformers import DistilBertModel, DistilBertForSequenceClassification
from transformers.modeling_outputs import SequenceClassifierOutput

class CustomForSequenceClassification(DistilBertForSequenceClassification):
    def __init__(self, config, num_extra_dims):
        super().__init__(config)
        self.num_labels = config.num_labels
        config.num_extra_dims = num_extra_dims
        config.task = "text-classification"
        self.config = config
        # Add the DistilBertModel with the classifier
        self.distilbert = DistilBertModel(config)
        self.classifier = ClassificationHead(config)
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: torch.LongTensor | None = None,
        attention_mask: torch.FloatTensor | None = None,
        additional_features: torch.FloatTensor | None = None,
        token_type_ids: torch.LongTensor | None = None,
        position_ids: torch.LongTensor | None = None,
        head_mask: torch.FloatTensor | None = None,
        inputs_embeds: torch.FloatTensor | None = None,
        labels: torch.LongTensor | None = None,
        output_attentions: bool | None = None,
        output_hidden_states: bool | None = None,
        return_dict: bool | None = None,
    ) -> tuple | SequenceClassifierOutput:
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.distilbert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # sequence_output will be (batch_size, seq_length, hidden_size)
        sequence_output = outputs[0]
        # additional data should be (batch_size, num_extra_dims)
        cls_embedding = sequence_output[:, 0, :]
        # add the additional features to the output of the DistilBERT
        output = torch.cat((cls_embedding, additional_features), dim=-1)
        logits = self.classifier(output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
The modified architecture processes two types of input. Specifically, the text feature is first transformed into the [CLS] embedding via the DistilBERT model. This embedding is then concatenated with the numerical feature, and the resulting vector is used as the input for the classifier. This overall flow is visually represented in the diagram below.
![Architecture diagram showing the two paths of the inputs: a wine description passes through DistilBERT to create a 768-dimensional [CLS] embedding, while a price passes through as a 1-dimensional numerical feature. Both are then concatenated to form a 769-dimensional vector that feeds into the final Classifier head.](https://engineering.freeagent.com/wp-content/uploads/2026/04/diagram_numerical_distilbert_blogpost-2-724x1024.jpg)
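As a quick sanity check of the shapes involved in this concatenation (illustrative only, using random tensors in place of real embeddings and prices):

import torch

cls_embedding = torch.randn(4, 768)       # a batch of four [CLS] embeddings
additional_features = torch.randn(4, 1)   # one scaled numeric feature per example
combined = torch.cat((cls_embedding, additional_features), dim=-1)
print(combined.shape)  # torch.Size([4, 769])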
Training the modified model
The above changes were made in this way to keep the model compatible with the transformers Trainer class. We use the from_pretrained method of the custom sequence classification pipeline to define the model, which is then passed into the Trainer with the training and evaluation datasets.
from transformers import AutoTokenizer

huggingface_model = "distilbert-base-uncased"
model = CustomForSequenceClassification.from_pretrained(
    huggingface_model,
    num_labels=3,
    num_extra_dims=1,
)
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
from transformers import Trainer, TrainingArguments

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(num_train_epochs=1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=val_tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
eval_result = trainer.evaluate(eval_dataset=val_tokenized_dataset)

model_dir = "custom_model"
trainer.save_model(model_dir)
The model was trained for one epoch and then evaluated on the test set, achieving an accuracy of 85%. For reference, a simple majority vote prediction has an accuracy of 59%.
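For completeness, the majority-class baseline can be computed directly from the label distribution of the test split; a quick sketch, assuming the test_df created earlier is still available:

# accuracy of always predicting the most frequent class
majority_baseline = test_df["labels"].value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")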
Inference on the trained model
To get predictions from our trained model we again modify a class from the transformers package. We define a CustomTextClassificationPipeline which inherits from the TextClassificationPipeline class. Within this class we modify the preprocess method to make sure that only the text inputs are tokenised and that the additional features are added as tensors to the tokenised inputs.
from transformers import TextClassificationPipeline

class CustomTextClassificationPipeline(TextClassificationPipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def preprocess(self, inputs, **tokenizer_kwargs):
        '''We overwrite the preprocess method of the TextClassificationPipeline to include
        the additional features after tokenization of the text inputs.'''
        # Call the original preprocess method to get the tokenized inputs
        tokenized_inputs = super().preprocess(inputs["description"], **tokenizer_kwargs)
        # Put the additional features back into the tokenized inputs
        tokenized_inputs["additional_features"] = torch.tensor([inputs["additional_features"]])
        return tokenized_inputs
During inference the trained model is loaded and used to instantiate a transformers pipeline object, specifying the CustomTextClassificationPipeline defined above. Predictions are made by calling this pipeline on new data.
from transformers import pipeline
from transformers import AutoConfig, AutoTokenizer
import os
import pickle

model_dir = "custom_model"
config = AutoConfig.from_pretrained(model_dir)
model = CustomForSequenceClassification.from_pretrained(
    model_dir,
    config=config,
    num_extra_dims=config.num_extra_dims,
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

with open("scaler/scaler.pkl", "rb") as f:
    scaler = pickle.load(f)

test_df = pd.read_csv("test.csv")
test_df["price"] = scale_numeric_feature(test_df[["price"]], scaler)
preprocessed_df = add_additional_features(test_df, ["price"])
inference_ds = Dataset.from_pandas(preprocessed_df[["additional_features", "description"]])

task = "text-classification"
# Recommended to set the task as an env variable "HF_TASK"
os.environ["HF_TASK"] = task

hf_pipeline = pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    pipeline_class=CustomTextClassificationPipeline,
    top_k=None,
    batch_size=batch_size,
)

label_scores = hf_pipeline(
    list(inference_ds),
    padding=True,
    truncation=True,
)
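For each input the pipeline returns a list of label/score dictionaries. A minimal sketch of turning these into predicted class ids and a test-set accuracy, assuming the default LABEL_<id> names (which correspond to the LabelEncoder ids used during preprocessing):

# take the highest-scoring label per example and map "LABEL_<id>" back to an integer id
predicted_ids = np.array([
    int(max(scores, key=lambda s: s["score"])["label"].split("_")[-1])
    for scores in label_scores
])
test_accuracy = (predicted_ids == preprocessed_df["labels"].values).mean()
print(f"Test accuracy: {test_accuracy:.2f}")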
Model Comparison
We compare our method with the fully text-based transformer model. Here we use the AutoModelForSequenceClassification and AutoTokenizer classes from the transformers library, with the same DistilBERT base model, to predict the neutral, good and excellent wine classes.
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
huggingface_model = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    huggingface_model,
    num_labels=3,
)
tokenizer = AutoTokenizer.from_pretrained(huggingface_model)
We combine the features into one string input using both the description and the wine price, separated by the special [SEP] token. For preprocessing we use the same steps as discussed above, apart from the rescaling of the numerical feature, as the price is now treated as text and thus included in the tokenization.
The text-based model is trained and evaluated using the same dataset and hyperparameters. It also achieves an accuracy of 85%; the two methods gave the same accuracy across multiple runs, with similar evaluation loss values. We trained both models for a single epoch on a training set of about 104,000 samples. Training the numerical model was slightly faster than training the text-based model, but only by a matter of minutes in this particular experiment.
df = pd.DataFrame({
    "description": wine_df['description'] + " [SEP] " + wine_df['price'].astype('str'),
    "labels": wine_df['rating']
})
TARGET_CATEGORIES = ["neutral", "good", "excellent"]
le = LabelEncoder().fit(TARGET_CATEGORIES)
df["labels"] = le.transform(df['labels'])
df = df.dropna(subset=["description", "labels"]).copy()
train_val_df, test_df = train_test_split(df, test_size=0.1, random_state=123)
train_df, val_df = train_test_split(train_val_df, test_size=0.1, random_state=123)
train_tokenized_dataset = tokenize_inputs(train_df, tokenizer)
val_tokenized_dataset = tokenize_inputs(val_df, tokenizer)
test_tokenized_dataset = tokenize_inputs(test_df, tokenizer)
training_args = TrainingArguments(num_train_epochs=1)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=val_tokenized_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
eval_result = trainer.evaluate(eval_dataset=val_tokenized_dataset)
task = "text-classification"
# Recommended to set the task as an env variable "HF_TASK"
os.environ["HF_TASK"] = task

hf_pipeline = pipeline(
    task=task,
    model=model,
    tokenizer=tokenizer,
    top_k=1,
    batch_size=batch_size,
)

label_scores = hf_pipeline(
    list(test_df["description"]),
    padding=True,
    truncation=True,
)
Conclusion
We used two methods to train a transformer model to predict wine categories from text and numeric inputs. In this example, training with the price as a text feature and including it as a numeric feature in the network gave an equivalent categorisation accuracy of 85%. Both outperform the separate LightGBM classifier approach from our previous blog post, which achieved an accuracy of 82%.
So when might it be worth adding additional non-text features to the network? Adding the features as text increases the token length of the inputs, which can in turn increase latency if the maximum token length has to be raised, bearing in mind that DistilBERT has a maximum token length of 512. Training time could also be a factor: in our experiment the numerical model was slightly faster to train than the text-based model. Another consideration is whether the model should see the exact amount at all, as this may encourage overfitting or memorisation of exact amounts, leading to poorer generalisation.
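To illustrate the token-length point, one could compare tokenised lengths with and without the amount appended; a quick sketch using the same DistilBERT tokenizer (numeric strings are typically split into several sub-word tokens):

text_only = "TESCO PAY AT PUMP 1234 GB"
with_amount = text_only + " [SEP] -96.51"
print(len(tokenizer(text_only)["input_ids"]))    # tokens for the description alone
print(len(tokenizer(with_amount)["input_ids"]))  # extra tokens added by the [SEP] and the amount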
While there are many sources on training models that combine text and numerical inputs, we found little documentation on inference and on loading the saved model back from a directory so that it can be integrated seamlessly into a pipeline. We hope this post will be useful to someone else, and please do not hesitate to let us know of any comments or questions.
Useful links
- Read about machine learning-driven transaction categorisation in FreeAgent in this blog post
- Blog on transformer model fine-tuning: Fine-tuning Bert for multiclass classification
- The Google Colab notebook that inspired this approach, used for the CustomForSequenceClassification and the ClassificationHead
- YouTube video by Chris McCormick on Mixing BERT with Categorical and Numerical Features