Combining text with numerical and categorical features for classification

Posted on May 17, 2024

Classification with transformer models

A common approach for classification tasks with text data is to fine-tune a pre-trained transformer model on domain-specific data. At FreeAgent we apply this approach to automatically categorise bank transactions, using raw inputs that are a mixture of text, numerical and categorical data types.   

The current approach is to concatenate the input features for each transaction into a single string before passing to the model. 

For example a transaction with the following set of features:

Description      Amount    Industry
Plastic pipes    £-7.99    Plumbing

could be represented by the string: Plastic pipes [SEP] -7.99 [SEP] Plumbing, where [SEP] is a special token indicating the boundary between sentences, which we use here to separate the features. This is a simple way to combine features that has yielded positive performance. We've even noticed that the model interprets the amount feature unexpectedly well.
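As a minimal illustration (using the example transaction above), the concatenation might look like this in code:

# Illustrative only: join the example transaction's features with the [SEP] token
transaction_features = ["Plastic pipes", "-7.99", "Plumbing"]
model_input = " [SEP] ".join(transaction_features)
# -> "Plastic pipes [SEP] -7.99 [SEP] Plumbing"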

This approach, however, doesn't feel like the most natural way to represent the information contained in a numerical feature like the transaction amount. An alternative approach is to use the transformer model to extract a vector representation of the text features, concatenate it with features extracted from the non-text data types, and pass the combination to a separate classifier.

In this post we compare the performance of these two approaches using a dataset of wine reviews to classify wines into 3 classes (neutral, good and excellent). Diagrams 1 and 2 below give an overview of the process for getting from input features to output categories in the two approaches.

Diagram 1: Concatenating all inputs into a single string to fine-tune a base transformer model.
Diagram 2: Preprocessing numerical and categorical features independently from the text, extracting the description's vector representation and training a LightGBM classifier.

Public wine reviews dataset

We use the Kaggle wine reviews dataset for this experiment, which contains a variety of descriptive features for just under 130,000 wines, including a wine score. The wine score is a number between 1 and 100, with 100 being the best, although this dataset only includes reviews for wines that scored over 80. For the purpose of this blog post, we created the wine rating from the scores: wines scoring between 80 and 87 (inclusive) were flagged as "neutral", between 88 and 94 as "good" and above 94 as "excellent".

To keep things simple we will use just the wine description, price and variety to attempt to classify the wines into these 3 ratings. 

We loaded the data and created a target variable rating to split the wine scores into the 3 target categories (neutral, good, excellent).

import numpy as np
import pandas as pd

wine_df = pd.read_csv("data/wine_data.csv")

# pd.cut uses right-inclusive bins: (0, 87] -> neutral, (87, 94] -> good, (94, inf) -> excellent
bins = [0, 87, 94, np.inf]
names = ["neutral", "good", "excellent"]

wine_df["rating"] = pd.cut(wine_df["points"], bins, labels=names)

We can then keep a DataFrame of the features and target. For the purpose of putting together this example we have limited the model input features to the wine variety (the type of grapes used to make the wine), the price (cost for a bottle of wine) and the wine description.

NUMERICAL_FEATURE = "price"
CATEGORICAL_FEATURE = "variety"
TEXT_FEATURE = "description"
TARGET = "rating"
FEATURES = [TEXT_FEATURE, NUMERICAL_FEATURE, CATEGORICAL_FEATURE]

wine_df = wine_df[FEATURES + [TARGET]]

We then split our wine DataFrame 80:20 into training and evaluation sets.

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(wine_df, test_size=0.2)

LightGBM classifier approach

Preprocessing

Preprocessing numerical and categorical features

As a first step we preprocess the numerical and categorical features using scikit-learn's ColumnTransformer. For the numerical feature (price) we decided to fill missing values with the median wine price and scale it using a StandardScaler.

We preprocessed the categorical feature (variety) by filling missing values with "other" and one-hot encoding it.

We saved the output of the ColumnTransformer as a pandas DataFrame to concatenate it later with the vector representation of the text.

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
)

def preprocess_number():
    return make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
    )

def preprocess_categories():
    return make_pipeline(
       SimpleImputer(strategy="constant", fill_value="other", missing_values=np.nan),
       OneHotEncoder(handle_unknown="ignore", sparse_output=False),
    )

def create_preprocessor():

    transformers = [
        ("num_preprocessor", preprocess_number(), [NUMERICAL_FEATURE]),
        ("cat_preprocessor", preprocess_categories(), [CATEGORICAL_FEATURE]),
    ]

    return ColumnTransformer(transformers=transformers, remainder="drop")

column_transformer = create_preprocessor()
column_transformer.set_output(transform="pandas")
preprocessed_num_cat_features_df = column_transformer.fit_transform(
    train_df[[NUMERICAL_FEATURE, CATEGORICAL_FEATURE]]
)

Extracting text vector representation with a transformer model

We then moved on to the preprocessing of the text features. We decided to use distilbert-base-uncased as the base transformer model to extract the vector representation of the wine description. 

BERT-type models use stacks of transformer encoder layers. Each of these blocks processes the tokenized text through a multi-headed self-attention step and a feed-forward neural network, before passing outputs to the next layer. The hidden state is the output of a given layer. The [CLS] token (short for classification) is a special token that represents an entire text sequence. We choose the hidden state of the [CLS] token in the final layer as the vector representation of our wine descriptions.

In order to extract the [CLS] representations, we first transform the text inputs into a Dataset of tokenized PyTorch tensors.  

For this step we tokenized batches of 128 wine descriptions, padded to a fixed length of 120 tokens, which is suitable for our wine descriptions of ~40 words. The max_length parameter should be adjusted to the length of the text feature to prevent truncating longer inputs, and the padding ensures the transformer model processes fixed-shape inputs. The tokenizer returns a dictionary containing input_ids and attention_mask (BERT-style tokenizers may also return token_type_ids). Only the input_ids (the vocabulary index of each token in the wine description) and the attention_mask (a binary tensor indicating which positions are real tokens and which are padding) are required inputs to the model.

The code for the tokenization is shown below:

from datasets import Dataset
from transformers import AutoTokenizer

MODEL_NAME = "distilbert-base-uncased"

def tokenized_pytorch_tensors(
    df: pd.DataFrame,
    column_list: list
) -> Dataset:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    transformers_dataset = Dataset.from_pandas(df)

    def tokenize(model_inputs_batch: dict) -> dict:
        # Pad every description to max_length so all inputs have a fixed shape
        return tokenizer(
            model_inputs_batch[TEXT_FEATURE],
            padding="max_length",
            max_length=120,
            truncation=True,
        )

    tokenized_dataset = transformers_dataset.map(
        tokenize,
        batched=True,
        batch_size=128
    )

    # Keep only the columns the model needs, returned as PyTorch tensors
    columns_to_remove = set(tokenized_dataset.column_names) - set(column_list)
    tokenized_dataset = tokenized_dataset.remove_columns(list(columns_to_remove))
    tokenized_dataset.set_format("torch", columns=column_list)

    return tokenized_dataset

print("Tokenize text in Dataset of Pytorch tensors")
train_df[TEXT_FEATURE] = train_df[TEXT_FEATURE].fillna("")
tokenized_df = tokenized_pytorch_tensors(
    train_df[[TEXT_FEATURE]],
    column_list=["input_ids", "attention_mask"]
)

Now we are in a position to extract the vector representation of the text using our pre-trained model. The code below shows how we returned the last hidden state of our tokenized text inputs and saved the transformer Dataset into a pandas DataFrame. 

import torch
from transformers import AutoModel

def hidden_state_from_text_inputs(df) -> pd.DataFrame:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # Load the model once and move it to the same device as the inputs
    model = AutoModel.from_pretrained(MODEL_NAME).to(device)
    model.eval()

    def extract_hidden_states(batch):
        inputs = {
            k: v.to(device)
            for k, v in batch.items()
            if k in tokenizer.model_input_names
        }

        with torch.no_grad():
            last_hidden_state = model(**inputs).last_hidden_state
            # [:, 0] selects the hidden state of the first token ([CLS]) for each
            # sequence in the batch: one 768-dimensional vector per description
            return {"cls_hidden_state": last_hidden_state[:, 0].cpu().numpy()}

    cls_dataset = df.map(extract_hidden_states, batched=True, batch_size=128)
    cls_dataset.set_format(type="pandas")

    return pd.DataFrame(
        cls_dataset["cls_hidden_state"].to_list(),
        columns=[f"feature_{n}" for n in range(1, 769)],
    )

print("Extract text feature hidden state")
hidden_states_df = hidden_state_from_text_inputs(tokenized_df)
print(f"Data with hidden state shape: {hidden_states_df.shape}") 

All that remains to be done before we can train our classifier is to concatenate the preprocessed features together.

print("Saving preprocessed features and targets")
preprocessed_data = pd.concat(
    [
        preprocessed_num_cat_features_df,
        hidden_states_df,
        train_df[TARGET]
    ],
    axis=1
)

Train a LightGBM model

Encode target and format input names

To train our classifier we first needed to encode our categorical target variable as an integer. To prevent issues when training the LightGBM classifier, we also renamed the feature columns, removing non-alphanumeric characters from the names.
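For reference, a minimal sketch of this step (assuming scikit-learn's LabelEncoder and a simple regex-based rename, which is our own illustrative choice) could look like the following:

import re

from sklearn.preprocessing import LabelEncoder

# Encode the categorical target (neutral/good/excellent) as integers
label_encoder = LabelEncoder()
preprocessed_data["encoded_target"] = label_encoder.fit_transform(preprocessed_data[TARGET])

# Strip non-alphanumeric characters from column names so LightGBM accepts them
preprocessed_data.columns = [
    re.sub(r"[^A-Za-z0-9_]+", "", str(col)) for col in preprocessed_data.columns
]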

We then trained our classifier as shown below:

import lightgbm as lgbm

features = [col for col in list(preprocessed_data.columns) if col not in [TARGET, "encoded_target"]]

# create the model
lgbm_clf = lgbm.LGBMClassifier(
    n_estimators=100,
    max_depth=10,
    num_leaves=10,
    objective="multiclass",
)

lgbm_clf.fit(preprocessed_data[features], preprocessed_data["encoded_target"])

Evaluate the results

We preprocessed our evaluation data in the same way and generated predictions. The predicted category is the one with the highest model output score. Comparing the actual and predicted categories with scikit-learn's accuracy_score, this approach achieved an accuracy of 0.81.
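A sketch of what this evaluation step could look like, reusing the fitted column_transformer, the helper functions defined above, and the label_encoder and column rename from the earlier sketch, is shown below:

import re

from sklearn.metrics import accuracy_score

# Preprocess the evaluation data with the transformers fitted on the training data
test_num_cat_df = column_transformer.transform(
    test_df[[NUMERICAL_FEATURE, CATEGORICAL_FEATURE]]
).reset_index(drop=True)

test_df[TEXT_FEATURE] = test_df[TEXT_FEATURE].fillna("")
tokenized_test_df = tokenized_pytorch_tensors(
    test_df[[TEXT_FEATURE]],
    column_list=["input_ids", "attention_mask"]
)
test_hidden_states_df = hidden_state_from_text_inputs(tokenized_test_df)

# Combine the features and apply the same column rename as for the training data
test_features_df = pd.concat([test_num_cat_df, test_hidden_states_df], axis=1)
test_features_df.columns = [
    re.sub(r"[^A-Za-z0-9_]+", "", str(col)) for col in test_features_df.columns
]

# Predict and compare against the encoded true ratings
predicted = lgbm_clf.predict(test_features_df[features])
accuracy = accuracy_score(label_encoder.transform(test_df[TARGET]), predicted)
print(f"Accuracy: {accuracy:.2f}")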

Fine-tuning a transformer model

Update the training script

We used a SageMaker HuggingFace estimator to fine-tune a distilbert-base-uncased base model, using a train.py training script as the entry point. This base script was updated to include the number of labels for our classifier.
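We won't reproduce the full script here, but a minimal sketch of the relevant part of such a train.py (assuming the standard argparse/Trainer pattern used in SageMaker HuggingFace examples, with argument names matching our hyperparameters) might look like:

import argparse
import os

from datasets import load_from_disk
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=1)
parser.add_argument("--train_batch_size", type=int, default=128)
parser.add_argument("--model_name", type=str, default="distilbert-base-uncased")
args, _ = parser.parse_known_args()

# SageMaker exposes the training channel and model directory as environment variables
train_dataset = load_from_disk(os.environ["SM_CHANNEL_TRAIN"])

# The key change for our task: set the number of output labels to 3
model = AutoModelForSequenceClassification.from_pretrained(
    args.model_name,
    num_labels=3,  # neutral, good, excellent
)

training_args = TrainingArguments(
    output_dir=os.environ["SM_MODEL_DIR"],
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
trainer.save_model(os.environ["SM_MODEL_DIR"])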

Preprocess the data and train the model

The code below shows how we concatenated the text inputs and trained the transformer model. We re-used a lot of the functionality from the previous approach, particularly to tokenize the data.

import sagemaker
from sagemaker.huggingface import HuggingFace
from sklearn.preprocessing import LabelEncoder

from datasets.filesystems import S3FileSystem

ROLE = sagemaker.get_execution_role()

# preprocessing
TARGET_CATEGORIES = ["neutral", "good", "excellent"]
le = LabelEncoder().fit(TARGET_CATEGORIES)
train_df["labels"] = le.transform(train_df[TARGET])

def generate_text_input(df, features):
    # fill missing values, cast everything to strings and join with the [SEP] token
    df[features] = df[features].fillna("").astype(str)
    df["text"] = df[features].agg(" [SEP] ".join, axis=1)
    return df

train_df = generate_text_input(train_df, FEATURES)

# tokenization
tokenized_train_df = tokenized_pytorch_tensors(
    train_df[["text", "labels"]],
    column_list=["input_ids", "attention_mask", "labels"]
)
s3 = S3FileSystem()
training_path = "s3://path_to_training_data"
tokenized_train_df.save_to_disk(training_path, fs=s3)

# training
hyperparameters = {
    "epochs": 1,
    "train_batch_size": 128,
    "model_name": "distilbert-base-uncased",
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="s3://path_to_training.tar.gz",
    output_path="s3://path_to_outputs",
    instance_type="ml.g4dn.xlarge",
    instance_count=1,
    transformers_version="4.6",
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters=hyperparameters,
    role=ROLE,
)

huggingface_estimator.fit(
    {"train": training_path}
)

Evaluation

We evaluated the performance of the trained model on the evaluation data; with this approach we achieved an accuracy of 0.85.
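For reference, one way such an evaluation could be run, reusing the helper functions and label encoder from above and loading the fine-tuned model artefact (the model path below is a placeholder), is sketched here:

import numpy as np
from sklearn.metrics import accuracy_score
from transformers import AutoModelForSequenceClassification, Trainer

# Prepare the evaluation data in the same way as the training data
test_df["labels"] = le.transform(test_df[TARGET])
test_df = generate_text_input(test_df, FEATURES)
tokenized_test_df = tokenized_pytorch_tensors(
    test_df[["text", "labels"]],
    column_list=["input_ids", "attention_mask", "labels"]
)

# Load the fine-tuned model (placeholder path to the unpacked model artefact)
model = AutoModelForSequenceClassification.from_pretrained("path_to_fine_tuned_model")

# The predicted category is the one with the highest output score
output = Trainer(model=model).predict(tokenized_test_df)
predicted_labels = np.argmax(output.predictions, axis=1)
accuracy = accuracy_score(output.label_ids, predicted_labels)
print(f"Accuracy: {accuracy:.2f}")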

It is interesting to see that the model trained on concatenated inputs gave slightly better accuracy. This may be the result of updating the base transformer model's parameters for our specific classification task using our application-specific text inputs. It is however difficult to draw firm conclusions from the accuracy of the two approaches presented above, as we did not spend any time optimising either model.

Conclusions

We created two workflows to combine text data with numerical and categorical features for classification.

We found that the model trained on concatenated inputs (accuracy 0.85) outperformed the LightGBM classifier (accuracy 0.81). In comparison, a baseline that predicted "good" (the most frequent category) for all reviews achieved an accuracy of 0.59. This may highlight the benefit of fine-tuning via backpropagation, where the base transformer model's weights are adjusted to a specific task using problem-specific text.
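For context, the accuracy of that majority-class baseline is simply the frequency of the most common rating in the evaluation set, e.g.:

# Accuracy of always predicting the most frequent rating ("good")
majority_baseline = test_df[TARGET].value_counts(normalize=True).max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")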

Training a transformer model is also more resource-intensive than training a LightGBM one, which could be another consideration when selecting between the two approaches. While the feature preprocessing for the two models is similar (both need to tokenize the text inputs), the requirements for model training are different: the LightGBM model can be trained on a CPU instance, whereas transformer-based models typically require a more expensive GPU to speed up training and inference.

We found it difficult to find documentation to help us build the LightGBM workflow detailed in this post, so we hope it will be useful to someone else. Please do not hesitate to let us know of any comments or questions.

Useful links