In a previous post we described how to fine-tune a BERT model for bank transaction categorisation using Amazon SageMaker and the Hugging Face transformers library. We’ve come along a fair bit since then and are preparing to get this model into production. One of the steps for doing this was to train the model on a much larger dataset than we used during initial experimentation.
Training time scales linearly with the size of the training dataset so we were expecting ~30 hours to train our model on an ml.p3.2xlarge instance (single GPU). To get this down to something more reasonable we wanted to use an instance with multiple GPUs running our training in parallel, hopefully keeping costs roughly constant. A quick search brought us to this distributed training section in the Hugging Face docs.
Of the two distributed training strategies described we decided to take the data parallel approach. Model parallel training, which divides layers of a single copy of the model between GPUs, seemed conceptually a bit trickier and more suited to the case where the model has a very large number of parameters and is therefore difficult to fit into memory.
Since we are using the Trainer API, enabling data parallel training was as easy as adding the following distribution parameter to our Hugging Face Estimator:
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}
and choosing one of the supported instance types. For our availability zone this is currently one of ml.p3.16xlarge, ml.p3dn.24xlarge or ml.p4d.24xlarge, all with 8 GPUs per instance. This gives the following Estimator:
huggingface_estimator = HuggingFace( entry_point='train.py', source_dir='s3://bucket/scripts/', image_uri=TRAIN_IMAGE, instance_type='ml.p3.16xlarge', instance_count=1, role=role, hyperparameters = hyperparameters, distribution = distribution, )
When this distribution parameter is added, the Trainer will make use of the SageMaker Distributed Data Parallel Library.
The data parallel approach is to create an identical copy of the model and training script for each GPU and distribute the data in each training batch between these. These ‘batch shards’ are processed in parallel to produce separate sets of parameter updates, which are averaged using an AllReduce algorithm and applied across the model copies. This can be extended to multiple nodes (instances), each with multiple GPUs, to build up a data parallel cluster.
In practice there is some overhead associated with the intercommunication between nodes/GPUs, but the reduction in training time scales almost linearly with the number of available GPUs. In our case we cut that 30 hour estimate down to just over 5 hours, meaning it cost us roughly the same amount to train our model in a fraction of the time.