Jesus Leal ML, Data Science and Deep Learning

Using RoBERTa for text classification

One of the most interesting architectures derived from the BERT revolution is RoBERTa, which stands for Robustly Optimized BERT Pretraining Approach. The authors of the paper found that while BERT provided an impressive performance boost across multiple tasks, it was undertrained. They suggest a series of modifications to the way the original BERT is pretrained that achieve SOTA results across multiple tasks, such as:

  • Training the model for longer with bigger batches and more data
  • Removing the next sentence prediction objective
  • Dynamically masking during pretraining.
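
To make the last point concrete, here is a minimal sketch of the difference between static and dynamic masking. It is a simplification and not the code used in the paper: the helper mask_tokens, the "<mask>" string and the 15% rate applied independently to each token are illustrative assumptions, and the real procedure also sometimes keeps or randomly replaces the selected tokens.

```python
import random

MASK = "<mask>"  # placeholder string, assumed for illustration


def mask_tokens(tokens, mask_prob, rng):
    """Return a copy of tokens where each token is replaced by MASK with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


tokens = "the movie was surprisingly good and well acted".split()

# Static masking: the mask is sampled once during preprocessing and reused in every epoch.
static_view = mask_tokens(tokens, 0.15, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a new mask is sampled every time the sequence is fed to the model.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, 0.15, rng) for _ in range(3)]

for epoch, view in enumerate(dynamic_epochs):
    print(epoch, " ".join(view))
```

With static masking the model sees the same corrupted sentence in every epoch, while with dynamic masking each pass can hide different words.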

RoBERTa also uses a different tokenizer than BERT, byte-level BPE (the same as GPT-2), and has a larger vocabulary (50k vs 30k). The authors of the paper recognize that a larger vocabulary that allows the model to represent any word results in more parameters (15 million more for base RoBERTa), but argue that the increase in complexity is justified by the gains in performance. For a nice overview of BERT I recommend this tutorial with in-depth explanations by Chris McCormick.
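
The 15 million figure can be checked with some quick arithmetic on the token embedding matrix. The sketch below assumes the vocabulary sizes of the published bert-base-uncased and roberta-base checkpoints (30,522 and 50,265) and a hidden size of 768; those exact numbers are not stated in this post, and the helper embedding_params is a name made up for illustration.

```python
HIDDEN_SIZE = 768        # embedding width of the base models (assumed)
BERT_VOCAB = 30_522      # bert-base-uncased WordPiece vocabulary (assumed)
ROBERTA_VOCAB = 50_265   # roberta-base byte-level BPE vocabulary (assumed)


def embedding_params(vocab_size, hidden_size=HIDDEN_SIZE):
    """Number of weights in a vocab_size x hidden_size token embedding matrix."""
    return vocab_size * hidden_size


extra = embedding_params(ROBERTA_VOCAB) - embedding_params(BERT_VOCAB)
print(f"{extra:,}")  # 15,162,624
```

So almost all of the extra parameters come from the additional rows of the embedding matrix.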

In this post I will explore how to use RoBERTa for text classification with the Hugging Face libraries Transformers and Datasets (formerly known as nlp). For this tutorial I chose the famous IMDB dataset. I made this decision for two reasons: 1) IMDB is a standard dataset used in many papers, so the average reader is more likely to know it or to have worked with it; 2) it is a good pretext to get to know the datasets library better. I also wanted to get more familiar with some of the new tools introduced by the Transformers library, such as the native Trainer class. The most recent version of the Hugging Face library highlights how easy it is to train a model for text classification with this new helper class.

This is not an extensive exploration of either RoBERTa or BERT, but should be seen as a practical guide on how to use them in your own projects.

import pandas as pd
import datasets
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification,Trainer, TrainingArguments
import torch.nn as nn
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
import wandb
import os

The datasets library takes care of downloading and processing NLP datasets, which saves time when preparing data for modelling. First we load the dataset by calling the load_dataset function. If the dataset has not been downloaded yet, the library downloads it and saves it in the default datasets folder.

This example provided by Hugging Face uses an older version of datasets (still called nlp) and demonstrates how to use the Trainer class with BERT. Today's tutorial follows several of the concepts described there.

The dataset class has multiple useful methods to easily load, process and apply transformations to the dataset. We can even load the data and split it into train and test sets by passing a list to the split argument.

train_data, test_data = datasets.load_dataset('imdb', split =['train', 'test'])

The resulting objects are Arrow datasets, a format optimized to work with all the attributes of the original dataset, including the original text, labels, types, number of rows, etc.


We can operate directly on the dataset and tokenize the text using another one of the Hugging Face libraries, Tokenizers. That library provides Rust-optimized code to process the data and return all the necessary inputs for the model, such as masks, token ids, etc. We simply load the model and the tokenizer by specifying the name of the model; if we want to use a fine-tuned model or a model trained from scratch, we simply change the name of the model to the location of the pretrained model.

We can apply the tokenizer to the train and test subsets using the RobertaTokenizerFast class from the Transformers library. To do that we simply define a function that calls the tokenizer. We can specify whether we want to add padding, whether we want to truncate sentences that are longer than the established maximum length, etc. The call returns a BatchEncoding object that holds all the necessary inputs for the model, such as token ids, attention masks, etc. We can then use the map method to apply the tokenization function to all the elements of each split of the dataset.

# load model and tokenizer and define length of the text sequence
model = RobertaForSequenceClassification.from_pretrained('roberta-base')
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', max_length = 512)
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.bias', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
# define a function that will tokenize the text and return the relevant inputs for the model
def tokenization(batched_text):
    return tokenizer(batched_text['text'], padding = True, truncation=True)

# tokenize each split in a single batch
train_data = train_data.map(tokenization, batched = True, batch_size = len(train_data))
test_data = test_data.map(tokenization, batched = True, batch_size = len(test_data))
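
To illustrate what padding and truncation do to a batch, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the function pad_and_truncate is hypothetical, the token ids are made up, and a padding id of 1 is assumed (the id RoBERTa uses for its pad token). The real tokenizer also takes care to keep the special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length, pad_id=1):
    """Cut every sequence to max_length, then pad all of them to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# three made-up sequences of token ids with different lengths
batch = [[0, 5, 6, 2], [0, 7, 2], [0, 8, 9, 10, 11, 2]]
encoded = pad_and_truncate(batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Because padding is computed per batch, mapping over each split as one single batch pads every review in that split to the same length.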

Once the tokenization process is finished we can set the column names and types.

train_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_data.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])

The Trainer helper class is designed to facilitate the fine-tuning of models using the Transformers library. The Trainer class depends on another class called TrainingArguments that contains all the attributes needed to customize the training. TrainingArguments contains useful parameters such as the output directory to save the state of the model, the number of epochs to fine-tune the model, the use of mixed precision tensors (available with the Apex library), warmup steps, etc. Using the same class we can also ask the Trainer to evaluate the model at the end of each training epoch rather than after a fixed number of steps. To make sure we evaluate at the end of each training epoch we set evaluation_strategy = 'epoch'. For this case we also set the option load_best_model_at_end to True, which ensures that the best model (according to the metrics defined) is loaded for evaluation at the end of training.

The Trainer class also allows us to use more sophisticated optimizers and learning rate schedules, which can be passed in through the optimizers argument. For this tutorial I use the default optimizer provided by the library, AdamW. AdamW is an optimizer based on the original Adam (Adaptive Moment Estimation) that incorporates a weight decay regularization term designed to work well with adaptive optimizers; a pretty good discussion of Adam, AdamW and the importance of regularization can be found here. The class also uses a default scheduler to modify the learning rate as the training of the model progresses. The default scheduler in the Trainer class is get_linear_schedule_with_warmup, a scheduler that, after an optional warmup period, decreases the learning rate linearly until it reaches zero. As mentioned before, we can also modify the default values to use a different scheduler. For the learning rate I chose the default of 5e-5, as I wanted to be conservative since this is an already pretrained model. Further, Sun et al. found that a learning rate of 5e-5 works well for text classification. I did not modify any of the other parameters of AdamW.
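
The shape of this schedule is easy to reproduce in plain Python. The sketch below is an illustration rather than the library code: the function name lr_at_step is hypothetical, the total of 1170 optimizer steps is taken from the training logs further down, and zero warmup steps is an assumption.

```python
def lr_at_step(step, num_training_steps, num_warmup_steps=0, base_lr=5e-5):
    """Learning rate after `step` optimizer updates under linear warmup and linear decay."""
    if step < num_warmup_steps:
        # ramp up linearly from 0 to base_lr during warmup
        scale = step / max(1, num_warmup_steps)
    else:
        # then decay linearly so the rate hits 0 at the last step
        remaining = num_training_steps - step
        scale = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * scale


TOTAL_STEPS = 1170  # 3 epochs of 390 optimizer steps each
for step in (0, 390, 780, 1170):
    print(step, lr_at_step(step, TOTAL_STEPS))
```

Under these assumptions the learning rate starts at 5e-5 and has dropped by a third of its initial value by the end of each epoch.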

Trainer also makes gradient accumulation pretty straightforward. This is relevant when we need to train models on smaller GPUs. For this tutorial I will be using a GeForce GTX 1080, which has 8GB of memory. Given the size of the model (in this case 125 million parameters) and the limited memory, I use a per-device batch size of 4 with 16 gradient accumulation steps, which gives an effective batch size of 64.
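
A quick back-of-the-envelope calculation shows how gradient accumulation translates into optimizer updates. This is a sketch under a few assumptions that are not spelled out in the post: 25,000 training reviews in IMDB, a single GPU, and 3 training epochs. The helper optimizer_updates is a made-up name.

```python
import math


def optimizer_updates(num_examples, per_device_batch_size, accumulation_steps, epochs):
    """Return (effective batch size, batches per epoch, updates per epoch, total updates)."""
    effective_batch = per_device_batch_size * accumulation_steps
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # weights are only updated once every `accumulation_steps` batches
    updates_per_epoch = batches_per_epoch // accumulation_steps
    return effective_batch, batches_per_epoch, updates_per_epoch, updates_per_epoch * epochs


print(optimizer_updates(25_000, 4, 16, 3))  # (64, 6250, 390, 1170)
```

The 390 updates per epoch agree with the step counts reported in the evaluation logs (390, 780 and 1170).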

We can also define whether we want to log the training to wandb. Wandb, short for Weights & Biases, is a service that lets you visualize the performance of your model and its parameters in a very nice dashboard. In this tutorial I assume you have wandb installed and configured to log the information of weights and parameters. A detailed tutorial of wandb can be found here. We define the name of the run with run_name in the TrainingArguments class to easily keep track of the model.

Finally, we can also specify the metrics used to evaluate the performance of the model on the test set with the compute_metrics argument of the Trainer class. In this example I selected accuracy, F1 score, precision and recall, as suggested in the tutorial by Hugging Face, and wrapped them in a function that returns the values for these metrics. This set of metrics provides a very good idea of the performance of the model.
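
For readers who want to see what these four numbers mean, here is a small worked example in plain Python on made-up labels. It is only a sketch for binary labels (1 = positive, 0 = negative); the function binary_metrics is hypothetical and is not the scikit-learn implementation used in this tutorial.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    accuracy = sum(1 for y, p in pairs if y == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'f1': f1, 'precision': precision, 'recall': recall}


labels = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up ground truth
preds = [1, 1, 0, 0, 1, 0, 0, 0]   # made-up predictions
metrics = binary_metrics(labels, preds)
print(metrics)
```

Here the classifier finds 2 of the 4 positive reviews (recall 0.5) and 2 of its 3 positive predictions are right (precision of about 0.67), which is why it helps to look at more than accuracy alone.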

# define accuracy metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# define the training arguments
training_args = TrainingArguments(
    output_dir = '/media/jlealtru/data_files/github/website_tutorials/results',
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 16,
    per_device_eval_batch_size = 8,
    evaluation_strategy = "epoch",
    disable_tqdm = False,
    load_best_model_at_end = True,
    logging_steps = 8,
    fp16 = True,
    dataloader_num_workers = 8,
    run_name = 'roberta-classification'
)

# instantiate the trainer class and check for available devices
trainer = Trainer(
    model = model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = train_data,
    eval_dataset = test_data
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# train the model
trainer.train()
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic

{'eval_loss': 0.1449167009449005, 'eval_accuracy': 0.94616, 'eval_f1': 0.9456951504881789, 'eval_precision': 0.9539313039231646, 'eval_recall': 0.9376, 'epoch': 0.9984, 'total_flos': 9572902656000000, 'step': 390}

{'eval_loss': 0.14722128438472748, 'eval_accuracy': 0.95012, 'eval_f1': 0.9501738122827347, 'eval_precision': 0.9491498363534765, 'eval_recall': 0.9512, 'epoch': 1.9984, 'total_flos': 19145805312000000, 'step': 780}

{'eval_loss': 0.14871138486862182, 'eval_accuracy': 0.95652, 'eval_f1': 0.9568188138084455, 'eval_precision': 0.9502880138877929, 'eval_recall': 0.96344, 'epoch': 2.9984, 'total_flos': 28703391323750400, 'step': 1170}

After the training has been completed we can evaluate the performance of the model and make sure we are loading the right model.

trainer.evaluate()

The best iteration of our model achieved an accuracy of 0.9565, which would put us in third place on the leaderboard for sentiment analysis classification with IMDB.

That's it for this tutorial; hopefully you will find it helpful. The full version of this notebook can be found here.