RoBERTa (Robustly Optimized BERT Pretraining Approach), which is implemented in PyTorch, modifies key hyperparameters in BERT: it removes BERT’s next-sentence pretraining objective and trains with much larger mini-batches and learning rates. This allows RoBERTa to improve on the masked language modeling objective compared with BERT and leads to better downstream task performance.
RoBERTa is a Transformer model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labeling of any kind (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from that text.
More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the tokens in the input, runs the entire masked sentence through the model, and has to predict the masked tokens. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
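To make the masking step concrete, here is a minimal sketch in plain Python. It is a simplification and not RoBERTa's actual pretraining code: it works on whitespace-separated words instead of subword tokens and always substitutes the mask token, whereas BERT-style masking sometimes keeps the original token or swaps in a random one. The function name mask_tokens and the layout of the labels are my own assumptions.

```python
import random

MASK_TOKEN = "<mask>"  # the mask token string used by RoBERTa tokenizers


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) for a toy MLM example.

    labels holds the original token at masked positions and None elsewhere,
    mirroring the idea that only masked positions are predicted.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)   # hide the token from the model
            labels.append(token)        # remember what has to be predicted
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, mask_prob=0.15, seed=0)
print(masked)
print(labels)
```

Because every unmasked word on both sides of a masked position stays visible, the prediction can use left and right context at once, which is the bidirectional property described above.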
This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks.
Now, coming to our task: with modern technology and constant innovation, vast amounts of information are available on the internet in the form of data. It is therefore imperative to extract information from that data quickly and efficiently, yet it is hard for human beings to do so from large documents in a short time. This is where Natural Language Processing (NLP) comes to the rescue (hallelujah!): it can automatically shorten a long text into a summary, which reduces reading time and speeds up the process of exploring new information. There are two approaches to text summarization in NLP: extraction-based summarization and abstraction-based summarization.
In extraction-based summarization, we build a summary by extracting key phrases or sentences from the source document and combining them. Every word or sentence of the summary is therefore present in the source document, and the stitched-together result can sometimes be grammatically strange. In abstraction-based summarization, the algorithm creates new phrases and sentences that convey the meaning of the original text. Those words or sentences may or may not be present in the original document. Abstractive summarization thus produces summaries the way a human would summarize a long text, which reduces grammatical inconsistencies.
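For contrast with the abstractive model built in this tutorial, here is a tiny extractive baseline. It is only a sketch of the general idea (score sentences by average word frequency and copy the best ones verbatim); the scoring rule and the function name extractive_summary are my own assumptions, not part of the tutorial's pipeline.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        # Average document-level frequency of the words in this sentence.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


review = ("This coffee is great. The coffee arrived fast. "
          "Packaging was a bit dented.")
print(extractive_summary(review, num_sentences=1))
```

Whatever this returns is always a sentence copied from the input, which is exactly the limitation an abstractive model is meant to remove.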
In this tutorial, we are going to walk step by step through an implementation of RoBERTa for the abstractive text summarization task and summarize reviews written by Amazon users. The dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all of the roughly 500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review.
The data is available on Kaggle as the Amazon Fine Food Reviews dataset. Sounds cool?
This tutorial is intended as a straightforward guide to using the amazing models brought to us by Hugging Face for the text summarization task. Hugging Face maintains the very popular Transformers library, which provides pre-trained models for various state-of-the-art transformer architectures. You can check out the official documentation for more interesting features. I won’t be talking about the theory behind the networks or how they work under the hood. If you still want to dive deeper into the network, you can refer to the official paper by Liu et al.
Ready?? Let’s start!!
I have divided the tutorial into nine steps and will be using Python 3 for coding.
- Installing required Libraries.
- Importing the ‘Amazon Food Reviews’ dataset and preparing it.
- Tokenizing the data.
- Encoder-Decoder Model.
- Parameter Setting.
- Metric Definition.
- Training the model.
- Evaluating Using Rouge Scores.
- Predictions on Test data.
STEP 1: Installing required Libraries
We’ll be using RobertaTokenizerFast. This tokenizer inherits from PreTrainedTokenizerFast, which contains most of the main methods. We also use EncoderDecoderModel, which initializes a sequence-to-sequence model with any pretrained autoencoding model as the encoder and any pretrained autoregressive model as the decoder.
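Before going further, it can help to confirm that the libraries are actually installed. The check below is a small sketch using only the standard library; the list of module names is my assumption about what this tutorial needs (note that a pip package name can differ from the import name, for example rouge-score versus rouge_score).

```python
import importlib.util

# Import names assumed for this tutorial; adjust the list to your own setup.
REQUIRED_MODULES = ["torch", "transformers", "datasets", "pandas", "rouge_score"]


def missing_modules(module_names):
    """Return the modules from module_names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Anything reported as missing can then be installed with pip before running the remaining steps.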
STEP 2: Importing the dataset of ‘Amazon Food Reviews’.
In this step, we use the ‘.csv’ file downloaded from the Kaggle Amazon Reviews dataset, drop the irrelevant columns, and keep only the ‘Summary’ and ‘Text’ columns. We then drop the duplicate rows from the dataframe. We will be using the Pandas library for this. The final dataframe contains 568,427 instances.
Now, the pandas DataFrame needs to be converted into a Hugging Face Datasets object. Hugging Face Datasets is a lightweight and extensible library to easily share and access datasets and evaluation metrics for Natural Language Processing (NLP). It offers built-in interoperability with NumPy, Pandas, PyTorch, and TensorFlow 2, is lightweight and fast with a transparent and Pythonic API, and thrives on large datasets.
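The column selection and de-duplication described in this step could look roughly like the sketch below. It runs on a tiny hand-made DataFrame so it works without the Kaggle download; the helper name prepare_reviews, the sample rows, and the file name mentioned in the comment are assumptions for illustration.

```python
import pandas as pd


def prepare_reviews(df):
    """Keep only the 'Summary' and 'Text' columns and drop duplicate rows."""
    df = df[["Summary", "Text"]]
    return df.drop_duplicates().reset_index(drop=True)


# Tiny in-memory stand-in for the Kaggle file. With the real data you would
# start from something like pd.read_csv("Reviews.csv") (file name assumed).
raw = pd.DataFrame({
    "Id": [1, 2, 3],
    "Score": [5, 5, 1],
    "Summary": ["Great taffy", "Great taffy", "Not as advertised"],
    "Text": ["Soft and chewy.", "Soft and chewy.", "Peanuts were small."],
})
clean = prepare_reviews(raw)
print(clean)
```

From here, the Datasets library offers Dataset.from_pandas to turn a DataFrame like clean into a dataset object.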
STEP 3: Tokenizing the data.
Alright, let’s prepare the training data. We have chosen decoder_max_length=8 because 94 percent of the text and summary lengths lie within this range. Next, we make use of .map() to transform each training batch into a batch of model inputs. Since we know that the maximum length that roberta-base can process amounts to 512, we are also interested in the percentage of input samples being longer than the maximum length. Alternatively, one could leverage long-range sequence models, such as Longformer, as the encoder.
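A length limit such as the one above is usually picked by checking what share of the samples fits under it. Here is a minimal sketch of that check. Whitespace splitting stands in for the real tokenizer (an assumption; subword tokenizers such as RoBERTa's usually produce more tokens per text), and the sample summaries are made up.

```python
def share_within_limit(lengths, max_length):
    """Fraction of samples whose token count is at most max_length."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_length) / len(lengths)


# Made-up summaries; with the real data you would measure tokenizer output lengths.
summaries = [
    "Great taffy",
    "Not as advertised",
    "Delight says it all",
    "Good quality dog food my dogs love it very much",
]
lengths = [len(s.split()) for s in summaries]
print(f"{share_within_limit(lengths, 8):.0%} of summaries fit in 8 tokens")
```

The same function applied to the review texts with a limit of 512 gives the percentage of inputs that would be truncated by the encoder.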
"Summary" are tokenized and prepared as the Encoder's
"input_ids" and Decoder's
STEP 4: Encoder-Decoder Model
Now, we will instantiate an encoder and a decoder from the roberta-base pretrained model checkpoints, using two model identifiers: one for the encoder and one for the decoder. After such an Encoder-Decoder model has been trained/fine-tuned, it can be saved/loaded just like any other model. This model is also a PyTorch torch.nn.Module subclass. You can learn about how encoder-decoder models work from here. And if you want to learn more about the Encoder-Decoder architecture, check out this Notebook.
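As a rough sketch of this step, the helper below warm-starts an encoder-decoder model from two checkpoint identifiers using the Transformers class method from_encoder_decoder_pretrained. The helper name build_roberta2roberta and the directory name in the comments are hypothetical, and the exact options the original notebook passed are not shown in the text, so none are assumed here.

```python
def build_roberta2roberta(checkpoint="roberta-base"):
    """Warm-start an encoder-decoder model from a pretrained checkpoint name."""
    # Imported lazily so that defining this helper does not require the library;
    # calling it needs transformers installed and downloads the weights.
    from transformers import EncoderDecoderModel

    # First identifier: encoder checkpoint. Second identifier: decoder checkpoint.
    return EncoderDecoderModel.from_encoder_decoder_pretrained(checkpoint, checkpoint)


# Usage (commented out because it downloads the pretrained weights):
# model = build_roberta2roberta()
# model.save_pretrained("roberta2roberta")   # hypothetical output directory
# model = EncoderDecoderModel.from_pretrained("roberta2roberta")
```

The decoder's cross-attention layers have no pretrained counterpart in an encoder-only checkpoint, so they start from random weights and are learned during fine-tuning.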
STEP 5: Parameter Setting
Now, we will define the relevant parameters, which are then used for beam search during decoding. Let’s start by setting the special tokens.
Encoder-only checkpoints do not always define the special tokens that generation needs. bert-base-cased, for example, does not have an eos_token_id, so its sep_token_id is used instead. Also, we should define a pad_token_id on the config and make sure the correct vocab_size is set. Next, let’s define all parameters related to beam search decoding.
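If beam search is new to you, the toy implementation below shows the core idea on a hand-written probability table. It is a sketch only: it has no length penalty or n-gram blocking, the table values are made up, and the names beam_search and toy_model are my own. The num_beams and max_length arguments play the same role as the generation parameters of the same names in Transformers.

```python
import math


def beam_search(next_log_probs, bos, eos, num_beams=2, max_length=5):
    """Return the best-scoring sequence found by beam search.

    next_log_probs(sequence) must return a dict {token: log_probability}.
    """
    beams = [([bos], 0.0)]  # (sequence, summed log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((seq + [token], score + logp))
        # Keep only the num_beams best candidates for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:num_beams]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # sequences cut off by max_length still compete
    return max(finished, key=lambda c: c[1])[0]


# Made-up next-token probabilities, keyed by the last token of the sequence.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}


def toy_model(seq):
    probs = TABLE.get(seq[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


print(beam_search(toy_model, "<s>", "</s>", num_beams=1))  # greedy decoding
print(beam_search(toy_model, "<s>", "</s>", num_beams=2))
```

With a single beam the search commits to the locally best first token, while two beams keep the runner-up alive long enough to find a sequence with a higher overall probability.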
Alright, let’s now start fine-tuning the RoBERTaShared model.
STEP 6: Metric Computation.
We will be using the ROUGE score for evaluation, so we need to define a function that correctly computes the ROUGE score during validation. ROUGE is a much better metric to track during training than language modeling loss alone. The argument predict_with_generate should be set to True, so that the Seq2SeqTrainer runs generate(...) on the validation data and passes the generated output as predictions to the compute_metric(...) function, which we will define later. For more information on the Seq2SeqTrainer, we advise you to refer to its code.
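To see what the metric measures, here is a simplified ROUGE-N written from scratch. It is a sketch for intuition rather than the implementation used for the reported scores: it tokenizes on whitespace, applies no stemming, and does not cover ROUGE-L. The function names are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f_measure) of n-gram overlap."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


print(rouge_n("great dog food", "great food for dogs", n=1))
print(rouge_n("great dog food", "great dog treats", n=2))
```

Precision asks how much of the generated summary appears in the reference, recall asks how much of the reference was recovered, and the F-measure balances the two.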
STEP 7: Training the model.
Seq2SeqTrainingArguments is used to build the training_args of the Seq2SeqTrainer. Great, now we can pass all arguments to the Seq2SeqTrainer and start fine-tuning. The compute_metric function and the training and validation data are passed into the Seq2SeqTrainer. Executing the following cell will take around 11 hours (wiped out!!). You can decrease the fine-tuning time by using a smaller subset of the data. The training metrics store the training loss, validation loss, and global steps so that the training run can be visualized.
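For the visualization part, the logged values first have to be separated into a training curve and a validation curve. The sketch below assumes log entries shaped like those in trainer.state.log_history (dictionaries with a "step" key plus either "loss" or "eval_loss"); the sample numbers are made up and the helper name is my own.

```python
def split_log_history(log_history):
    """Separate training-loss and validation-loss points by global step."""
    train, val = [], []
    for entry in log_history:
        if "loss" in entry:
            train.append((entry["step"], entry["loss"]))
        if "eval_loss" in entry:
            val.append((entry["step"], entry["eval_loss"]))
    return train, val


# Hand-written entries for illustration only, not results from this tutorial.
history = [
    {"step": 500, "loss": 3.2},
    {"step": 500, "eval_loss": 3.0},
    {"step": 1000, "loss": 2.7},
    {"step": 1000, "eval_loss": 2.8},
]
train_curve, val_curve = split_log_history(history)
print(train_curve)
print(val_curve)
```

Each list of (step, loss) pairs can be handed directly to any plotting library.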
Great!!! Now we have the fine-tuned model with its checkpoints. Finally, it’s time to evaluate it on the test data. Relief?
STEP 8: Evaluating Using Rouge Scores.
Now, we are going to evaluate our RoBERTa model on the test_data. You can use the latest checkpoint of the model if you took a nap during training and the runtime got disconnected (most probably!!! huh?). Otherwise, we are good to go with the trained model.
Assuming that you are loading the model from a checkpoint, we initialize the tokenizer again. Now, we can again leverage the dataset’s handy map() function to generate a summary for each test sample. For each data sample we:
- first, tokenize the review text,
- second, generate the output token ids, and
- third, decode the output token ids to obtain our predicted summary.
Let’s run the map function to obtain the results dictionary that has the model’s predicted summary stored for each sample. Executing the following cell may take around 10 minutes. Finally, we evaluate the predicted summaries using the ROUGE score.
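The three steps listed above can be sketched as a single function that works on one batch. The tokenizer and model are passed in as arguments, so the function itself needs no imports. The "pred" column name, the max_input_length default, and the batch size in the usage note are my assumptions, and moving tensors to a GPU is left out for brevity.

```python
def generate_summary(batch, tokenizer, model, max_input_length=512):
    """Add a 'pred' entry holding one generated summary per review text."""
    # 1. Tokenize the review texts.
    inputs = tokenizer(
        batch["Text"],
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    )
    # 2. Generate the output token ids.
    output_ids = model.generate(inputs.input_ids, attention_mask=inputs.attention_mask)
    # 3. Decode the output token ids into text.
    batch["pred"] = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    return batch
```

With a real tokenizer and model loaded, it could be applied as test_data.map(lambda b: generate_summary(b, tokenizer, model), batched=True, batch_size=16).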
The model achieved ROUGE-1, ROUGE-2, and ROUGE-L scores of 25.67, 15.94, and 25.67, respectively, in terms of F-measure.
STEP 9: Predictions on Test data.
The following are some of our model’s output predictions on test data examples:
We find that fine-tuning RoBERTa performs extremely well on our dataset and is really simple to implement thanks to the open-source Hugging Face Transformers library. The same approach can be extended to any text summarization dataset without any hassle.
1. Anubhav (Author): Pre-processing, Model Training, Literature Survey, Evaluation, Parameter Tuning. Other variations implemented: DistilRoBERTa, BERT, Bi-directional LSTM | LinkedIn
2. Skotiyal: T5 Implementation, Literature Survey | LinkedIn
3. DeDrishti: Data Exploration, LSTM Implementation | LinkedIn
A special Thanks to:
for their constant support and guidance.
Special Mention of https://huggingface.co/patrickvonplaten for being the inspiration of this tutorial.
Thank you for your time.
- Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
- imanubhav - Overview
- RoBERTa: A Robustly Optimized BERT Pretraining Approach