How to efficiently convert a large parallel corpus to a Huggingface dataset to train an EncoderDecoderModel?

Question:

Typical EncoderDecoderModel that works on a Pre-coded Dataset

The code snippet below is frequently used to train an EncoderDecoderModel from Huggingface's transformers library:

from transformers import EncoderDecoderModel
from transformers import PreTrainedTokenizerFast

multibert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-uncased", "bert-base-multilingual-uncased"
)


tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-multilingual-uncased")

...

A pre-processed/pre-coded dataset can then be used to train the model, e.g. when using the wmt14 dataset:

import datasets

train_data = datasets.load_dataset("wmt14", "de-en", split="train")
val_data = datasets.load_dataset("wmt14", "de-en", split="validation[:10%]")


from functools import partial

def process_data_to_model_inputs(batch, encoder_max_length=512, decoder_max_length=512, batch_size=2): 
    inputs = tokenizer([segment["en"] for segment in batch['translation']], 
                       padding="max_length", truncation=True, max_length=encoder_max_length)
    outputs = tokenizer([segment["de"] for segment in batch['translation']], 
                       padding="max_length", truncation=True, max_length=decoder_max_length)


    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["decoder_input_ids"] = outputs.input_ids
    batch["decoder_attention_mask"] = outputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()

    # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`. 
    # We have to make sure that the PAD token is ignored
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]
    return batch


def munge_dataset_to_pacify_bert(dataset, encoder_max_length=512, decoder_max_length=512, batch_size=2):
    bert_wants_to_see = ["input_ids", "attention_mask", "decoder_input_ids", 
                         "decoder_attention_mask", "labels"]
    
    _process_data_to_model_inputs = partial(process_data_to_model_inputs, 
                                                encoder_max_length=encoder_max_length, 
                                                decoder_max_length=decoder_max_length, 
                                                batch_size=batch_size
                                           )
    dataset = dataset.map(_process_data_to_model_inputs, 
                           batched=True, 
                           batch_size=batch_size
                          )
    dataset.set_format(type="torch", columns=bert_wants_to_see)
    return dataset

train_data = munge_dataset_to_pacify_bert(train_data)
val_data = munge_dataset_to_pacify_bert(val_data)

Then the training can be done easily as follows:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments


# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    ...
)


# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)

trainer.train()

A working example can be found at: https://www.kaggle.com/code/alvations/neural-plasticity-bert2bert-on-wmt14

However, parallel data used to train an EncoderDecoderModel usually exists as .txt or .tsv files, not as a pre-coded dataset.

Given a large .tsv file (e.g. 1 billion lines), e.g.

hello world\tHallo Welt
how are you?\twie gehts?
...\t...

Step 1: We can convert it into the parquet / pyarrow format; one can do something like:

import vaex  # Using vaex 
import sys

filename = "train.en-de.tsv"

df = vaex.from_csv(filename, sep="\t", header=None, names=["src", "trg"], convert=True, chunk_size=50_000_000)

df.export(f"{filename}.parquet")
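A similar chunked conversion can also be sketched with pyarrow's streaming CSV reader, if one prefers to avoid vaex; the block size here is arbitrary and the column names simply mirror the vaex call above:

import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

read_opts = pacsv.ReadOptions(column_names=["src", "trg"], block_size=64 << 20)
parse_opts = pacsv.ParseOptions(delimiter="\t")

reader = pacsv.open_csv("train.en-de.tsv", read_options=read_opts, parse_options=parse_opts)
writer = None
for chunk in reader:  # each chunk is a pyarrow.RecordBatch, so memory use stays bounded
    if writer is None:
        writer = pq.ParquetWriter("train.en-de.tsv.parquet", chunk.schema)
    writer.write_table(pa.Table.from_batches([chunk]))
if writer is not None:
    writer.close()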

Step 2: Then we can read it into a PyArrow table, fit it into a datasets.Dataset object, and use munge_dataset_to_pacify_bert() as shown above, e.g.:

from datasets import Dataset, load_from_disk
import pyarrow.compute as pc
import pyarrow.parquet as pq

_ds = Dataset(pc.drop_null(pq.read_table('train.en-de.tsv.parquet')))
_ds.save_to_disk('train.en-de.tsv.parquet.hfdataset')

_ds = load_from_disk('train.en-de.tsv.parquet.hfdataset')

train_data = munge_dataset_to_pacify_bert(_ds)

train_data.save_to_disk('train.en-de.tsv.parquet.hfdataset')

While the process above works well for small-ish datasets, e.g. 1-5 million lines of data, when the scale goes up to 500 million to 1 billion lines, the last .save_to_disk() function seems to run "forever" and the end is nowhere in sight.

Breaking down the steps in munge_dataset_to_pacify_bert(), there are two sub-operations:

  • dataset.map(_process_data_to_model_inputs, batched=True, batch_size=batch_size)
  • dataset.set_format(type="torch", columns=bert_wants_to_see)

For the .map() step, it's possible to scale it across parallel processes by specifying:

dataset.map(_process_data_to_model_inputs, 
    batched=True, batch_size=100, 
    num_proc=32  # number of parallel processes.
    )

And when I tried to process with

  • num_proc=32
  • batch_size=100

The .map() function finished processing the 500 million lines in 18 hours of compute time on an Intel Xeon E5-2686 @ 2.3GHz with 32 processor cores, which seems close to optimal.

But somehow the .map() function created 32 temp .arrow files and 128 tmp... binary files, and the final save_to_disk() call has been running for more than 10 hours and has not finished combining the temp file parts into the final HF Dataset on disk.


Given the above context, my questions in parts are:

Question (Part 1): When the mapping function ends and has created the temp .arrow and tmp... files, is there a way to read these individually instead of trying to save them into a final directory using the save_to_disk() function?
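For concreteness, the kind of per-shard loading I have in mind would look roughly like this sketch (the shard file names are just placeholders):

from datasets import Dataset, concatenate_datasets

# Hypothetical shard names; the actual tmp... files are whatever .map() left behind.
shards = [Dataset.from_file(f) for f in ["tmp_shard_00000.arrow", "tmp_shard_00001.arrow"]]
merged = concatenate_datasets(shards)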


Question (Part 2): Why is the save_to_disk() function so slow after the mapping and how can the mapped processed data be saved in a faster manner?


Question (Part 3): Is there a way to avoid the .set_format() function after the .map() and make it part of the _process_data_to_model_inputs function?


Asked By: alvas


Answers:

TL;DR

(Answer’s credits goes to @lhoestq)

If you have a TSV file that looks like this:

hello world\tHallo Welt
how are you?\twie gehts?
...\t...

load the dataset as such:

# tatoeba-sentpairs.tsv is a pretty large file.
ds = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train")


In Long

Reasons not to convert to parquet, run the map functions, and save the outputs:

  • Loading a large dataset into parquet is already quite a feat (see Step 1 in the question), so let's avoid that.
  • Mapping the data into the BERT format, i.e. munge_dataset_to_pacify_bert, is also quite an expensive operation. Even if it is parallelized across processes, doing it for 1B lines will take hours to days to complete.
  • The resulting tensors saved with dataset.set_format(type="torch") are massive; a ~50GB tsv with 1B lines will easily become TBs of binaries.

Instead, use stream-style processing.

Huggingface datasets supports it with streaming=True when defining the dataset:

ds = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv", 
                  streaming=True, delimiter="\t", split="train")
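
For illustration, here is a rough sketch of how the streamed dataset could then be tokenized lazily and fed to the same Seq2SeqTrainer as in the question. The "src"/"trg" column names, the max_steps value and the reuse of multibert and tokenizer from the question are assumptions, not part of the original answer:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Sketch only: adjust "src"/"trg" to the actual column names in the TSV.
def tokenize_batch(batch, max_length=512):
    inputs = tokenizer(batch["src"], padding="max_length", truncation=True, max_length=max_length)
    outputs = tokenizer(batch["trg"], padding="max_length", truncation=True, max_length=max_length)
    labels = [[-100 if t == tokenizer.pad_token_id else t for t in ids] for ids in outputs.input_ids]
    return {"input_ids": inputs.input_ids,
            "attention_mask": inputs.attention_mask,
            "decoder_input_ids": outputs.input_ids,
            "decoder_attention_mask": outputs.attention_mask,
            "labels": labels}

# .map() on a streaming dataset is lazy: batches are only tokenized when the trainer
# pulls them, so no giant Arrow file ever has to be written to disk beforehand.
train_data = ds.map(tokenize_batch, batched=True, remove_columns=["src", "trg"])
train_data = train_data.with_format("torch")  # needs a reasonably recent datasets version

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    max_steps=100_000,  # an IterableDataset has no length, so set max_steps instead of epochs
    per_device_train_batch_size=2,
)

trainer = Seq2SeqTrainer(model=multibert, tokenizer=tokenizer,
                         args=training_args, train_dataset=train_data)
trainer.train()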
Answered By: alvas


A parallel corpus is a collection of texts in two or more languages that are aligned or translated versions of each other. For example, a parallel corpus of English and French sentences can be used to train a machine translation model that can translate from one language to another.

A Huggingface dataset is a standardized and lightweight way of handling and processing data for natural language processing (NLP) tasks. It provides various features such as caching, streaming, filtering, shuffling, and splitting of data. A Huggingface dataset can be created from various sources, such as local files, online files, pandas dataframes, or in-memory data.

An EncoderDecoderModel is a type of model that consists of two sub-models: an encoder and a decoder. The encoder takes an input sequence and encodes it into a hidden representation. The decoder takes the hidden representation and generates an output sequence. An EncoderDecoderModel can be used for various NLP tasks that involve generating sequences, such as machine translation, text summarization, or text generation.

To efficiently convert a large parallel corpus to a Huggingface dataset to train an EncoderDecoderModel, you can follow these steps:

Step 1: Load the parallel corpus from local files or online files

You can use the load_dataset function from the datasets library to load the parallel corpus from local files or online files. You need to specify the format of the files, such as csv, json, or txt, and the column names or fields that contain the source and target texts. For example, if you have a parallel corpus of aligned English and French sentences in a csv file with one column per language, you can load it as follows:

from datasets import load_dataset

# A single csv file with aligned "en" and "fr" columns, one sentence pair per row.
dataset = load_dataset(
    "csv",
    data_files="path/to/parallel_corpus.csv",
    column_names=["en", "fr"],
    split="train",
)
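
A quick sanity check that the rows come out as aligned pairs (the printed values here are only illustrative):

print(dataset)
print(dataset[0])  # e.g. {'en': 'hello world', 'fr': 'bonjour le monde'}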

Step 2: Preprocess the source and target texts

You can use the map method of the dataset object to apply some preprocessing functions to the source and target texts. For example, you can tokenize the texts using a tokenizer from the transformers library, such as BertTokenizer or MarianTokenizer. You can also truncate or pad the texts to a fixed length; the tokenizer adds the special tokens that mark the end of each sequence (e.g. </s>) for you. For example, if you want to use a MarianTokenizer for English and French, you can preprocess the texts as follows:

from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

def preprocess_function(examples):
    # "en" and "fr" are the plain-text columns loaded in Step 1.
    inputs = tokenizer(examples["en"], max_length=128, truncation=True, padding="max_length")
    # Tokenize the targets with the target-language tokenizer; it appends </s> itself
    # (MarianTokenizer has no <s>/bos token, so nothing needs to be added by hand).
    with tokenizer.as_target_tokenizer():
        targets = tokenizer(examples["fr"], max_length=128, truncation=True, padding="max_length")
    # Replace padding ids in the labels by -100 so they are ignored by the loss.
    labels = [[-100 if token == tokenizer.pad_token_id else token for token in ids]
              for ids in targets["input_ids"]]
    return {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"], "labels": labels}

dataset = dataset.map(preprocess_function, batched=True)
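
The decoder_input_ids are created from the labels by the DataCollatorForSeq2Seq used in Step 4, so they do not need to be returned here. Optionally, the same map call can also drop the raw text columns so that only model inputs remain (the column names are the ones assumed in Step 1):

# Keep only the tokenized columns; the raw "en"/"fr" strings are no longer needed.
dataset = dataset.map(preprocess_function, batched=True, remove_columns=["en", "fr"])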

Step 3: Split the dataset into train, validation, and test sets

You can use the train_test_split method of the dataset object to split the dataset into train, validation, and test sets. You need to specify the ratio or size of each set, and optionally a random seed for reproducibility. For example, if you want to split the dataset into 80% train, 10% validation, and 10% test sets, you can do as follows:

dataset = dataset.train_test_split(test_size=0.1, seed=42)
test_dataset = dataset["test"]

dataset = dataset["train"].train_test_split(test_size=0.1111, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]

Step 4: Train the EncoderDecoderModel using the Trainer class from the transformers library

You can use the Trainer class from the transformers library to train the EncoderDecoderModel using the Huggingface dataset. You need to specify the model, the training arguments, the data collator, the training and validation datasets, and optionally the evaluation metrics. For example, if you want to train a MarianMTModel for English to French translation, you can do as follows:

from transformers import MarianMTModel, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")

training_args = Seq2SeqTrainingArguments(
    output_dir="path/to/output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    predict_with_generate=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

trainer.train()

Step 5: Evaluate the EncoderDecoderModel on the test set using the Trainer class

You can use the evaluate method of the trainer object to evaluate the EncoderDecoderModel on the test set. You can also use the predict method to generate predictions on the test set. You can use the compute_metrics function to define the evaluation metrics, such as BLEU, ROUGE, or METEOR. For example, if you want to evaluate the MarianMTModel on the test set using BLEU, you can do as follows:

from datasets import load_metric
import numpy as np

metric = load_metric("sacrebleu")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Replace the -100 padding in the labels before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # sacrebleu expects a list of reference translations per prediction.
    return metric.compute(predictions=predictions, references=[[label] for label in labels])

trainer.compute_metrics = compute_metrics

test_results = trainer.evaluate(test_dataset)
test_predictions = trainer.predict(test_dataset)
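
For completeness, a small sketch of turning the generated ids back into text (this assumes predict_with_generate=True, as set in the training arguments above):

# test_predictions.predictions holds the generated token ids when predict_with_generate=True.
pred_texts = tokenizer.batch_decode(test_predictions.predictions, skip_special_tokens=True)
print(pred_texts[:3])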
Answered By: Ahmed Mohamed