How to efficiently convert a large parallel corpus to a Huggingface dataset to train an EncoderDecoderModel?
Question:
Typical EncoderDecoderModel that works on a Pre-coded Dataset
The code snippet below is frequently used to train an EncoderDecoderModel from Huggingface's transformers library:
from transformers import EncoderDecoderModel
from transformers import PreTrainedTokenizerFast
multibert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-uncased", "bert-base-multilingual-uncased"
)
tokenizer = PreTrainedTokenizerFast.from_pretrained("bert-base-multilingual-uncased")
...
A pre-processed/coded dataset can then be used to train the model, for example with the wmt14 dataset:
import datasets
train_data = datasets.load_dataset("wmt14", "de-en", split="train")
val_data = datasets.load_dataset("wmt14", "de-en", split="validation[:10%]")
from functools import partial
def process_data_to_model_inputs(batch, encoder_max_length=512, decoder_max_length=512, batch_size=2):
    # Tokenize the English source and the German target of every pair in the batch.
    inputs = tokenizer([segment["en"] for segment in batch["translation"]],
                       padding="max_length", truncation=True, max_length=encoder_max_length)
    # The target side is limited by decoder_max_length, not encoder_max_length.
    outputs = tokenizer([segment["de"] for segment in batch["translation"]],
                        padding="max_length", truncation=True, max_length=decoder_max_length)
    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    batch["decoder_input_ids"] = outputs.input_ids
    batch["decoder_attention_mask"] = outputs.attention_mask
    batch["labels"] = outputs.input_ids.copy()
    # because BERT automatically shifts the labels, the labels correspond exactly to `decoder_input_ids`.
    # We have to make sure that the PAD token is ignored
    batch["labels"] = [[-100 if token == tokenizer.pad_token_id else token for token in labels] for labels in batch["labels"]]
    return batch
def munge_dataset_to_pacify_bert(dataset, encoder_max_length=512, decoder_max_length=512, batch_size=2):
    bert_wants_to_see = ["input_ids", "attention_mask", "decoder_input_ids",
                         "decoder_attention_mask", "labels"]
    _process_data_to_model_inputs = partial(process_data_to_model_inputs,
                                            encoder_max_length=encoder_max_length,
                                            decoder_max_length=decoder_max_length,
                                            batch_size=batch_size
                                            )
    dataset = dataset.map(_process_data_to_model_inputs,
                          batched=True,
                          batch_size=batch_size
                          )
    dataset.set_format(type="torch", columns=bert_wants_to_see)
    return dataset
train_data = munge_dataset_to_pacify_bert(train_data)
val_data = munge_dataset_to_pacify_bert(val_data)
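The PAD masking step inside process_data_to_model_inputs can be looked at in isolation. The following is a minimal sketch, assuming a hypothetical pad id of 0 and made-up token ids instead of a real tokenizer; the helper name mask_padding is also made up for illustration. The value -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss, so masked positions do not contribute to the loss.

```python
# Hypothetical pad id; with a real tokenizer this would be tokenizer.pad_token_id.
PAD_TOKEN_ID = 0


def mask_padding(label_rows, pad_token_id=PAD_TOKEN_ID, ignore_index=-100):
    """Replace every pad id with ignore_index so the loss skips padded positions."""
    return [
        [ignore_index if token == pad_token_id else token for token in row]
        for row in label_rows
    ]


# Two made-up target sequences that were already padded to length 5.
padded_labels = [[101, 7592, 102, 0, 0], [101, 102, 0, 0, 0]]
masked_labels = mask_padding(padded_labels)
print(masked_labels)
```

Only the labels are masked this way; the decoder_input_ids keep the real pad id, since -100 is not a valid index into the embedding table.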
Then the training can be done easily as such:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments
# set training arguments - these params are not really tuned, feel free to change
training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    evaluation_strategy="steps",
    ...
)
# instantiate trainer
trainer = Seq2SeqTrainer(
    model=multibert,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
)
trainer.train()
A working example can be found at: https://www.kaggle.com/code/alvations/neural-plasticity-bert2bert-on-wmt14
However, parallel data used to train an EncoderDecoderModel usually exists as .txt or .tsv files, not as a pre-coded dataset.
Given a large .tsv file (e.g. 1 billion lines) such as:
hello world\tHallo Welt
how are you?\twie gehts?
...\t...
Step 1: we can convert it into the parquet / pyarrow format, with something like:
import vaex  # Using vaex
filename = "train.en-de.tsv"
df = vaex.from_csv(filename, sep="\t", header=None, names=["src", "trg"], convert=True, chunk_size=50_000_000)
df.export(f"{filename}.parquet")
Step 2: Then we can read it into a Pyarrow table to fit into the datasets.Dataset object and use munge_dataset_to_pacify_bert() as shown above, e.g.
from datasets import Dataset, load_from_disk
import pyarrow.compute as pc
import pyarrow.parquet as pq
# Drop rows with a missing source or target before wrapping the table in a Dataset
_ds = Dataset(pc.drop_null(pq.read_table('train.en-de.tsv.parquet')))
_ds.save_to_disk('train.en-de.tsv.parquet.hfdataset')
_ds = load_from_disk('train.en-de.tsv.parquet.hfdataset')
train_data = munge_dataset_to_pacify_bert(_ds)
train_data.save_to_disk('train.en-de.tsv.parquet.hfdataset')
While the process above works well for a small-ish dataset, e.g. 1-5 million lines of data, when the scale goes to 500 million to 1 billion lines, the last .save_to_disk() call seems to run "forever" and the end is nowhere in sight.
Breaking down the steps in munge_dataset_to_pacify_bert(), there are 2 sub-functions:
dataset.map(_process_data_to_model_inputs, batched=True, batch_size=batch_size)
dataset.set_format(type="torch", columns=bert_wants_to_see)
For the .map() process, it's possible to scale across parallel processes by specifying num_proc:
dataset.map(_process_data_to_model_inputs,
    batched=True, batch_size=100,
    num_proc=32  # number of parallel worker processes
)
And when I tried to process with
num_proc=32
batch_size=100
the .map() function finished processing 500 million lines in 18 hours of compute time on an Intel Xeon E5-2686 @ 2.3GHz with 32 processor cores, optimally.
But somehow the .map() function created 32 temp .arrow files and 128 tmp... binary files. The last save_to_disk call has been running for more than 10 hours and has not finished combining the temp files into the final HF Dataset on disk.
Given the above context, my questions in parts are:
Question (Part 1): When the mapping function ends and has created the temp .arrow and tmp... files, is there a way to read these individually instead of trying to save them into a final directory using the save_to_disk() function?
Question (Part 2): Why is the save_to_disk() function so slow after the mapping, and how can the mapped, processed data be saved faster?
Question (Part 3): Is there a way to avoid the .set_format() call after the .map() and make it part of the _process_data_to_model_inputs function?
Answers:
TL;DR
(Credit for the answer goes to @lhoestq)
If you have a TSV file that looks like this:
hello world\tHallo Welt
how are you?\twie gehts?
...\t...
load the dataset as such:
from datasets import load_dataset
# tatoeba-sentpairs.tsv is a pretty large file.
ds = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv",
                  streaming=True, delimiter="\t", split="train")
In Long
Reasons not to convert to parquet, run map functions and save the outputs:
- Loading a large dataset into parquet is already quite a feat (see Step 1 in the question), so let's avoid that.
- Mapping the data into the BERT format, i.e. munge_dataset_to_pacify_bert, is also quite an expensive operation. If that is done for 1B lines, even if it's parallelized, it will take hours to days to complete.
- The resulting tensors that are saved after dataset.set_format(type="torch") are massive: ~50GB of tsv with 1B lines will easily become TBs of binaries.
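The "TBs of binaries" claim can be sanity-checked with a back-of-the-envelope calculation. This is only a rough sketch under stated assumptions that were not measured: every value is stored as an 8-byte integer, every sequence is padded to the full max_length of 512 used in the question, and all five model columns are kept.

```python
# Rough size of the padded, tokenized corpus (assumptions, not measurements).
NUM_LINES = 1_000_000_000   # 1B sentence pairs
NUM_COLUMNS = 5             # input_ids, attention_mask, decoder_input_ids,
                            # decoder_attention_mask, labels
MAX_LENGTH = 512            # every row padded to this length
BYTES_PER_VALUE = 8         # assumed 64-bit integers


def padded_size_bytes(num_lines, num_columns, max_length, bytes_per_value):
    """Bytes needed when every column holds max_length fixed-width integers per row."""
    return num_lines * num_columns * max_length * bytes_per_value


total_bytes = padded_size_bytes(NUM_LINES, NUM_COLUMNS, MAX_LENGTH, BYTES_PER_VALUE)
total_tb = total_bytes / 1e12
print(f"{total_tb:.2f} TB")
```

Even if some columns end up stored in narrower integer types, the result stays in the terabyte range, which is why padding everything to max_length up front is costly at this scale.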
Instead, use stream-style processing. Huggingface datasets supports it with streaming=True when defining the dataset:
ds = load_dataset("csv", data_files="../input/tatoeba/tatoeba-sentpairs.tsv",
                  streaming=True, delimiter="\t", split="train")
How to efficiently convert a large parallel corpus to a Huggingface dataset to train an EncoderDecoderModel?
A parallel corpus is a collection of texts in two or more languages that are aligned or translated versions of each other. For example, a parallel corpus of English and French sentences can be used to train a machine translation model that can translate from one language to another.
A Huggingface dataset is a standardized and lightweight way of handling and processing data for natural language processing (NLP) tasks. It provides various features such as caching, streaming, filtering, shuffling, and splitting of data. A Huggingface dataset can be created from various sources, such as local files, online files, pandas dataframes, or in-memory data.
An EncoderDecoderModel is a type of model that consists of two sub-models: an encoder and a decoder. The encoder takes an input sequence and encodes it into a hidden representation. The decoder takes the hidden representation and generates an output sequence. An EncoderDecoderModel can be used for various NLP tasks that involve generating sequences, such as machine translation, text summarization, or text generation.
To efficiently convert a large parallel corpus to a Huggingface dataset to train an EncoderDecoderModel, you can follow these steps:
Step 1: Load the parallel corpus from local files or online files
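You can use the load_dataset function from the datasets library to load the parallel corpus from local files or online files. You need to specify the format of the files, such as csv, json, or text, and the column names or fields that contain the source and target texts. For example, if you have a parallel corpus of English and French sentences in two line-aligned csv files, you can load them and combine them into one dataset with an en and a fr column as follows:
from datasets import load_dataset, concatenate_datasets
raw = load_dataset(
    "csv",
    data_files={
        "en": "path/to/english.csv",
        "fr": "path/to/french.csv",
    },
    column_names=["text"],
)
# Each file is loaded as its own split; place them side by side so that row i holds one sentence pair
dataset = concatenate_datasets(
    [raw["en"].rename_column("text", "en"), raw["fr"].rename_column("text", "fr")],
    axis=1,
)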
Step 2: Preprocess the source and target texts
You can use the map method of the dataset object to apply preprocessing functions to the source and target texts. For example, you can tokenize the texts using a tokenizer from the transformers library, such as BertTokenizer or MarianTokenizer, and truncate them to a maximum length. The tokenizer adds its own special tokens (for Marian, the end-of-sequence token </s>), so they do not need to be added by hand, and padding can be left to the data collator at training time. For example, if you want to use a MarianTokenizer for English and French, you can preprocess the texts as follows:
from transformers import MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
def preprocess_function(examples):
    # Tokenize source and target together; the target ids are returned under "labels"
    model_inputs = tokenizer(
        examples["en"],
        text_target=examples["fr"],
        max_length=128,
        truncation=True,
    )
    return model_inputs
dataset = dataset.map(preprocess_function, batched=True)
Step 3: Split the dataset into train, validation, and test sets
You can use the train_test_split method of the dataset object to split the dataset into train, validation, and test sets. You need to specify the ratio or size of each set, and optionally a random seed for reproducibility. For example, if you want to split the dataset into 80% train, 10% validation, and 10% test sets, you can do as follows:
dataset = dataset.train_test_split(test_size=0.1, seed=42)
test_dataset = dataset["test"]
dataset = dataset["train"].train_test_split(test_size=0.1111, seed=42)
train_dataset = dataset["train"]
val_dataset = dataset["test"]
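The second test_size of 0.1111 may look arbitrary: it is roughly 1/9, because the validation set is taken from the 90% that remains after the test set was held out. A minimal arithmetic check, using only the two fractions from the snippet above:

```python
# Check that the two-stage split gives roughly 80/10/10.
first_test_fraction = 0.1       # share of all data held out as the test set
second_test_fraction = 0.1111   # share of the remainder held out for validation

test_share = first_test_fraction
val_share = (1 - first_test_fraction) * second_test_fraction
train_share = 1 - test_share - val_share

print(f"train={train_share:.4f} val={val_share:.4f} test={test_share:.4f}")
```

The split sizes are rounded to whole examples in practice, so the real shares can differ slightly from these fractions.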
Step 4: Train the EncoderDecoderModel using the Seq2SeqTrainer class from the transformers library
You can use the Seq2SeqTrainer class from the transformers library to train the EncoderDecoderModel using the Huggingface dataset. You need to specify the model, the training arguments, the data collator, the training and validation datasets, and optionally the evaluation metrics. For example, if you want to train a MarianMTModel for English to French translation, you can do as follows:
from transformers import MarianMTModel, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-fr")
training_args = Seq2SeqTrainingArguments(
    output_dir="path/to/output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    predict_with_generate=True,
)
# Pads inputs and labels per batch; label padding uses -100 so it is ignored by the loss
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()
Step 5: Evaluate the EncoderDecoderModel on the test set using the Trainer class
You can use the evaluate method of the trainer object to evaluate the EncoderDecoderModel on the test set. You can also use the predict method to generate predictions on the test set. You can use the compute_metrics function to define the evaluation metrics, such as BLEU, ROUGE, or METEOR. For example, if you want to evaluate the MarianMTModel on the test set using BLEU, you can do as follows:
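import numpy as np
import evaluate
metric = evaluate.load("sacrebleu")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # -100 marks ignored positions; swap in the pad id so the ids can be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # sacrebleu expects a list of references for every prediction
    result = metric.compute(predictions=decoded_preds, references=[[ref] for ref in decoded_labels])
    return {"bleu": result["score"]}
trainer.compute_metrics = compute_metrics
test_results = trainer.evaluate(test_dataset)
test_predictions = trainer.predict(test_dataset)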