How can I merge two (or more) TensorFlow datasets?
Question:
I have fetched the CelebA datasets with 3 partitions as follows
>>> celeba_bldr = tfds.builder('celeb_a')
>>> datasets = celeba_bldr.as_dataset()
>>> datasets.keys()
dict_keys(['test', 'train', 'validation'])
ds_train = datasets['train']
ds_test = datasets['test']
ds_valid = datasets['validation']
Now, I want to merge them all into one dataset. For example, I would need to combine the train and validation together, or possibly, merge all of them together and then split them based on a different subject-disjoint criterion of my own. Is there any way to do that?
I could not find any option to do this in the docs https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset
Answers:
Looking at the docs you linked, Dataset seems to have a concatenate method, so I'd presume you can get a joint dataset as:
ds_train = datasets['train']
ds_test = datasets['test']
ds_valid = datasets['validation']
ds = ds_train.concatenate(ds_test).concatenate(ds_valid)
See: https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset#concatenate
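As a quick sanity check, here is a self-contained sketch of the same chaining, using small Dataset.range stand-ins in place of the CelebA splits:

```python
import tensorflow as tf

# Small stand-in datasets (placeholders for train/test/validation)
ds_train = tf.data.Dataset.range(0, 3)  # 0, 1, 2
ds_test = tf.data.Dataset.range(3, 5)   # 3, 4
ds_valid = tf.data.Dataset.range(5, 6)  # 5

# Chain the splits end to end; elements appear in the order the
# datasets are concatenated
ds = ds_train.concatenate(ds_test).concatenate(ds_valid)

print(list(ds.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 5]
```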
I also want to mention that if you need to concatenate multiple datasets (e.g., a list of datasets), you can do it in a more efficient way:
ds_l = [ds_1, ds_2, ds_3] # list of `Dataset` objects
# 1. create dataset where each element is a `tf.data.Dataset` object
ds = tf.data.Dataset.from_tensor_slices(ds_l)
# 2. extract all elements from datasets and concat them into one dataset
concat_ds = ds.interleave(
    lambda x: x,
    cycle_length=1,
    num_parallel_calls=tf.data.AUTOTUNE,
)
You can also use flat_map(), but I suppose using interleave() with parallel calls is faster. In general, interleave is a generalization of flat_map.
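A self-contained sketch of this pattern, again with Dataset.range stand-ins for the real datasets (this assumes a TensorFlow version in which tf.data.Dataset objects can themselves be elements of a dataset):

```python
import tensorflow as tf

# Stand-in list of datasets (the real ds_1/ds_2/ds_3 would come from TFDS)
ds_l = [tf.data.Dataset.range(0, 2),
        tf.data.Dataset.range(2, 4),
        tf.data.Dataset.range(4, 6)]

# 1. Create a dataset whose elements are themselves Dataset objects
ds = tf.data.Dataset.from_tensor_slices(ds_l)

# 2. Flatten: with cycle_length=1 each inner dataset is drained in turn,
#    so the result is an ordered concatenation
concat_ds = ds.interleave(
    lambda x: x,
    cycle_length=1,
    num_parallel_calls=tf.data.AUTOTUNE,
)

print(list(concat_ds.as_numpy_iterator()))  # [0, 1, 2, 3, 4, 5]
```

With cycle_length=1 and the default deterministic behavior, the output order matches a plain concatenation.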
If the datasets are coming from the same TFDS dataset, you can merge them directly with the split API:
ds = tfds.load('celeb_a', split='train+test+validation')
Or use the special all split:
ds = tfds.load('celeb_a', split='all')
Documentation: https://www.tensorflow.org/datasets/splits