tf.data.Dataset: how to get the dataset size (number of elements in an epoch)?
Question:
Let’s say I have defined a dataset in this way:
filename_dataset = tf.data.Dataset.list_files("{}/*.png".format(dataset))
how can I get the number of elements that are inside the dataset (hence, the number of single elements that compose an epoch)?
I know that tf.data.Dataset already knows the size of the dataset, because the repeat() method allows repeating the input pipeline for a specified number of epochs. So there must be a way to get this information.
Answers:
tf.data.Dataset.list_files creates a tensor called MatchingFiles:0 (with the appropriate prefix if applicable). You could evaluate

tf.shape(tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

to get the number of files.
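A minimal sketch of this idea, assuming a TF 1.x graph-mode session and that the matching-files op keeps the default MatchingFiles name described above (the glob pattern is just a placeholder):

import tensorflow as tf  # TF 1.x, graph mode

filename_dataset = tf.data.Dataset.list_files("images/*.png")

# list_files registers a MatchingFiles tensor in the default graph;
# its first dimension is the number of matched files
num_files_t = tf.shape(
    tf.get_default_graph().get_tensor_by_name('MatchingFiles:0'))[0]

with tf.Session() as sess:
    print(sess.run(num_files_t))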
Of course, this would work in simple cases only, and in particular if you have only one sample (or a known number of samples) per image.
In more complex situations, e.g. when you do not know the number of samples in each file, you can only observe the number of samples as an epoch ends.
To do this, you can watch the number of epochs that is counted by your Dataset. repeat() creates a member called _count, that counts the number of epochs. By observing it during your iterations, you can spot when it changes and compute your dataset size from there.
This counter may be buried in the hierarchy of Datasets that is created when calling member functions successively, so we have to dig it out like this:
import warnings

d = my_dataset
# RepeatDataset seems not to be exposed -- this is a possible workaround
RepeatDataset = type(tf.data.Dataset().repeat())

try:
    while not isinstance(d, RepeatDataset):
        d = d._input_dataset
except AttributeError:
    warnings.warn('no epoch counter found')
    epoch_counter = None
else:
    epoch_counter = d._count
Note that with this technique, the computation of your dataset size is not exact, because the batch during which epoch_counter is incremented typically mixes samples from two successive epochs. So this computation is precise up to your batch length.
len(list(dataset)) works in eager mode, although that’s obviously not a good general solution.
Unfortunately, I don’t believe there is a feature like that yet in TF. With TF 2.0 and eager execution, however, you could just iterate over the dataset:

num_elements = 0
for element in dataset:
    num_elements += 1

This is the most storage-efficient way I could come up with.
This really feels like a feature that should have been added a long time ago. Fingers crossed they add a length feature in a later version.
Take a look here: https://github.com/tensorflow/tensorflow/issues/26966
It doesn’t work for TFRecord datasets, but it works fine for other types.
TL;DR:
num_elements = tf.data.experimental.cardinality(dataset).numpy()
UPDATE:
Use tf.data.experimental.cardinality(dataset) – see here.
In the case of TensorFlow Datasets (tfds) you can use _, info = tfds.load(with_info=True). Then you may call info.splits['train'].num_examples. But even in this case it doesn’t work properly if you define your own split.
So you may either count your files or iterate over the dataset (as described in other answers):
num_training_examples = 0
num_validation_examples = 0

for example in training_set:
    num_training_examples += 1

for example in validation_set:
    num_validation_examples += 1
For some datasets, like COCO, the cardinality function does not return a size. One way to compute the size of a dataset quickly is to use map/reduce, like so:
ds.map(lambda x: 1, num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(tf.constant(0), lambda x,_: x+1)
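A quick usage sketch of this map/reduce count, assuming eager execution and a toy range dataset standing in for something like COCO:

import tensorflow as tf

ds = tf.data.Dataset.range(1000)
# Map every element to 1, then sum the 1s with reduce
count = ds.map(lambda x: 1,
               num_parallel_calls=tf.data.experimental.AUTOTUNE).reduce(
    tf.constant(0), lambda x, _: x + 1)
print(int(count.numpy()))  # 1000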
A bit late to the party, but for a large dataset stored in TFRecord files I used this (TF 1.15):

import tensorflow as tf
tf.compat.v1.enable_eager_execution()

dataset = tf.data.TFRecordDataset('some_path')

# Count in large batches instead of one record at a time
n = 0
take_n = 200000
for samples in dataset.batch(take_n):
    n += int(samples.shape[0])  # the last batch may hold fewer than take_n records
print(n)
In TF 2.0, I do it like this:

num = 0
for num, _ in enumerate(dataset, start=1):  # start at 1 so num ends up equal to the count
    pass
print(f'Number of elements: {num}')
You can use this for TFRecords in TF2:
ds = tf.data.TFRecordDataset(dataset_filenames)
ds_size = sum(1 for _ in ds)
As of TensorFlow >= 2.3 one can use:

dataset.cardinality().numpy()

Note that the .cardinality() method was integrated into the main package (before it was only available in the experimental package).
Note that after applying the filter() operation, cardinality can return -2 (tf.data.UNKNOWN_CARDINALITY).
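A minimal sketch of both cases (TF >= 2.3, toy dataset for illustration):

import tensorflow as tf

ds = tf.data.Dataset.range(10)
print(ds.cardinality().numpy())        # 10

# After filter() the number of surviving elements is unknown statically
filtered = ds.filter(lambda x: x % 2 == 0)
print(filtered.cardinality().numpy())  # -2, i.e. tf.data.UNKNOWN_CARDINALITY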
This has worked for me:

length_dataset = dataset.reduce(0, lambda x, _: x + 1).numpy()

It iterates over your dataset and increments the variable x, which is returned as the length of the dataset.
Let’s say you want to find out the number of examples in the training split of the oxford-iiit-pet dataset:

import tensorflow_datasets as tfds

ds, info = tfds.load('oxford_iiit_pet', split='train', shuffle_files=True, as_supervised=True, with_info=True)
print(info.splits['train'].num_examples)
You can do it in TensorFlow 2.4.0 with just len(filename_dataset).
As of version 2.5.0, you can simply call print(dataset.cardinality()) to see the length and type of the dataset.
I am very surprised that this problem does not have an explicit solution, because it is such a simple feature. When I iterate over the dataset with tqdm, I find that tqdm figures out the data size. How does this work?

for x in tqdm(ds['train']):
    # something
-> 1%| | 15643/1281167 [00:16<07:06, 2964.90it/s]

t = tqdm(ds['train'])
t.total
-> 1281167

Presumably tqdm calls len() on the iterable when it is available, and tf.data.Dataset implements __len__ (in recent TF versions) whenever its cardinality is known.
I saw many methods of getting the number of samples, but actually you can easily do it in Keras with a batched dataset:

len(dataset) * BATCH_SIZE
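A small sketch of the caveat here (a hypothetical in-memory example, assuming eager TF 2.x): len() on a batched dataset counts batches, so multiplying by BATCH_SIZE overcounts when the last batch is not full.

import tensorflow as tf

BATCH_SIZE = 32
ds = tf.data.Dataset.range(100).batch(BATCH_SIZE)

print(len(ds))                                 # 4 batches
print(len(ds) * BATCH_SIZE)                    # 128 -- overcounts the 100 elements
print(sum(int(tf.shape(b)[0]) for b in ds))    # 100 -- exact element count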
In TensorFlow 2.6.0 (I am not sure if it was possible in earlier versions or not):
https://www.tensorflow.org/api_docs/python/tf/data/Dataset#__len__

Dataset.__len__()
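A minimal check, assuming a TF version where __len__ is available (it raises when the length is infinite or unknown):

import tensorflow as tf

ds = tf.data.Dataset.range(42)
print(len(ds))  # 42

# len() raises TypeError for infinite or unknown cardinality, e.g. after repeat()
try:
    len(ds.repeat())
except TypeError as e:
    print(e)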
For earlier TensorFlow versions (2.1 or higher):

sum(dataset.map(lambda x: 1).as_numpy_iterator())

That way you don’t have to load each object in your dataset into memory; instead you map every element to 1 and sum the results.