tf.data.Dataset: set each batch (prefetch)

Question:

I am looking for help thinking through this.

I have a function (that is not a generator) that will give me any number of samples.
Let’s say that all the data I want to train on (1000 samples) can’t fit into memory.
So I want to call this function 10 times, each time getting a smaller number of samples that does fit into memory.

This is a dummy example for simplicity.

import numpy as np

def get_samples(num_samples: int, random_seed=0):
    np.random.seed(random_seed)
    x = np.random.randint(0, 100, num_samples)
    y = np.random.randint(0, 2, num_samples)
    return np.array(list(zip(x, y)))

Again, let’s say get_samples(1000, 0) won’t fit into memory.

So in theory I am looking for something like this:

batch_size = 100
total_num_samples = 1000
batches = []
for i in range(total_num_samples//batch_size):
    batches.append(get_samples(batch_size, i))

But this still loads everything into memory.

Again, this function is a dummy representation; the real one is already defined and is not a generator.

In tf land, I was hoping that:

tf.data.Dataset.batch[0] would equal the output of get_samples(100, 0)
tf.data.Dataset.batch[1] would equal the output of get_samples(100, 1)
tf.data.Dataset.batch[2] would equal the output of get_samples(100, 2)
...
tf.data.Dataset.batch[9] would equal the output of get_samples(100, 9)

I understand that I can use tf.data.Dataset with a generator (and I think you can set up a generator per batch), but my function gives more than a single sample, and the setup is too expensive to run for every single sample.

I was hoping to use tf.data.Dataset.prefetch() to run the get_samples function on every batch, and of course it would call get_samples with the same parameters on every epoch.
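
For what it’s worth, the closest concrete thing I can picture is mapping a batch index to one call of my function. This is just an untested sketch; get_samples_np is a hypothetical wrapper standing in for my real function, returning x and y as separate arrays:

import numpy as np
import tensorflow as tf

def get_samples_np(seed):
    # hypothetical stand-in for the real (non-generator) sample function;
    # returns the x and y arrays for one batch of 100 samples
    np.random.seed(int(seed))
    x = np.random.randint(0, 100, 100).astype(np.int64)
    y = np.random.randint(0, 2, 100).astype(np.int64)
    return x, y

num_batches = 10
dataset = (
    tf.data.Dataset.range(num_batches)  # one element per batch index/seed
    .map(lambda i: tf.numpy_function(get_samples_np, [i], (tf.int64, tf.int64)))
    .prefetch(1)  # build the next batch while the current one is consumed
)

But I have no idea if that is the right way to think about it.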

Sorry if the explanation is convoluted. I’m trying my best to describe the problem.

Anyone have any ideas?

Asked By: lr100


Answers:

This is what I came up with:

import numpy as np
import tensorflow as tf

def simple_static_synthesizer(batch_size, seed=1, verbose=True):
    if verbose:
        print(f"Creating Synthetic Data with seed {seed}")
    rng = np.random.default_rng(seed)
    all_x = []
    all_y = []
    for _ in range(batch_size):
        # each sample: three random features in [0, 100) and a binary label
        x = np.concatenate((
            rng.integers(0, 100, 1, dtype=int),
            rng.integers(0, 100, 1, dtype=int),
            rng.integers(0, 100, 1, dtype=int),
        ))
        y = rng.integers(0, 2, 1, dtype=int)
        all_x.append(x)
        all_y.append(y)
    return all_x, all_y
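
To make the shapes explicit: each x is a (3,) array and each y is a (1,) array, e.g.

all_x, all_y = simple_static_synthesizer(2, seed=1, verbose=False)
print(all_x[0].shape, all_y[0].shape)  # -> (3,) (1,)

which is what the TensorSpec shapes further down are matching.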

def my_generator(total_size, batch_size, seed=0, verbose=True):
    counter = 0
    for i in range(total_size):
        # regenerate the data once per batch, bumping the seed each time
        if counter % batch_size == 0:
            x, y = simple_static_synthesizer(batch_size, seed, verbose)
            seed += 1
        yield x[i % batch_size], y[i % batch_size]
        counter += 1

my_gen = my_generator(10, 2, seed=1)

# See values
for x,y in my_gen:
    print(x,y)

# Call again; this gives the same answer as above
my_gen = my_generator(10,2,seed=1)
for x,y in my_gen:
    print(x,y)

# Dataset with small batches to see if it is doing it correctly
total_samples = 10
batch_size = 2
seed = 5

dataset = tf.data.Dataset.from_generator(
    my_generator,
    args=[total_samples, batch_size, seed],
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.uint8),
        tf.TensorSpec(shape=(1,), dtype=tf.uint8),
    )
)
for i,(x,y) in enumerate(dataset):
    print(x.numpy(),y.numpy())
    if i == 4:
        break # the first 5 samples cover the first 3 synthesizer calls
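
And if you want real batches out of it (plus the prefetching from the question), I believe you can just chain the usual transformations onto the same dataset, e.g.:

# re-group the single samples into batches; each batch then lines up with
# exactly one synthesizer call, and prefetch(1) prepares the next batch
# while the current one is being consumed
batched = dataset.batch(batch_size).prefetch(1)
for x, y in batched:
    print(x.numpy(), y.numpy())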

Wish we could have notebook answers!

Answered By: lr100