Merging two TensorFlow Datasets, albeit at a different pace

Question:

I am looking for a way to merge a Dataset with another, but by drawing samples from it only occasionally.

For example, given these two Datasets

ds1 = tf.data.Dataset.range(1, 10).repeat()
ds10 = tf.data.Dataset.range(10, 100, 10).repeat()

I would like to add samples from ds10 to those of ds1, but only to every second sample, so that the result would be

ds = my_merge(ds1, ds10)
list(ds)
# 11, 2, 23, 4, 35, 6, 47...

Is this possible? I would like to avoid solutions discarding samples from ds10 as this would be inefficient in my case.

EDIT The resulting ds needs to be a Dataset so that further input pipeline operations (e.g. batching) can be applied.

Asked By: user209974

||

Answers:

You can create your own generator:

import tensorflow as tf
from functools import partial

ds1_unrepeated = tf.data.Dataset.range(1, 10)  # element_spec is read here, before repeat()
ds1_spec = ds1_unrepeated.element_spec
ds1 = ds1_unrepeated.repeat()
ds10 = tf.data.Dataset.range(10, 100, 10).repeat()

def my_merge(iter1, iter2):
    sliced_iter2 = iter(iter2)
    sliced_iter1 = iter(iter1)
    while True:
        # every first sample of a pair gets a sample from iter2 added
        yield next(sliced_iter1) + next(sliced_iter2)
        # every second sample is passed through unchanged
        yield next(sliced_iter1)


ds = tf.data.Dataset.from_generator(partial(my_merge, ds1, ds10), output_signature=ds1_spec)
for element in ds.take(7):  # ds is infinite, so only look at the first 7 elements
    print(element)

Output:

tf.Tensor(11, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(23, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(35, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(47, shape=(), dtype=int64)

Edit: I have updated it so that the result is a Dataset. I think the answer at the top is more efficient; this answer is only worthwhile if the merge should be evaluated as lazily as possible with little knowledge about the inputs, i.e. when the merging can be arbitrarily complex.
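
Since the generator does not depend on TensorFlow itself, its logic can be checked with plain Python iterables first. The following is only a sketch: `merge_every_nth` and its `every` parameter are hypothetical names that generalize the hard-coded "every second sample" pattern, and it assumes the elements support `+`.

```python
from itertools import count, cycle, islice


def merge_every_nth(iter1, iter2, every=2):
    """Yield items of iter1, adding the next item of iter2 to every `every`-th one.

    Hypothetical helper: items of iter2 are only drawn when they are used,
    so none of them is discarded.
    """
    it1, it2 = iter(iter1), iter(iter2)
    for i in count():
        try:
            x = next(it1)
            if i % every == 0:
                x = x + next(it2)
        except StopIteration:
            # stop cleanly as soon as either input runs out
            return
        yield x


ones = cycle(range(1, 10))          # plays the role of ds1
tens = cycle(range(10, 100, 10))    # plays the role of ds10
first_seven = list(islice(merge_every_nth(ones, tens), 7))
print(first_seven)  # [11, 2, 23, 4, 35, 6, 47]
```

The same function could be handed to `tf.data.Dataset.from_generator` through `partial`, in the same way as `my_merge` above.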

Answered By: Ahmed AEK

Modify the ds10 dataset based on a skip parameter:

import numpy as np

skip = 2

# 0 selects the next element of ds10, 1 selects a zero: [0, 1] for skip = 2
pattern = np.concatenate(([0], np.ones(skip - 1))).astype(np.int64)
choice_dataset = tf.data.Dataset.from_tensor_slices(pattern).repeat()

zeros = tf.data.Dataset.range(0, 1).repeat()
ds10 = tf.data.Dataset.choose_from_datasets([ds10, zeros], choice_dataset)

# [10, 0, 20, 0, 30, 0, 40, 0, 50]

Zip both datasets and add their values:

ds = tf.data.Dataset.zip((ds1, ds10))
ds = ds.map(lambda x, y: x + y)

# [11, 2, 23, 4, 35, 6, 47, 8, 59]
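
The two steps above can be wrapped into a single function that returns a Dataset, as asked for in the question. This is a minimal sketch, not a definitive implementation: `merge_with_skip` is a hypothetical name, it assumes both datasets yield scalar int64 elements (as `Dataset.range` does), and it assumes a TensorFlow version that provides `tf.data.Dataset.choose_from_datasets` (2.7 or later, to my knowledge).

```python
import numpy as np
import tensorflow as tf


def merge_with_skip(ds_base, ds_extra, skip=2):
    """Add one element of ds_extra to every `skip`-th element of ds_base."""
    # Pattern [0, 1, ..., 1]: 0 draws from ds_extra, 1 draws a zero filler,
    # so elements of ds_extra are only consumed when they are actually added.
    pattern = np.concatenate(([0], np.ones(skip - 1))).astype(np.int64)
    choice = tf.data.Dataset.from_tensor_slices(pattern).repeat()
    zeros = tf.data.Dataset.range(0, 1).repeat()
    padded = tf.data.Dataset.choose_from_datasets([ds_extra, zeros], choice)
    return tf.data.Dataset.zip((ds_base, padded)).map(lambda x, y: x + y)


ds1 = tf.data.Dataset.range(1, 10).repeat()
ds10 = tf.data.Dataset.range(10, 100, 10).repeat()

merged = merge_with_skip(ds1, ds10, skip=3)
first_six = [int(x) for x in merged.take(6).as_numpy_iterator()]
print(first_six)  # [11, 2, 3, 24, 5, 6]
```

Because the result is an ordinary Dataset, further pipeline operations such as `merged.batch(4)` can be chained onto it.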

Checking the performance:

def time_ds():
  for element in ds.take(1000):
    pass
def time_ds1():
  for element in ds1.take(1000):
    pass

%timeit time_ds()   # 29.3 ms ± 133 µs per loop
%timeit time_ds1()  # 23.5 ms ± 94.7 µs per loop

Answered By: V.M