Difference between tf.data.Dataset.map() and tf.data.Dataset.apply()

Question:

With the recent upgrade to version 1.4, TensorFlow included tf.data in the library core.
One “major new feature” described in the version 1.4 release notes is tf.data.Dataset.apply(), which is a “method for
applying custom transformation functions”. How is this different from the already existing tf.data.Dataset.map()?

Asked By: GPhilo


Answers:

The difference is that map executes one function on every element of the Dataset separately, whereas apply executes one function on the whole Dataset at once (for example group_by_window, which the documentation uses as its example).

The argument of apply is a function that takes a Dataset and returns a Dataset, whereas the argument of map is a function that takes one element and returns one transformed element.
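To make the contrast concrete, here is a minimal sketch. It assumes TensorFlow 2.x with eager execution (newer than the 1.4 release discussed in the question), and the helper names double and keep_even_then_pair are made up for illustration:

```python
import tensorflow as tf

# Assumption: TensorFlow 2.x with eager execution enabled (the default).

def double(x):
    # map_func: called with ONE element, returns one transformed element.
    return x * 2

def keep_even_then_pair(ds):
    # transformation_func: called once with the WHOLE Dataset,
    # returns a new Dataset (hypothetical helper for illustration).
    return ds.filter(lambda x: x % 2 == 0).batch(2)

base = tf.data.Dataset.range(6)

mapped = base.map(double)                  # element-wise
applied = base.apply(keep_even_then_pair)  # dataset-wise

mapped_values = [int(v) for v in mapped.as_numpy_iterator()]
applied_values = [v.tolist() for v in applied.as_numpy_iterator()]
```

Note that keep_even_then_pair changes both the number and the shape of the elements, which a single map_func cannot do.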

Answered By: Sunreef

Sunreef’s answer is absolutely correct. You might still be wondering why we introduced Dataset.apply(), and I thought I’d offer some background.

The tf.data API has a set of core transformations—like Dataset.map() and Dataset.filter()—that are generally useful across a wide range of datasets, unlikely to change, and implemented as methods on the tf.data.Dataset object. In particular, they are subject to the same backwards compatibility guarantees as other core APIs in TensorFlow.

However, the core approach is a bit restrictive. We also want the freedom to experiment with new transformations before adding them to the core, and to allow other library developers to create their own reusable transformations. Therefore, in TensorFlow 1.4 we split out a set of custom transformations that live in tf.contrib.data. The custom transformations include some that have very specific functionality (like tf.contrib.data.sloppy_interleave()), and some where the API is still in flux (like tf.contrib.data.group_by_window()). Originally we implemented these custom transformations as functions from Dataset to Dataset, which had an unfortunate effect on the syntactic flow of a pipeline. For example:

dataset = tf.data.TFRecordDataset(...).map(...)

# Method chaining breaks when we apply a custom transformation.
dataset = custom_transformation(dataset, x, y, z)

dataset = dataset.shuffle(...).repeat(...).batch(...)

Since this seemed to be a common pattern, we added Dataset.apply() as a way to chain core and custom transformations in a single pipeline:

dataset = (tf.data.TFRecordDataset(...)
           .map(...)
           .apply(custom_transformation(x, y, z))
           .shuffle(...)
           .repeat(...)
           .batch(...))
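As a rough illustration of that pattern, a custom transformation can be written as a function that takes its parameters and returns a Dataset-to-Dataset function, which is what apply() expects. This is only a sketch assuming TensorFlow 2.x; scale_and_offset is a hypothetical transformation, not something from tf.contrib.data:

```python
import tensorflow as tf

# Assumption: TensorFlow 2.x with eager execution enabled (the default).

def scale_and_offset(scale, offset):
    """Hypothetical custom transformation, parameterized like custom_transformation(x, y, z)."""
    def _apply_fn(dataset):
        # Receives the upstream Dataset and returns a new Dataset.
        return dataset.map(lambda x: x * scale + offset)
    return _apply_fn

dataset = (tf.data.Dataset.range(4)
           .map(lambda x: x + 1)
           .apply(scale_and_offset(10, 5))  # chains like a built-in method
           .batch(2))

result = [b.tolist() for b in dataset.as_numpy_iterator()]
```

Because the factory returns a plain function, anyone can ship reusable transformations without adding methods to tf.data.Dataset.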

It’s a minor feature in the grand scheme of things, but hopefully it helps to make tf.data programs easier to read, and the library easier to extend.

Answered By: mrry

I don’t have enough reputation to comment, but I just wanted to point out that you can actually use map to apply to multiple elements in a dataset contrary to @sunreef’s comments on his own post.

According to the documentation, map takes as an argument

map_func: A function mapping a nested structure of tensors (having
shapes and types defined by self.output_shapes and self.output_types)
to another nested structure of tensors.

The output_shapes are defined by the dataset and can be changed by API functions such as batch. So, for example, you can do a batch normalization using only dataset.batch and dataset.map:

dataset = dataset ...
# Each transformation returns a new Dataset, so reassign the result.
dataset = dataset.batch(batch_size)
dataset = dataset.map(normalize_fn)
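A runnable version of that idea might look like the following. It assumes TensorFlow 2.x, and the body of normalize_fn here is my own stand-in (simple per-batch standardization), since the original function is not shown:

```python
import tensorflow as tf

# Assumption: TensorFlow 2.x with eager execution enabled (the default).

def normalize_fn(batch):
    # After .batch(), map_func receives a whole batch as one element.
    # Illustrative body: subtract the batch mean, divide by the batch std.
    mean = tf.reduce_mean(batch)
    std = tf.math.reduce_std(batch)
    return (batch - mean) / (std + 1e-8)

dataset = tf.data.Dataset.from_tensor_slices([1.0, 3.0, 10.0, 20.0])
dataset = dataset.batch(2)           # elements now have shape (2,)
dataset = dataset.map(normalize_fn)  # runs once per batch

batches = [b.tolist() for b in dataset.as_numpy_iterator()]
```

Each batch is normalized independently, so map still only ever sees one (batched) element at a time.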

It seems like the primary utility of apply() is when you really want to do a transformation across the entire dataset.

Answered By: zephyrus

Simply put, the argument to transformation_func in apply() is a Dataset; the argument to map_func in map() is a single element.

Answered By: 武状元 Woa