How to get X and y as NumPy arrays from a TensorFlow prefetch tf.data.Dataset?

Question:

I need to access my X features and Y labels from a prefetch train dataset. I know that if I loop through the dataset, I can print the Xs and Ys. For instance:

for item in train_dataset:
    print(item[0]) #access array with X 
    print(item[1]) #access array Y

But I actually need to separate X from Y and store them in separate NumPy variables, just like X_train and Y_train when using sklearn's train_test_split() function, because they will serve as parameters for another function that does not accept prefetch datasets, only a NumPy array of Xs and a NumPy array of Ys. Does anyone have an idea how this can be done?

Asked By: ForeverLearner


Answers:

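You can call tfds.as_numpy on the prefetch dataset to iterate over its elements as NumPy values, use map to pick out either the features or the labels, collect them with list, and convert the result with np.asarray, as shown below: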

from sklearn.model_selection import train_test_split
import tensorflow_datasets as tfds
import tensorflow as tf
import numpy as np

# Generate random data for Dataset
X = np.random.rand(100,3)
y = np.random.randint(0,2, (100))

# Create tf.data.Dataset from random data
train_dataset = tf.data.Dataset.from_tensor_slices((X,y))
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

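# Extract numpy.array X & y from tf.data.Dataset
# Note: this iterates the dataset twice (once for X, once for y); if the
# dataset is shuffled, the two passes may not return elements in the same order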
X_numpy = np.asarray(list(map(lambda x: x[0], tfds.as_numpy(train_dataset))))
y_numpy = np.asarray(list(map(lambda x: x[1], tfds.as_numpy(train_dataset))))

print(X_numpy.shape)
# (100, 3)
print(y_numpy.shape)
# (100,)

X_train, X_test, y_train, y_test = train_test_split(X_numpy, y_numpy, 
                                                    test_size=0.2, 
                                                    random_state=42)
print(X_train.shape)
# (80, 3)
print(X_test.shape)
# (20, 3)
print(y_train.shape)
# (80,)
print(y_test.shape)
# (20,)
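
If the prefetch dataset is batched, each element holds a whole batch, and the last batch may be smaller than the others, so stacking the elements with np.asarray may not give the expected shape. A minimal sketch of one possible alternative follows, assuming a dataset of (features, labels) tuples. It uses tf.data's own as_numpy_iterator() instead of tfds.as_numpy, reads the dataset in a single pass, and joins the batches with np.concatenate. The name batched_dataset, the batch size of 32, and the random data are assumptions made only for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical example data; 100 rows with batch size 32 leaves a final batch of 4
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, 100)

# Assumed setup: a batched, prefetched dataset of (features, labels) tuples
batched_dataset = (
    tf.data.Dataset.from_tensor_slices((X, y))
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# Iterate once, collecting each batch of features and labels as NumPy arrays,
# so X and y stay aligned even if the dataset were shuffled
X_batches, y_batches = [], []
for X_batch, y_batch in batched_dataset.as_numpy_iterator():
    X_batches.append(X_batch)
    y_batches.append(y_batch)

# Join along the batch axis; this also handles the smaller final batch
X_numpy = np.concatenate(X_batches, axis=0)
y_numpy = np.concatenate(y_batches, axis=0)

print(X_numpy.shape)
print(y_numpy.shape)
```

For an unbatched dataset like the one in the answer above, np.stack over the collected elements would take the place of np.concatenate, since each element is then a single row rather than a batch.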
Answered By: I'mahdi