Test train split while retaining original dimension

Question:

I am trying to split a pandas DataFrame of size 610×9724 (610 users × 9724 movies) so that 80% of the non-null values go into the training set and the remaining 20% go into the test set. The values moved to the test set should be replaced with null in the training set, and the values kept for training should be null in the test set. Both sets would still be 610×9724, just with more nulls than the original dataset.

I would then use SVD on the training set (610×9724) to predict the removed values, which are held in the test set.

I have tried using sklearn's train_test_split, but after splitting, the train set has shape 549×9724 and the validation set has shape 61×9724, which makes it difficult to compute the RMSE between the predictions and the test set. Is there an easy way to do this split?

from sklearn import model_selection

# Pivot the long ratings table into a users x movies matrix
data = df.pivot_table(index='userId', columns='movieId', values='rating')

data_train, data_valid = model_selection.train_test_split(
    data, test_size=0.1, random_state=42
)

print(data.shape) # (610, 9724)
print(data_train.shape) # (549, 9724)
print(data_valid.shape) # (61, 9724)
Asked By: Bothurin

||

Answers:

You can reindex your DataFrames to restore the original dimensions. Every row whose index is missing from a split will be filled with NaN. Note that train_test_split splits whole rows (users), not individual ratings:

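from sklearn.model_selection import train_test_split

# Split by rows (users), then restore the full row index below
train, test = train_test_split(data, test_size=0.2, random_state=42)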

train = train.reindex(data.index)
test = test.reindex(data.index)

Output:

>>> train.shape
(610, 9724)

>>> test.shape
(610, 9724)
Answered By: Corralien
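
The question asks for a split over individual non-null values rather than over whole rows. A minimal sketch of that idea follows, assuming the ratings are numeric and that a uniformly random 20% of the observed cells is acceptable for the test set. The helper names split_non_null and masked_rmse are hypothetical, not part of pandas or scikit-learn, and this is not the method from the answer above.

```python
import numpy as np
import pandas as pd


def split_non_null(data: pd.DataFrame, test_frac: float = 0.2, seed: int = 42):
    """Hypothetical helper: split the non-null cells of `data` into two
    frames that both keep the original shape."""
    rng = np.random.default_rng(seed)
    values = data.to_numpy(dtype=float)

    # Flat positions of all observed (non-NaN) cells.
    observed = np.flatnonzero(~np.isnan(values))
    n_test = int(round(test_frac * observed.size))
    test_pos = rng.choice(observed, size=n_test, replace=False)

    # Boolean mask with True at the cells that go to the test set.
    test_mask = np.zeros(values.size, dtype=bool)
    test_mask[test_pos] = True
    test_mask = test_mask.reshape(values.shape)

    train = data.mask(test_mask)   # test cells become NaN
    test = data.where(test_mask)   # everything except test cells becomes NaN
    return train, test


def masked_rmse(pred: pd.DataFrame, test: pd.DataFrame) -> float:
    """Hypothetical helper: RMSE over the cells that are non-null in `test`."""
    diff = (pred - test).to_numpy(dtype=float)
    return float(np.sqrt(np.nanmean(diff ** 2)))
```

With the data frame from the question, train, test = split_non_null(data) should give two 610×9724 frames whose non-null cells do not overlap. Because both frames share the original index and columns, masked_rmse(predictions, test) compares only the held-out ratings, provided predictions is a DataFrame with the same labels.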