# Problem loading parallel datasets even after using SubsetRandomSampler

## Question:

I have two parallel datasets, `dataset1` and `dataset2`, and the following is my code to load them in parallel using `SubsetRandomSampler`, where I provide `train_indices` for dataloading.

P.S. Even after setting `num_workers=0` and seeding `np` as well as `torch`, the samples do not get loaded in parallel. Any suggestions are heartily welcome, including methods other than `SubsetRandomSampler`.

```
import torch, numpy as np
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
train_indices = list(range(len(dataset1)))
torch.manual_seed(12)
np.random.seed(12)
np.random.shuffle(train_indices)
sampler = SubsetRandomSampler(train_indices)
dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)
for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
    x = data1
    y = data2
    print(x, y)
```

Output:

```
tensor([5, 1]) tensor([15, 18])
tensor([0, 2]) tensor([14, 12])
tensor([4, 6]) tensor([16, 10])
tensor([8, 9]) tensor([11, 19])
tensor([7, 3]) tensor([17, 13])
```

Expected Output:

```
tensor([5, 1]) tensor([15, 11])
tensor([0, 2]) tensor([10, 12])
tensor([4, 6]) tensor([14, 16])
tensor([8, 9]) tensor([18, 19])
tensor([7, 3]) tensor([17, 13])
```

## Answers:

It looks like you are trying to load the two datasets in parallel, but have them maintain the same shuffled order.

Currently, the code shuffles `train_indices` once and passes the same `SubsetRandomSampler` to both dataloaders. However, this does not guarantee that the same elements will be paired together in the output: `SubsetRandomSampler` draws a fresh random permutation of the given indices every time it is iterated, so the dataloader for `dataset2` visits the indices in a different order from the dataloader for `dataset1`.
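To make the cause concrete, here is a minimal sketch (not part of the original code) that iterates a single `SubsetRandomSampler` twice. The names `first_pass` and `second_pass` are only illustrative. Each pass should contain the same indices, but in an independently drawn order, which is what happens when two dataloaders share the sampler.

```python
import torch
from torch.utils.data import SubsetRandomSampler

torch.manual_seed(12)
indices = list(range(10))
sampler = SubsetRandomSampler(indices)

# Every call to iter() on the sampler (here via list()) draws a new
# random permutation of `indices`, even though the seed was set once.
first_pass = list(sampler)
second_pass = list(sampler)

print(first_pass)
print(second_pass)
# Same elements in both passes, but generally in a different order.
print(sorted(first_pass) == sorted(second_pass))
```

Seeding only fixes the sequence of random numbers; it does not make the two passes identical to each other.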

To achieve your expected output, the two datasets need to be sampled with one and the same sequence of indices. One way to do this is to first combine the two datasets into a single dataset containing tuples of corresponding elements, and then load that combined dataset with a single dataloader. Whatever order the sampler produces, each batch then contains the matching elements from both datasets.

Here is an example of how this could be done:

```
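import torch, numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler

# same data as in the question
dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# combine the two datasets into a single dataset of tuples
combined_dataset = list(zip(dataset1, dataset2))
# shuffle the indices of the combined dataset
train_indices = list(range(len(combined_dataset)))
np.random.seed(12)
np.random.shuffle(train_indices)
# create a single dataloader over the combined dataset
dataloader = DataLoader(combined_dataset, batch_size=2, num_workers=0, sampler=SubsetRandomSampler(train_indices))
# unpack the elements from the tuples in each batch
for i, (data1, data2) in enumerate(dataloader):
    x = data1
    y = data2
    print(x, y)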
```
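As a variation on the combined-dataset idea, the sketch below swaps `list(zip(...))` for `torch.utils.data.TensorDataset` and lets the dataloader shuffle with a seeded `torch.Generator`. This is an alternative not mentioned in the original answer, and the names `paired`, `generator`, and `batches` are only illustrative. It assumes both datasets are tensors with the same first dimension.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset1 = torch.arange(0, 10)
dataset2 = torch.arange(10, 20)

# TensorDataset indexes all tensors together, so item i is
# (dataset1[i], dataset2[i]) regardless of the sampling order.
paired = TensorDataset(dataset1, dataset2)

# A seeded generator makes the shuffle reproducible across runs.
generator = torch.Generator().manual_seed(12)
loader = DataLoader(paired, batch_size=2, shuffle=True, generator=generator)

batches = []
for x, y in loader:
    batches.append((x, y))
    print(x, y)
```

Because there is only one dataloader, there is only one sequence of indices, so the pairing cannot drift apart.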

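Since I was using a random sampler, the differing indices are expected: `SubsetRandomSampler` reshuffles the given indices each time a dataloader starts iterating over it.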

To yield the same (shuffled) indices from both DataLoaders, it is better to create the indices first, and then use a custom sampler:

```
import torch, numpy as np
from torch.utils.data import DataLoader

class MySampler(torch.utils.data.sampler.Sampler):
    """Yields the given indices in exactly the order they were passed in."""
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

dataset1 = torch.tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
dataset2 = torch.tensor([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
# shuffle the indices once, up front
train_indices = list(range(len(dataset1)))
np.random.seed(12)
np.random.shuffle(train_indices)
# both dataloaders replay the same fixed order
sampler = MySampler(train_indices)
dataloader1 = DataLoader(dataset1, batch_size=2, num_workers=0, sampler=sampler)
dataloader2 = DataLoader(dataset2, batch_size=2, num_workers=0, sampler=sampler)
for i, (data1, data2) in enumerate(zip(dataloader1, dataloader2)):
    x = data1
    y = data2
    print(x, y)
```
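A shorter variant of the same idea, offered as a sketch rather than as the forum solution: the `DataLoader` documentation describes `sampler` as a `Sampler` or any iterable with `__len__`, so in recent PyTorch versions a pre-shuffled list can be passed directly. The names `order` and `pairs` are illustrative, and the use of `np.random.default_rng` is an assumption in place of the global NumPy seed.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

dataset1 = torch.arange(0, 10)
dataset2 = torch.arange(10, 20)

# Shuffle the indices once; the resulting list is the fixed visiting order.
rng = np.random.default_rng(12)
order = rng.permutation(len(dataset1)).tolist()

# A plain list is iterated in order, so both loaders see the same sequence.
loader1 = DataLoader(dataset1, batch_size=2, sampler=order)
loader2 = DataLoader(dataset2, batch_size=2, sampler=order)

pairs = list(zip(loader1, loader2))
for x, y in pairs:
    print(x, y)
```

The order stays the same on every epoch, so reshuffle `order` and rebuild the loaders if a different order per epoch is needed.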

P.S. I got the solution by cross-posting on the PyTorch forums, but I still want to keep it here for future readers. Credits to ptrblck.