PyTorch DataLoaders: Bad file descriptor and EOF errors with num_workers > 0

Question:

Description of the problem

I am encountering strange behavior during neural network training with PyTorch DataLoaders built from a custom dataset. The loaders are configured with num_workers=4 and pin_memory=False.

Most of the time, the training finishes without problems.
Sometimes, it stops at a random moment with the following errors:

  1. OSError: [Errno 9] Bad file descriptor
  2. EOFError

It looks like the error occurs during socket creation when fetching elements from the DataLoader.
The error disappears when I set the number of workers to 0, but I need multiprocessing to speed up my training.
What could be the source of the error? Thank you!

The versions of python and libraries

Python 3.9.12, PyTorch 1.11.0+cu102
EDIT: The error only occurs on clusters.

Output of error file

Traceback (most recent call last):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
Epoch 17:  52%|█████▏    | 253/486 [01:00<00:55,  4.18it/s, loss=1.73]

Traceback (most recent call last):
  File "/my_directory/bench/run_experiments.py", line 251, in <module>
    send(conn, destination_pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 545, in fromfd
    return socket(family, type, proto, nfd)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 9] Bad file descriptor

    main(args)
  File "/my_directory/bench/run_experiments.py", line 183, in main
    run_experiments(args, save_path)
  File "/my_directory/bench/run_experiments.py", line 70, in run_experiments
    ) = run_algorithm(algorithm_params[j], mp[j], ss, dataset)
  File "/my_directorybench/algorithms.py", line 38, in run_algorithm
    data = es(mp,search_space,  dataset, **ps)
  File "/my_directorybench/algorithms.py", line 151, in es
   data = ss.generate_random_dataset(mp,
  File "/my_directorybench/architectures.py", line 241, in generate_random_dataset
    arch_dict = self.query_arch(
  File "/my_directory/bench/architectures.py", line 71, in query_arch
    train_losses, val_losses, model = meta_net.get_val_loss(
  File "/my_directory/bench/meta_neural_net.py", line 50, in get_val_loss
    return self.training(
  File "/my_directorybench/meta_neural_net.py", line 155, in training
    train_loss = self.train_step(model, device, train_loader, epoch)
  File "/my_directory/bench/meta_neural_net.py", line 179, in train_step
    for batch_idx, mini_batch in enumerate(pbar):
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/my_directory/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError

EDIT: The way the data is accessed

    from PIL import Image
    from torch.utils.data import DataLoader

    # extract of the dataset code

    class Dataset:
        def __init__(self, image_files, mask_files):
            self.image_files = image_files
            self.mask_files = mask_files

        def __len__(self):
            # not shown in the original extract; required when shuffle=True
            return len(self.image_files)

        def __getitem__(self, idx):
            img = Image.open(self.image_files[idx]).convert('RGB')
            mask = Image.open(self.mask_files[idx]).convert('L')
            return img, mask

    # extract of the train loader code

    train_loader = DataLoader(
        dataset=train_dataset,
        batch_size=4,
        num_workers=4,
        pin_memory=False,
        shuffle=True,
        drop_last=True,
        persistent_workers=False,
    )
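
For completeness, the loader is consumed roughly like this (a simplified sketch reconstructed from the traceback above, not the exact training code):

    # simplified extract of the training loop that consumes the loader
    from tqdm import tqdm

    pbar = tqdm(train_loader)
    for batch_idx, mini_batch in enumerate(pbar):
        images, masks = mini_batch  # one collated batch produced by a worker process
        # the forward/backward pass goes here; the errors above are raised
        # while the next batch is being fetched from a worker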

Answers:

I finally found a solution. Adding this configuration to the dataset script fixes it:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

By default, the sharing strategy is set to 'file_descriptor'.
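
A minimal sketch of where this can go, assuming it is placed at the top of the script that builds the dataset and the loaders, before any worker processes are started:

import torch.multiprocessing

# list the available strategies and the one currently active
# (on Linux this is typically {'file_descriptor', 'file_system'} and 'file_descriptor')
print(torch.multiprocessing.get_all_sharing_strategies())
print(torch.multiprocessing.get_sharing_strategy())

# switch to 'file_system' before any DataLoader with num_workers > 0 is created
torch.multiprocessing.set_sharing_strategy('file_system')

With 'file_system', tensors are shared through files in shared memory instead of file descriptors passed over Unix sockets, so the per-process file-descriptor limit is no longer a concern; the trade-off noted in the PyTorch documentation is that shared-memory files can leak if processes die unexpectedly.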

I have tried some of the solutions explained in:

  • this issue (increase shared memory, increase the maximum number of open file descriptors, call torch.cuda.empty_cache() at the end of each epoch, …); a sketch of raising the file-descriptor limit from Python follows this list
  • this other issue, which turned out to solve the problem
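
For reference, raising the per-process limit on open file descriptors (the workaround from the first issue) can also be done from Python; a minimal sketch, assuming a Unix-like cluster node:

import resource

# current soft/hard limits on the number of open file descriptors
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# raise the soft limit as far as the hard limit allows
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

This limit is also commonly associated with the received 0 items of ancdata error mentioned at the end.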

As suggested by @AlexMeredith, the error may be linked to the distributed filesystem (Lustre) that some clusters use. The error may also come from distributed shared memory.

This example only shows the dataset implementation; there is no snippet showing what happens with the batches.

In my case, I was storing the batches in an array-like index object, a pattern that has fortunately been described here. Those stored references kept the batch tensors (and the shared memory behind them) alive, so the DataLoader could not shut down its worker subprocesses. Implementing something similar to the snippet below solved the problem for me.

import copy

index = []  # keep only deep copies, not the batches produced by the workers

for batch in data_loader:
    # deep-copy the batch so it no longer references the worker's shared memory
    batch_cp = copy.deepcopy(batch)
    # drop the original reference so the loader can release the batch
    del batch
    index.append(batch_cp["index"])

I also got other errors related to this one, such as:

  • received 0 items of ancdata
  • bad message length
Answered By: pdaawr