PyTorch DataLoaders: Bad file descriptor and EOF for workers > 0
Question:
Description of the problem
I am encountering strange behavior during neural network training with PyTorch DataLoaders built from a custom dataset. The DataLoaders are set with num_workers=4, pin_memory=False.
Most of the time, the training finishes with no problems.
Sometimes, it stops at a random moment with the following errors:
- OSError: [Errno 9] Bad file descriptor
- EOFError
It looks like the error occurs during socket creation to access dataloader elements.
The error disappears when I set the number of workers to 0, but I need to accelerate my training with multiprocessing.
What could be the source of the error? Thank you!
The versions of python and libraries
Python 3.9.12, PyTorch 1.11.0+cu102
EDIT: The error only occurs on clusters.
Output of error file
Traceback (most recent call last):
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 145, in _serve
Epoch 17: 52%|█████▏ | 253/486 [01:00<00:55, 4.18it/s, loss=1.73]
Traceback (most recent call last):
File "/my_directory/bench/run_experiments.py", line 251, in <module>
send(conn, destination_pid)
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 50, in send
reduction.send_handle(conn, new_fd, pid)
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 183, in send_handle
with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 545, in fromfd
return socket(family, type, proto, nfd)
File "/my_directory/.conda/envs/geoseg/lib/python3.9/socket.py", line 232, in __init__
_socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 9] Bad file descriptor
main(args)
File "/my_directory/bench/run_experiments.py", line 183, in main
run_experiments(args, save_path)
File "/my_directory/bench/run_experiments.py", line 70, in run_experiments
) = run_algorithm(algorithm_params[j], mp[j], ss, dataset)
File "/my_directorybench/algorithms.py", line 38, in run_algorithm
data = es(mp,search_space, dataset, **ps)
File "/my_directorybench/algorithms.py", line 151, in es
data = ss.generate_random_dataset(mp,
File "/my_directorybench/architectures.py", line 241, in generate_random_dataset
arch_dict = self.query_arch(
File "/my_directory/bench/architectures.py", line 71, in query_arch
train_losses, val_losses, model = meta_net.get_val_loss(
File "/my_directory/bench/meta_neural_net.py", line 50, in get_val_loss
return self.training(
File "/my_directorybench/meta_neural_net.py", line 155, in training
train_loss = self.train_step(model, device, train_loader, epoch)
File "/my_directory/bench/meta_neural_net.py", line 179, in train_step
for batch_idx, mini_batch in enumerate(pbar):
File "/my_directory/.conda/envs/geoseg/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
idx, data = self._get_data()
File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
success, data = self._try_get_data()
File "/my_directory/.local/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/my_directory/.local/lib/python3.9/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
fd = df.detach()
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 189, in recv_handle
return recvfds(s, 1)[0]
File "/my_directory/.conda/envs/geoseg/lib/python3.9/multiprocessing/reduction.py", line 159, in recvfds
raise EOFError
EOFError
EDIT: How the data is accessed
from PIL import Image
from torch.utils.data import DataLoader

# extract of the dataset code
class Dataset():
    def __init__(self, image_files, mask_files):
        self.image_files = image_files
        self.mask_files = mask_files

    def __getitem__(self, idx):
        img = Image.open(self.image_files[idx]).convert('RGB')
        mask = Image.open(self.mask_files[idx]).convert('L')
        return img, mask

# extract of the train loader code
train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=4,
    num_workers=4,
    pin_memory=False,
    shuffle=True,
    drop_last=True,
    persistent_workers=False,
)
Answers:
I have finally found a solution. Adding this configuration to the dataset script works:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
By default, the sharing strategy is set to 'file_descriptor'.
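For reference, a minimal sketch (not from the original post) of how to check which strategies the platform supports and which one is active before switching:

import torch.multiprocessing as mp

# Strategies supported on this platform; on Linux this is typically
# {'file_descriptor', 'file_system'}, with 'file_descriptor' active by default.
print(mp.get_all_sharing_strategies())
print(mp.get_sharing_strategy())

# Share tensors through files in a temporary directory instead of passing
# file descriptors over Unix sockets between the loader worker processes.
mp.set_sharing_strategy('file_system')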
I have tried some of the solutions explained in:
- this issue (increase shared memory, increase the maximum number of open file descriptors, call torch.cuda.empty_cache() at the end of each epoch, …); a sketch of the file-descriptor limit change is shown after this list
- and this other issue, which turns out to solve the problem
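For completeness, a hedged sketch of the file-descriptor limit mitigation mentioned above (Linux only; the appropriate limit depends on your system):

import resource

# Raise this process's soft limit on open file descriptors up to the hard
# limit allowed by the OS (the Python equivalent of `ulimit -n`).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))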
As suggested by @AlexMeredith, the error may be linked to the distributed filesystem (Lustre) that some clusters use. The error may also come from distributed shared memory.
In this example only the dataset implementation is shown; there is no snippet showing what happens with the batches.
In my case, I was storing the batches in an index array-like object, which has fortunately been described here. Because of that, the DataLoader could not close its worker subprocesses. Implementing something similar to the following helped me solve the problem.
import copy

for batch in data_loader:
    batch_cp = copy.deepcopy(batch)
    del batch
    index.append(batch_cp["index"])
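This helps because the tensors produced by the DataLoader workers live in shared memory; holding references to them keeps that shared storage (and its file descriptors) alive, so copying what you need and deleting the original batch releases it. A lighter-weight variant (my suggestion, not from the original answer, assuming batch["index"] is a tensor) is to clone just the field you keep:

for batch in data_loader:
    # .clone() allocates fresh, non-shared storage, so no reference to the
    # worker-owned shared-memory tensor survives past the loop iteration.
    index.append(batch["index"].clone())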
I also got other errors related to this one, such as:
- received 0 items of ancdata
- bad message length