How to set the random seed in distributed training in PyTorch?
Question:
Now I am training a model using torch.distributed, but I am not sure how to set the random seeds. For example, this is my current code:
def main():
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)
    cudnn.enabled = True
    cudnn.benchmark = True
    cudnn.deterministic = True
    mp.spawn(main_worker, nprocs=args.ngpus, args=(args,))
Should I move

np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)
cudnn.enabled = True
cudnn.benchmark = True
cudnn.deterministic = True

into the function main_worker() to make sure every process gets the correct seed and cudnn settings? By the way, I have tried this, and it makes training about 2 times slower, which really confuses me.
Thank you very much for any help!
Answers:
The spawned child processes do not inherit the seed you set manually in the parent process, so you need to set the seed in the main_worker function.
The same logic applies to cudnn.benchmark and cudnn.deterministic: if you want to use these, you have to set them in main_worker as well. To verify this, you can simply print their values in each process.
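A minimal sketch of what that could look like, assuming main_worker receives the process index and an args namespace with a seed attribute (as in the mp.spawn call above):

```python
import random

import numpy as np
import torch
import torch.backends.cudnn as cudnn


def main_worker(gpu, args):
    # Each spawned process must seed itself; seeds and cudnn flags
    # set in the parent process are not inherited.
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    torch.cuda.manual_seed(args.seed)  # silently ignored if CUDA is unavailable
    cudnn.enabled = True
    cudnn.benchmark = True
    cudnn.deterministic = True
    # Sanity check: print the settings from every process.
    print(gpu, torch.initial_seed(), cudnn.benchmark, cudnn.deterministic)
    # ... build the model, dataloaders, and training loop here ...
```

You could also pass args.seed + gpu instead if you want the data-augmentation streams to differ per process while keeping each run reproducible.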
cudnn.benchmark = True tries to find the optimal algorithm for your model by benchmarking various implementations of certain operations (e.g. the available convolution algorithms). Finding the best algorithm takes time, but once that is done, further iterations will potentially be faster. However, the algorithm that was determined to be the best only applies to the specific input size that was used. If the next iteration has a different input size, the benchmark needs to run again to determine the best algorithm for that input size, which might be a different one than for the first.
I'm assuming that your input sizes vary, which would explain the slowdown, as the benchmark wasn't actually used when the flag was set in the parent process. cudnn.benchmark = True should only be used if your input sizes are fixed.
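That decision could be made explicit in main_worker; here fixed_input_sizes is a hypothetical flag you would set based on your own data pipeline:

```python
import torch.backends.cudnn as cudnn

# Assumption: you know from your data pipeline whether every batch
# has the same shape (e.g. all images resized to 224x224).
fixed_input_sizes = True

# Benchmark mode only pays off when its result can be reused, i.e. when
# every batch has the same shape; with varying shapes the repeated
# re-benchmarking cost dominates and training slows down.
cudnn.benchmark = fixed_input_sizes
```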
cudnn.deterministic = True may also have a negative impact on performance, because certain non-deterministic operations need to be replaced with deterministic versions, which tend to be slower (otherwise the deterministic version would have been used in the first place). That performance impact shouldn't be too dramatic, though.
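One way to check that your setup is actually reproducible is to run the same seeded computation twice and compare the outputs. A small sketch (note that on CPU this particular check passes regardless of the cudnn flag, since cudnn is only exercised on GPU):

```python
import torch
import torch.backends.cudnn as cudnn

cudnn.deterministic = True


def run_once(seed):
    # Re-seed, rebuild the layer, and rerun the same input.
    torch.manual_seed(seed)
    conv = torch.nn.Conv2d(3, 8, kernel_size=3)
    x = torch.randn(1, 3, 16, 16)
    return conv(x)


out1 = run_once(0)
out2 = run_once(0)
# With identical seeds and deterministic ops, results match bit-for-bit.
assert torch.equal(out1, out2)
```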