distributed

importing dask_cuda results in parse_memory_limit error

importing dask_cuda results in parse_memory_limit error Question: I’m trying to import dask_cuda as in the examples: from dask_cuda import LocalCUDACluster from dask.distributed import Client But I receive the following error: ImportError Traceback (most recent call last) Input In [3], in <cell line: 1>() ---> 1 from dask_cuda import LocalCUDACluster File ~/miniconda3/lib/python3.8/site-packages/dask_cuda/__init__.py:5, in <module> 2 import dask.dataframe.shuffle …

Total answers: 1
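This ImportError usually means the installed dask/distributed releases don't match what dask-cuda expects (parse_memory_limit moved between modules in newer dask releases). A common remedy, assuming a pip-managed environment, is to upgrade the three packages together so their versions agree:

```shell
# Upgrade dask, distributed, and dask-cuda together; mismatched
# releases are the usual cause of this ImportError.
pip install --upgrade dask distributed dask-cuda
```

In a conda environment, the equivalent is to update the same three packages from the rapidsai/conda-forge channels.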

How to set the random seed for distributed training in PyTorch?

How to set the random seed for distributed training in PyTorch? Question: I am training a model using torch.distributed, but I am not sure how to set the random seeds. For example, this is my current code: def main(): np.random.seed(args.seed) torch.manual_seed(args.seed) torch.cuda.manual_seed(args.seed) cudnn.enabled = True cudnn.benchmark = True cudnn.deterministic = True mp.spawn(main_worker, …

Total answers: 1
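A common pattern (a sketch of one widely used convention, not the only valid scheme; `set_seed` is an illustrative name) is to derive each process's seed from a base seed plus its rank, and to seed every RNG inside each spawned worker rather than only in the parent process:

```python
import random

import numpy as np
import torch


def set_seed(base_seed: int, rank: int) -> None:
    """Seed every RNG a worker uses, offset by the worker's rank.

    A distinct seed per rank keeps e.g. data augmentation decorrelated
    across workers, while the shared base_seed keeps runs repeatable.
    """
    seed = base_seed + rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # CPU RNG
    torch.cuda.manual_seed_all(seed)  # all GPU RNGs (no-op without CUDA)
```

Calling this at the top of each spawned worker (e.g. at the start of `main_worker(rank, ...)`, before datasets and models are built) ensures the seeding actually takes effect in every process, not just the one that called `mp.spawn`.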

How does asynchronous training work in distributed Tensorflow?

How does asynchronous training work in distributed Tensorflow? Question: I’ve read the Distributed TensorFlow docs, and they mention that in asynchronous training, each replica of the graph has an independent training loop that executes without coordination. From what I understand, if we use a parameter-server architecture with data parallelism, it means each worker computes gradients and updates …

Total answers: 3
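The key point is that workers apply updates to the shared parameters as they finish each step, without waiting on one another. This toy simulation (plain Python threads, not TensorFlow) illustrates the idea: each "worker" reads the parameters and pushes its update independently; the lock only makes each individual update atomic, the way a parameter server serializes writes, and never forces workers to wait for each other's steps:

```python
import threading

# Shared "parameter server" state: one parameter, updated asynchronously.
params = [0.0]
lock = threading.Lock()


def worker(steps: int) -> None:
    for _ in range(steps):
        # Each worker computes a local "gradient" (a fixed stand-in here)
        # and applies it immediately, with no barrier across workers.
        grad = 0.1
        with lock:  # the server applies each update atomically
            params[0] -= 0.01 * grad


threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In real asynchronous training the gradients are computed from possibly stale parameter values, which is exactly the staleness trade-off the question is getting at: updates interleave in arbitrary order rather than being aggregated per step as in synchronous training.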

TensorFlow placement algorithm

TensorFlow placement algorithm Question: I would like to know when the placement algorithm of TensorFlow (as described in the white paper) is actually employed. All examples for distributing TensorFlow that I have seen so far seem to specify manually which devices the nodes should be executed on, using tf.device(). Asked By: PaulWen || Source Answers: The …

Total answers: 1
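For context, tf.device() is an override, not a replacement: ops created inside its scope are pinned to the named device, while any op created without such a scope is placed by TensorFlow's automatic placement algorithm. A minimal sketch (TF 2.x eager mode):

```python
import tensorflow as tf

# Inside the scope, placement is forced to the CPU; ops created outside
# any tf.device() scope are placed by the automatic placement algorithm.
with tf.device('/CPU:0'):
    x = tf.constant([1.0, 2.0]) * 2.0

print(x.device)  # a device string containing 'CPU:0'
```

So the examples the question mentions are simply constraining the placer; the algorithm still runs for everything left unconstrained.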

Distributing jobs over multiple servers using python

Distributing jobs over multiple servers using python Question: I currently have an executable that, when run, uses all the cores on my server. I want to add another server and have the jobs split between the two machines, with each job still using all the cores on the machine it is running on. If both machines …

Total answers: 2