How does asynchronous training work in distributed TensorFlow?

Question:

I’ve read the Distributed TensorFlow documentation, and it mentions that in asynchronous training,

each replica of the graph has an independent training loop that executes without coordination.

From what I understand, if we use the parameter-server architecture with data parallelism, each worker computes gradients and updates its own weights without caring about the other workers’ updates when training the distributed neural network. As all weights are shared on the parameter server (ps), I think the ps still has to coordinate (or aggregate) the weight updates from all workers in some way. I wonder how this aggregation works in asynchronous training. Or, in more general words, how does asynchronous training work in distributed TensorFlow?

Asked By: Ruofan Kong


Answers:

Looking at the example in the documentation you link to:

with tf.device("/job:ps/task:0"):
  weights_1 = tf.Variable(...)
  biases_1 = tf.Variable(...)

with tf.device("/job:ps/task:1"):
  weights_2 = tf.Variable(...)
  biases_2 = tf.Variable(...)

with tf.device("/job:worker/task:7"):
  input, labels = ...
  layer_1 = tf.nn.relu(tf.matmul(input, weights_1) + biases_1)
  logits = tf.nn.relu(tf.matmul(layer_1, weights_2) + biases_2)
  # ...
  train_op = ...

with tf.Session("grpc://worker7.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)

You can see that the training is distributed across three machines, all of which share a copy of identical weights; but, as is mentioned just below the example:

In the above example, the variables are created on two tasks in the ps job, and the compute-intensive part of the model is created in the worker job. TensorFlow will insert the appropriate data transfers between the jobs (from ps to worker for the forward pass, and from worker to ps for applying gradients).

In other words, one GPU is used to calculate the forward pass and then transmits the results to the other two machines, while each of those machines calculates the backpropagation for a part of the weights and then sends the results to the other machines so they can all update their weights appropriately.

GPUs are used to speed up matrix multiplications and other parallel mathematical operations, which are very intensive for both the forward pass and backpropagation. Distributed training simply means that you distribute these operations across many GPUs. The model is still synced between the machines, but now the backpropagation for different weights can be calculated in parallel, and the forward pass on a new mini-batch can be calculated at the same time as the backpropagation for the previous mini-batch is still being computed. Distributed training does not mean that you have totally independent models and weights on each machine.
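As a small illustration (TF 1.x; the device string, shapes, and session address are hypothetical), you can pin a compute-heavy op to a particular GPU on a particular task:

import tensorflow as tf

# Pin a compute-heavy op to the first GPU of worker task 0
# (hypothetical topology; the device string is illustrative).
with tf.device("/job:worker/task:0/device:GPU:0"):
  a = tf.random_normal([1024, 1024])
  b = tf.random_normal([1024, 1024])
  c = tf.matmul(a, b)  # the matmul runs on that GPU

with tf.Session("grpc://worker0.example.com:2222") as sess:  # hypothetical address
  sess.run(c)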

Answered By: patapouf_ai

In asynchronous training there is no synchronization of weights among the workers. The weights are stored on the parameter server. Each worker loads and changes the shared weights independently of the others. This way, if one worker finishes an iteration faster than the other workers, it proceeds with the next iteration without waiting. The workers only interact with the shared parameter server and don’t interact with each other.
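A minimal TF 1.x sketch of this pattern (between-graph replication; the cluster addresses, model, and random stand-in data are all hypothetical):

import numpy as np
import tensorflow as tf

# Hypothetical cluster: one parameter server, two workers.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
task_index = 0  # each worker process gets its own index
server = tf.train.Server(cluster, job_name="worker", task_index=task_index)

# replica_device_setter places the variables on the ps job and
# everything else on this worker.
with tf.device(tf.train.replica_device_setter(
    worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
  x = tf.placeholder(tf.float32, [None, 784])
  y = tf.placeholder(tf.float32, [None, 10])
  w = tf.Variable(tf.zeros([784, 10]))
  b = tf.Variable(tf.zeros([10]))
  loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
      labels=y, logits=tf.matmul(x, w) + b))
  train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Each worker runs its own loop against the shared variables on ps;
# there is no coordination with the other workers.
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0)) as sess:
  for _ in range(10000):
    batch_x = np.random.rand(32, 784).astype(np.float32)  # stand-in data
    batch_y = np.eye(10, dtype=np.float32)[np.random.randint(10, size=32)]
    sess.run(train_op, feed_dict={x: batch_x, y: batch_y})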

Overall, it can (depending on the task) speed up the computation significantly. However, the results are sometimes worse than those obtained with the slower synchronous updates.

Answered By: BlueSun

When you train asynchronously in Distributed TensorFlow, a particular worker does the following:

  1. The worker reads all of the shared model parameters in parallel from the PS task(s), and copies them to the worker task. These reads are uncoordinated with any concurrent writes, and no locks are acquired: in particular the worker may see partial updates from one or more other workers (e.g. a subset of the updates from another worker may have been applied, or a subset of the elements in a variable may have been updated).

  2. The worker computes gradients locally, based on a batch of input data and the parameter values that it read in step 1.

  3. The worker sends the gradients for each variable to the appropriate PS task, where they are applied to their respective variable using an update rule that is determined by the optimization algorithm (e.g. SGD, SGD with Momentum, Adagrad, Adam, etc.). The update rules typically use (approximately) commutative operations, so they may be applied independently on the updates from each worker, and the state of each variable will be a running aggregate of the sequence of updates received. (Steps 2 and 3 are sketched in code after this list.)
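Steps 2 and 3 map onto the two halves of the TF 1.x Optimizer API; a minimal sketch (the variable and loss are hypothetical):

import tensorflow as tf

w = tf.Variable(tf.zeros([10]))           # lives on a PS task in a cluster
loss = tf.reduce_sum(tf.square(w - 1.0))  # hypothetical loss

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Step 2: compute gradients locally from the (possibly stale)
# parameter values read in step 1.
grads_and_vars = opt.compute_gradients(loss)

# Step 3: the update ops run on the devices that hold the variables,
# i.e. the PS tasks, using the optimizer's update rule.
train_op = opt.apply_gradients(grads_and_vars)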

In asynchronous training, the updates from the workers are applied concurrently, and they may be somewhat coordinated if the optional use_locking=True flag was set when the respective optimizer (e.g. tf.train.GradientDescentOptimizer) was initialized. Note, however, that the locking here only provides mutual exclusion between two concurrent updates to the same variable, and (as noted above) reads do not acquire locks; the locking does not provide atomicity across the entire set of updates.
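For example, in TF 1.x the flag is passed when the optimizer is constructed:

import tensorflow as tf

# use_locking=True serializes concurrent updates to the same variable;
# reads of the variable still do not acquire the lock.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.01, use_locking=True)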

(By contrast, in synchronous training, a utility like tf.train.SyncReplicasOptimizer will ensure that all of the workers read the same, up-to-date values for each model parameter; and that all of the updates for a synchronous step are aggregated before they are applied to the underlying variables. To do this, the workers are synchronized by a barrier, which they enter after sending their gradient update, and leave after the aggregated update has been applied to all variables.)
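A sketch of the synchronous variant (TF 1.x; the replica count, loss, and local session are hypothetical stand-ins for a real cluster, where each worker would run this against its own server.target):

import tensorflow as tf

num_workers = 4   # hypothetical number of replicas
is_chief = True   # task_index == 0 in a real cluster

w = tf.Variable(tf.zeros([10]))
loss = tf.reduce_sum(tf.square(w - 1.0))  # hypothetical loss
global_step = tf.train.get_or_create_global_step()

# Gradients from all replicas are accumulated and averaged before
# a single update is applied to the variables.
opt = tf.train.SyncReplicasOptimizer(
    tf.train.GradientDescentOptimizer(0.1),
    replicas_to_aggregate=num_workers,
    total_num_replicas=num_workers)
train_op = opt.minimize(loss, global_step=global_step)

# The hook implements the barrier described above: workers block
# until the aggregated update has been applied.
sync_hook = opt.make_session_run_hook(is_chief)
with tf.train.MonitoredTrainingSession(hooks=[sync_hook]) as sess:
  for _ in range(100):
    sess.run(train_op)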

Answered By: mrry