How does pytorch's parallel method and distributed method work?


That's a great question.
The PyTorch DataParallel paradigm is actually quite simple and the implementation is open-sourced here. Note that this paradigm is not recommended today, as it bottlenecks at the master GPU and is not efficient in data transfer.

This container parallelizes the application of the given :attr:`module` by splitting the input across the specified devices by chunking in the batch dimension (other objects will be copied once per device). In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.
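For completeness, here is a minimal sketch of how DataParallel is typically used (the model, shapes and device handling are placeholders I made up for illustration):

```python
import torch
import torch.nn as nn

# Toy model; DataParallel wraps any nn.Module.
model = nn.Linear(10, 2)

device = "cuda" if torch.cuda.is_available() else "cpu"
if torch.cuda.device_count() > 1:
    # On the forward pass, the module is replicated onto all visible GPUs
    # and the input batch is split (chunked along dim 0), one chunk per GPU.
    model = nn.DataParallel(model)
model = model.to(device)

x = torch.randn(64, 10, device=device)   # a batch of 64 gets chunked across GPUs
out = model(x)                            # outputs are gathered back on the master GPU
loss = out.sum()
loss.backward()                           # replica gradients are summed into the original module's parameters
```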

As for DistributedDataParallel, that's more tricky. This is currently the more advanced approach and it is quite efficient (see here).

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.
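A minimal DistributedDataParallel sketch might look like this (I'm using the gloo backend on CPU only so the snippet is self-contained; the address, port, model and shapes are placeholders, and on GPUs you would normally use the NCCL backend and pass device_ids=[rank]):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each rank is a separate process; they rendezvous via MASTER_ADDR/MASTER_PORT.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(10, 1))            # on GPU: DDP(model.to(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    x, y = torch.randn(8, 10), torch.randn(8, 1)   # each rank works on its own shard of the data
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                # gradients are all-reduced (averaged) across ranks here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```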

There are several approaches to how to average the gradients from each node. I would recommend this paper [1] to get a real sense of how things work. Generally speaking, there is a trade-off between transferring the data from one GPU to another, regarding bandwidth and speed, and we want that part to be really efficient. So one possible approach is to connect each pair of GPUs with a really fast link in a ring, and to pass only part of the gradients from one to another, such that in total we transfer less data, more efficiently, and all the nodes end up with all the gradients (or at least their average); see the sketch below. There will still be a master GPU in that situation, or at least a process, but now there is no bottleneck on any single GPU; they all transfer roughly the same amount of data.
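To make the ring idea a bit more concrete, here is a toy single-process simulation of ring all-reduce (a reduce-scatter phase followed by an all-gather phase). This is only my illustration of the communication pattern; it is not how PyTorch/NCCL actually implement it:

```python
# N "nodes" each hold a gradient vector; after the two phases every node holds
# the average, and each node only ever forwards 1/N of the vector per step.
import numpy as np

def ring_allreduce_average(grads):
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]

    # Phase 1: reduce-scatter. After n-1 steps, node i holds the complete sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n                      # chunk node i forwards this step
            dst = (i + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[i][c]

    # Phase 2: all-gather. Each node passes its completed chunk around the ring
    # so that every node ends up with every reduced chunk.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n                  # completed chunk to pass on
            dst = (i + 1) % n
            chunks[dst][c] = chunks[i][c].copy()

    return [np.concatenate(node_chunks) / n for node_chunks in chunks]

if __name__ == "__main__":
    grads = [np.arange(8) * (rank + 1) for rank in range(4)]   # 4 fake nodes
    averaged = ring_allreduce_average(grads)
    expected = sum(grads) / 4
    assert all(np.allclose(a, expected) for a in averaged)
    print(averaged[0])
```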

Now, this can be further optimized if we don't wait for all the nodes to finish computing their gradients and instead do a time-sharing thing where each node sends its portion as soon as it is ready. Don't hold me to the details, but it turns out that if we don't wait for everything to end and do the averaging as soon as we can, it can also speed up the gradient averaging.
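As a rough illustration of the "average as soon as we can" idea, torch.distributed exposes non-blocking collectives via async_op=True. Real DDP overlaps these with the backward pass by hooking into gradient buckets; the sketch below just launches them after backward, in a single process (world_size=1), so that it stays runnable on its own:

```python
import torch
import torch.distributed as dist

# Single-process "group" just so the snippet runs as-is; in real training each
# rank is its own process and the backend is typically NCCL on GPUs.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29501",   # placeholder address/port
    rank=0,
    world_size=1,
)

model = torch.nn.Sequential(torch.nn.Linear(10, 10), torch.nn.Linear(10, 1))
loss = model(torch.randn(4, 10)).sum()
loss.backward()

# Launch one asynchronous all-reduce per parameter gradient; each call returns
# a work handle immediately instead of blocking.
handles = [dist.all_reduce(p.grad, async_op=True) for p in model.parameters()]

# ...in principle, other computation could run here while communication is in flight...

for h in handles:
    h.wait()                                 # make sure communication has finished
for p in model.parameters():
    p.grad /= dist.get_world_size()          # turn the summed gradients into an average

dist.destroy_process_group()
```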

Please refer to the literature for more information about this area, as it is still developing (as of today).

PS 1: Usually this kind of distributed training works better on machines that are set up for that task, e.g. AWS deep learning instances that implement those protocols in hardware.

PS 2: Disclaimer: I really don't know which protocol the PyTorch devs chose to implement, or what is chosen according to what. I work with distributed training and prefer to follow PyTorch's best practices without trying to outsmart them. I recommend you do the same unless you are really into researching this area.

References:

[1] Distributed Training of Deep Learning Models: A Taxonomic Perspective