Intuition behind Distributed Training: Busy Engineer Edition
Distributed training seems complex at first with all the deep learning jargon, but it's actually pretty intuitive... let's understand it from scratch
GM Busy Engineers 🫡
This article is a bit different… I am not covering a tech blog breakdown. Instead, I am explaining the intuition behind one of my own projects that went micro-viral and seemingly kickstarted the Mac GPU cluster revolution.
Distributed training is traditionally done on a group of GPU nodes (1 node = a group of GPUs), but my simple project demonstrates it across consumer laptops. Keep in mind, this is not meant for production use at all. With that, let's dive right in!
Quick background on Deep Learning (DL)
The AI/ML bubble has gotten most of us interested in the space, but do we really understand DL at a conceptual level?
The whole point of DL is to learn / approximate a function F given input X and labels Y: F(X) = Y. For example, to classify whether an image contains a hot dog or not, X = images of hot dogs and other objects, and Y = an array of 1s and 0s indicating which images have hot dogs. We use X and Y to train F, and after training, we use F on new X (unseen data) to infer Y (predictions).
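To make that concrete, here's a tiny sketch of what X and Y could look like for the hot dog example (the shapes and random data are made up purely for illustration):

```python
import numpy as np

# Hypothetical hot dog dataset: 1,000 RGB images of size 64x64.
X = np.random.rand(1000, 64, 64, 3).astype(np.float32)   # input images
Y = np.random.randint(0, 2, size=1000)                    # 1 = hot dog, 0 = not

# Training uses (X, Y) to learn F so that F(X) ≈ Y;
# inference applies the learned F to new images to predict their labels.
```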
Once we have a rough idea of what F looks like (architecture-wise), learning F comes down to three steps that are repeated in a loop:
Forward pass - Run the current, not-yet-optimized model on a single data batch (a group of data points) to see what it predicts. Think of this as making a guess with our current knowledge.
Backward pass - Looking at how wrong our guesses were (called loss) to figure out which parts of our model contributed most to the total error.
Weight update - Tweaking the model's parameters (weights) based on what we learned from our mistakes. Small adjustments are made to make the model slightly better at the task.
Out of the three, can you guess the most computationally expensive part? That's right - the Backward Pass! It involves complex linear algebra, which makes it very calculation-heavy.
Credit: Nomidl.com
For future reference, this is what single GPU model training looks like:
Fig. 1: Single machine training loop
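In code, the loop in Fig. 1 looks roughly like the PyTorch-style sketch below (the tiny model, random data, and hyperparameters are placeholders for illustration, not anything specific to my project):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; any dataset/architecture slots in here.
X = torch.rand(1000, 3, 64, 64)
Y = torch.randint(0, 2, (1000,))
dataloader = DataLoader(TensorDataset(X, Y), batch_size=32)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x_batch, y_batch in dataloader:
    logits = model(x_batch)          # 1. forward pass: make a guess
    loss = loss_fn(logits, y_batch)  #    measure how wrong the guess was
    optimizer.zero_grad()
    loss.backward()                  # 2. backward pass: which weights caused the error?
    optimizer.step()                 # 3. weight update: small adjustment to the weights
```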
So, what's the deal with distributed training?
In classical computing, when we have a compute-bound task, the two ways to increase throughput are to get a more powerful processor and/or to use multiple processors (vertical vs. horizontal scaling).
Similarly, to perform more computations in a GPU setting, we use more powerful GPUs and/or connect more of them over PCIe/NVLink (super-fast interconnects). GPUs vary in memory (VRAM) and floating-point operations per second.
One fact to keep in mind: when multiple machines are involved, network costs add up quickly. Companies like NVIDIA spend a lot of research dollars on better interconnects (the links between GPUs) to shave down network latency and increase network bandwidth.
Distributed training is used when the model is bigger than a single GPU's memory, or to speed up training by parallelizing across GPUs. The two main approaches are:
Data Parallelism
In this approach, you split your dataset into many batches; each machine trains on its own assigned data batch and calculates new weights, then the weights from all machines are averaged and the averaged weights are set on every machine.
Fig. 2: Training loop with data parallelism
When you compare Fig. 1 and 2, the two key differences are that each training loop now consumes two data batches at the same time, and both machines end up with the new averaged weights. The averaging-and-updating step is called reduction, and there are multiple ways to perform it.
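As a toy illustration of reduction (plain NumPy, ignoring how the tensors actually travel between machines), averaging per-worker updates looks like this:

```python
import numpy as np

def reduce_by_averaging(per_worker_updates):
    """Element-wise average of each worker's update, layer by layer."""
    # per_worker_updates: one list of arrays per worker (gradients or weights)
    return [np.mean(layer_updates, axis=0)
            for layer_updates in zip(*per_worker_updates)]

# Two workers, each holding updates for two layers (toy shapes).
worker_a = [np.ones((2, 2)), np.ones(3)]
worker_b = [3 * np.ones((2, 2)), 3 * np.ones(3)]
averaged = reduce_by_averaging([worker_a, worker_b])  # every entry is 2.0
```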
Model Parallelism
When a large model can't fit on one GPU, we split the model itself (there are multiple ways to do this) and assign each part to a GPU / machine.
Fig. 3: Training loop with Model parallelism
As shown, the forward and backward pass of the full model happen across multiple machines sequentially (think about the network costs 😱).
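Here's a toy PyTorch-style sketch of the idea, with two model halves standing in for two machines (in a real setup the activations would be serialized and sent over the network at the split point, which is exactly where the cost shows up):

```python
import torch
import torch.nn as nn

# One model split into two stages; imagine each stage on a different machine.
stage_1 = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
stage_2 = nn.Linear(128, 2)

x = torch.rand(32, 3, 64, 64)

# Forward pass crosses the "machine" boundary: stage_1's activations would be
# shipped over the network before stage_2 can run on them.
activations = stage_1(x)
logits = stage_2(activations)

# The backward pass crosses back the other way, so every training step pays
# the network round trip at each split point.
```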
Implementation: the fun part
You can have a look at my reference code here; the README should make everything clear. The code simply implements a generic training loop with data parallelism in Python, with the network calls done over gRPC.
I opted for a Master-Slave (Leader-Learner) architecture where all the sync steps are performed on the leader. These are the different steps performed (a stripped-down Python sketch follows below):
The user adds their model to ./model_artifacts and data under ./data, then starts the leader and learners on the same network with the right arguments.
Once the specified # of learners join, training automatically starts
Each learner gets unique batches of data from the leader
All learners compute gradients on their local batches and send them back to the leader
The leader averages these gradients, applies them to the global model, and broadcasts the updated model back to all learners
This happens in a loop until all the data is consumed
Fig. 4: Different steps in a training loop
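Stripped of the gRPC plumbing, the leader's side of this loop boils down to something like the sketch below. The Learner interface (compute_gradients, set_weights) is hypothetical shorthand for the actual gRPC calls; see the repo for the real thing.

```python
import numpy as np

def leader_training_loop(global_weights, data_batches, learners, lr=0.01):
    """Hypothetical sketch of the leader's role; the real version speaks gRPC."""
    n = len(learners)
    for i in range(0, len(data_batches), n):
        # 1. Hand each learner its own unique batch.
        assigned = data_batches[i:i + n]
        # 2. Every learner computes gradients on its local batch.
        grads = [learner.compute_gradients(global_weights, batch)
                 for learner, batch in zip(learners, assigned)]
        # 3. The leader averages the gradients (the reduction step)...
        avg_grads = [np.mean(g, axis=0) for g in zip(*grads)]
        # 4. ...applies them to the global model...
        global_weights = [w - lr * g for w, g in zip(global_weights, avg_grads)]
        # 5. ...and broadcasts the updated weights back to every learner.
        for learner in learners:
            learner.set_weights(global_weights)
    return global_weights
```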
Final Notes
While gRPC streaming helps, the big limitation of this setup is that syncing gradients on every data batch is extremely inefficient. I ran a training run over a 200 Mbps network with 3 M1 MacBooks, and it took a LONG time to finish.
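To get a feel for why, here's a back-of-envelope calculation with a made-up model size (25M parameters in float32) over that same 200 Mbps link:

```python
# Back-of-envelope cost of syncing gradients on every single batch.
params = 25_000_000                  # hypothetical 25M-parameter model
bytes_per_sync = params * 4          # float32 gradients ≈ 100 MB per learner
link_bits_per_sec = 200e6            # 200 Mbps network
seconds_per_sync = bytes_per_sync * 8 / link_bits_per_sec
print(f"~{seconds_per_sync:.0f}s of transfer per learner, per batch")  # ~4s
# ...and the updated weights travel back the other way, roughly doubling it.
```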
Please note that for the sake of brevity and keeping concepts simple, I have omitted some background information.
Wanna dive deeper? The repository is open for contributions, and I'm particularly interested in seeing how people adapt it for different use cases. Whether you have questions, ideas, or run into interesting challenges, feel free to reach out or open an issue on GitHub.
Stay bizzy!