How do data scientists train a single machine learning model using multiple worker nodes, such as GPU and CPU machines?
Assume we have a parameter server that holds the model parameters, and many workers. In one iteration, each worker receives the model parameters from the parameter server and a chunk of data from elsewhere. The workers compute gradients and send them back to the parameter server. The parameter server waits until it has received gradients from all workers, then updates the model parameters, and the next iteration begins. This straightforward approach is called synchronous stochastic gradient descent (SSGD), which looks good at first glance.
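To make the flow concrete, here is a minimal sketch of one synchronous iteration on the parameter server, written in plain Python with NumPy. The worker interface (a `compute_gradient` method) and the sequential loop are my own simplifications for illustration; in a real system the workers run in parallel and communicate over the network.

```python
import numpy as np

def ssgd_iteration(params, workers, data_chunks, lr=0.01):
    """One synchronous SGD iteration on the parameter server.

    Each worker gets the current parameters and a data chunk, computes a
    gradient, and the server waits for ALL gradients before applying a
    single averaged update (the "synchronous" part).
    """
    gradients = []
    for worker, chunk in zip(workers, data_chunks):
        # Hypothetical worker call; in practice this is a remote,
        # parallel computation rather than a local loop.
        gradients.append(worker.compute_gradient(params, chunk))

    # The server blocks until every gradient has arrived,
    # then averages them and updates the shared parameters.
    avg_grad = np.mean(gradients, axis=0)
    return params - lr * avg_grad
```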
However, there are actually three bottlenecks in this approach.
First, the batch size imposes an upper bound on data partitioning: experiments have shown that batch sizes larger than 64 often require extra time for the model to converge. Thus, if the batch size is set to 64, at most 64 workers can be employed.
Second, network bandwidth demand peaks when all workers send their gradients to the parameter server within a short window, which likely causes delays.
Third, and probably the most worrying, workers vary in performance and finish times. The parameter server therefore has to wait an indefinite amount of time before it has all the gradients and can update the parameters, and faster workers sit idle waiting for slower ones to finish.
These three issues are inherent to the structure of SSGD.
Researchers then came up with asynchronous stochastic gradient descent (ASGD), in which the parameter server updates the model parameters as soon as it receives a result from any worker and sends the new parameters back to that worker so it can move on. This approach solves the three issues of SSGD but may introduce a serious problem, gradient staleness, since gradients computed by workers are often based on obsolete parameters. Gradient staleness has been shown to significantly deteriorate the performance of the final model.
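The contrast with SSGD is easiest to see on the server side. Below is a rough sketch of the asynchronous update, again using a hypothetical worker interface (`send_params` is an assumed helper, not a real library call); the key difference is that the update is applied immediately, without waiting for the other workers.

```python
def asgd_on_gradient_received(params, grad, worker, lr=0.01):
    """Server-side handler for ASGD, called whenever any single worker
    reports a gradient.

    The gradient may have been computed against parameters that are
    already out of date, because other workers have updated the server
    in the meantime -- this is the gradient staleness problem.
    """
    params = params - lr * grad   # update right away, no barrier
    worker.send_params(params)    # hypothetical call: ship fresh parameters back
    return params
```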
Professor Assaf presented a work called distributed adaptive NAG ASGD. In the proposed mechanism, the parameter server sends out a prediction of the parameters instead of the current parameters, so that workers compute gradients on parameters closer to the ones the server will actually hold when the gradients arrive.
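I don't know the exact formula used in the paper, but the idea resembles a Nesterov-style lookahead: the server extrapolates where the parameters are heading and sends that prediction to the worker. A rough sketch of what such a prediction could look like (my own interpretation, with an assumed momentum buffer `velocity` and coefficient `gamma`):

```python
def predicted_params(params, velocity, gamma=0.9):
    """Nesterov-style lookahead (illustrative only, not the paper's exact rule):
    extrapolate the parameters along the accumulated update direction before
    sending them to a worker, so the worker's gradient is less stale."""
    return params - gamma * velocity
```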
In my opinion, ASGD has its place in the machine learning world and suits researchers who want results faster. But for those who seek the best-performing model, go for SSGD!
Another topic to talk about is model decomposition, which is needed when a worker's RAM cannot hold the entire set of model parameters.
The PipeDream algorithm borrowed the idea of CPU instruction pipelining and applied it to model decomposition. However, a staleness issue appears once back-propagation starts and the parameters change while other microbatches are still in flight. Professor Assaf presented a work called Fine Tuning Pipe (FTPipe) to address this.
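As a rough illustration of model decomposition itself (not PipeDream's actual scheduling or FTPipe's refinements), a model's layers can be split into stages placed on different devices so that no single device has to hold all the parameters. A PyTorch-style sketch, assuming two GPUs are available and using made-up layer sizes:

```python
import torch.nn as nn

class TwoStageModel(nn.Module):
    """Naive two-stage decomposition: the first half of the layers lives on
    one GPU, the second half on another. PipeDream additionally pipelines
    microbatches through such stages; that scheduling is not shown here."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.stage2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x):
        x = self.stage1(x.to("cuda:0"))      # runs on the first device
        return self.stage2(x.to("cuda:1"))   # activations move to the second device
```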