Many of the deep learning problems that industry is trying to tackle are complex. While it is relatively easy to build an early proof of concept (POC) of a system, it takes a huge amount of effort to build a solution that meets all functional and non-functional requirements.
For example, it’s straightforward to build a POC for a self-driving vehicle that will drive across a small number of streets with human supervision. On the other hand, building a self-driving car that is robust and safe is an engineering feat requiring petabytes of data for training and validation.
In this session we tackle the key challenges faced when developing complex deep learning systems. We focus on the algorithmic challenges involved in large-scale training, such as distributed training algorithms and the degradation of performance associated with large batch sizes, and on the engineering challenges involved in designing and utilizing large-scale training.