Data-parallel training using synchronous SGD is one of the most popular distribution techniques for deep learning models. This method, however, is heavily communication-bound, since it has to synchronize a huge amount of data on every iteration of the training process. P3 is a model-agnostic, easy-to-implement, and efficient parameter synchronization mechanism that reduces this communication overhead by taking advantage of the domain-specific nature of the SGD algorithm.
This project aims to standardize the performance analysis of deep learning training workloads by proposing an extensive benchmark suite with eight state-of-the-art archetypal deep neural network (DNN) models from six major application domains. The applications in this suite were selected based on extensive conversations with ML developers and users from both industry and academia. The project is actively maintained with contributions and feedback from the community.
GPU manufacturers are building more powerful and specialized hardware to cater to the ever-increasing demand for computing power in deep neural network (DNN) training. The goal of this study is to analyze how well these modern GPUs handle the myriad DNN-based applications currently popular in the community. In this study, we focus on benchmarking, analyzing, and comparing the performance of 8 popular GPUs using 8 archetypal DNN models.
- Dynamic Scheduling for Model Parallel DNN Training
- Recent trends indicate that DNNs are growing in both parameter count and complexity. For example, models like Mixture-of-Experts have hundreds of billions of parameters and dynamic control flow. Training such models with the static device placement and memory allocation mechanisms used in current distribution techniques could cause heavy over-provisioning of device memory and wasted compute cycles. We are trying to address this problem by dynamically scheduling the computation graph based on runtime information.
- Debugging Machine Learning Models
- In contrast to regular programs, ML code is much harder to debug, especially for models with complex architectures. Most previous research tries to detect the presence of bugs and explain model predictions by tracing back to the input data. However, it is hard to find the source of a bug in the implementation using these methods. This project is investigating the possibility of formally verifying a model by matching the model specification to the implementation.
- Iroko: RL-based solution for data center congestion control
- A centralized, reinforcement learning (RL) based congestion control system for tightly controlled data center networks. Built a data center emulator on Mininet and explored the effectiveness of various RL algorithms such as REINFORCE and DDPG.
- F(O)MG: Few-shOt Music Generation
- A model that generates genre-specific music from a few training samples using a meta-learning approach. FOMG demonstrated faster training and better generation quality than the baseline auto-regression model.
- SASOX: A single address space operating system for x86
- A Unix-like single address space operating system for the x86 architecture, based on xv6. Evaluated against the baseline xv6 implementation, SASOX showed improved performance for multi-threaded programs.
- Graph Expanders and Their Applications
- Formally proved the equality of the complexity classes symmetric logspace (SL) and logspace (L) using the properties of s-t connectivity in expander graphs.