Beyond Data and Model Parallelism for Deep Neural Networks

Existing deep learning frameworks either use data or model parallelism as distribution strategy. However, these strategies often produce sub-optimal performance for many models. In data parallel distribution, the computation graph is partitioned on the data sample dimension. Model parallelism on the other hand partitions the parameters at a layer-wise granularity. In this paper, the authors proposes additional dimensions for graph partitioning for a more comprehensive search space for parallelization strategies. They propose FlexFlow, a DNN framework which can automatically distribute any DNN over a cluster given the computation graph and the cluster topology.

The key idea of this paper is called SOAP, which stands for Sample, Operation, Attributes and Parameters. Considering these additional dimensions while partitioning the workload can bring forth better distribution by reducing the overall communication cost. FlexFlow has 2 main components: simulator and optimizer. On a high level, FlexFlow initially picks a random partition strategy and calculates the execution time using the simulator. Then the optimizer performs Markov Chain Monte Carlo (MCMC) search on the search space by making small modifications on the current strategy. This process is repeated until the optimizer finds the best strategy or the time budget has exhausted (set to 30 min in the paper).Most of the previous works try to improve data and model parallelism. However, doing model parallelism required considerable manual effort. A recent work called REINFORCE used reinforcement learning techniques to find out the best device placement strategy for large models. This method only considers operation level partition. It also take considerable amount of time to find an optimal strategy. OptCNN (from the same team) performs partitioning on SAP dimensions but only for CNN based models. FlexFlow improves upon these methods by exploring broader search space for partitioning and also finding optimal strategy order of magnitudes faster.

FlexFlow defines Pi as the partition dimensions and Ci as the partition configuration for an operation Oi. Following figure shows example partition dimensions for an operator.

Parallelization configurations for 1D-convolution. Dashed lines show partitioning the tensor.

For example, in case of convolution layer, the operator can be partitioned along the samples (data parallelism), along the channels, height, width or length. The configuration Ci would define the number of partitions of each dimension as a tuple (sample=1, channel=1, height=2, width=2). In addition to the partition configuration, FlexFlow takes a device topology D = (DN,DE) as input. DN is a set of all devices and DE is all the connection between devices with the bandwidth of the connection. A mapping between partition configuration and the devices is called a partition strategy.

Given a partition strategy, FlexFlow generates a task graph as shown below.

Parallelization configuration for a matrix multiplication operation.

The task graph is generated by introducing task nodes corresponding to every partition in the configuration. The dependencies are assigned between partitions based on the data flow. If the task nodes are placed on different devices, FlexFlow creates a new task node representing data transfer operation. So the final task graph is a set of task nodes (either computation or communication tasks) interconnected by edges representing dependencies between them.

The execution simulator takes this task graph as an input and simulates the training process in order to calculate the overall execution time. The simulator measures the execution time for each individual computation tasks by running them on the real hardware for a few iterations and takes average of that. This value is cached and reused for subsequent simulations. The time taken for communication operations are calculated by dividing the data size by the bandwidth of the connections. This simulation models makes a lot of assumptions about the workload.

  • The execution time of the computation tasks are predictable and does not vary with respect to input
  • Communication tasks can fully utilize the available communication bandwidth
  • The runtime overhead is negligible and the tasks are executed in FIFO order.

The paper claims these assumptions are valid for most of the DNN models. Since the execution optimizer performs MCMC search algorithm on the strategy space by making modifications to the partition strategy for one operation at a time, the simulator only simulates the modified portion of the task graph as an optimization. This is called delta simulation.

Now the execution optimizer starts from an arbitrary strategy (either data parallelism or expert designed one) and runs MCMC search through the search space. For each strategy, it runs the simulator to find out if the search is following the right direction. The search is continued until the optimizer finds the best strategy or the time budget has finished (30 mins).

FlexFlow, is evaluated on a 4 node cluster each with 4 GPUs (P100 or K80) and interconnected by InfiniBand (100GBps or 56 GBps). Compared to baseline parallelization strategies like data parallel, model parallel, REINFORCE and OptCNN, FlexFlow achieves 1.2 – 3.8x speed up. The reduction in communication cost is around 2 – 5.5x. Compared to OptCNN which uses more or less similar ideas for distribution but on a smaller search space dimensions, FlexFlow improves the training performance by 1.2 – 1.6x. The simulator accuracy is between 70 – 100%.

Share me

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *